If you’re one of the 24 million or so people around the world who purchased a 3-D television in 2011—or the 42 million doing so this year—you might know what it takes to watch 3-D on your home television: a pair of 3-D glasses. But you likely haven’t thought much about what it takes to produce the 3-D spectacle that comes to life in your living room. More cameras, or at least more lenses, you might think, and that’s probably about it.
In fact, that’s not it—particularly when it comes to producing the kind of 3-D show that people watch more than anything else: a live sports broadcast. Making a 3-D program work well—especially when it’s happening live—is one of the greatest and most interesting technological challenges facing TV production at the moment.
The biggest difference introduced by 3-D involves the viewer’s relationship to a shot. With 2-D, the viewer sees what’s happening but feels separated from it. A 3-D image can blur the line between the “audience space” (where you are) and “scene space” (what the cameras see). Instead of looking through a window, you feel as though you’re standing on a sideline. That means that when a 100-kilogram athlete speeds toward you, you are likely to duck.
And producers want you to duck: The point of 3-D TV is to make you forget you’re in your living room. When it works well, it’s amazing, but any little mistake that breaks the illusion ends up being not just a minor annoyance but an extreme disruption that can literally give you a headache.
These days, most of us who produce 3-D sports usually do it well, but not always, and we’ve had to learn a lot in the past few years. I’m going to take you behind the scenes and show you what we’ve learned and what we have yet to figure out. So the next time you watch 3-D sports on TV, you’ll see what the producers of that broadcast do right, understand some of the choices they make during the broadcast, and maybe even spot a few errors.
I’m going to use American football, as played in the United States’ National Football League (NFL), as the archetypal sport for TV coverage. Other broadcast sports, like soccer and hockey, share many of the challenges of an American football broadcast, such as the shooting angle, the lighting issues, and the camera placement. Other challenges are unique to certain sports: For example, the all-white surface of an ice rink has no texture or color variation, both of which contribute to conveying a sense of depth.
First, let’s look at how a sports network covers a football game in 2-D.
In 2011, the NFL Network used either 26 or 27 cameras for its game coverage; most other networks covering football fielded about the same number. The operators of those cameras work independently for the most part, looking for interesting shots on their own. The instructions they do get from the director tend to come in the form of a general request: “Find me some facial expressions” or “Follow number 45.”
The directors take the feeds from these cameras and choose what shots to include in the broadcast, typically picking a new shot every 2 to 5 seconds. Directors usually start with wide shots to establish a context for the viewer and then make their way to tighter shots.
The production team mixes those camera feeds with replays, prerecorded footage, and computer-generated graphic elements, including the game clock, period, score, and game context (in the case of American football, the down, yards to go, and current ball position). Most broadcasts also include a branding “bug” of some type, such as the ESPN logo. Production teams also tap augmented reality tools, like a telestrator (which lets a commentator sketch over a moving image), a virtual first-down line, and virtual signage. And, of course, various canned transitions like wipes and fades separate different shots, graphics, and other elements that don’t easily flow together.
Such is the world of 2-D NFL broadcasting as we know it today. And audiences want it all. Skip the virtual first-down line or the virtual scoreboard and many viewers will simply write off the show as impossible to watch. Audiences are surprisingly inflexible and unforgiving.
Enter 3-D. At first glance, it seems as though you could simply take a 2-D sports broadcast and add depth. But it’s not that simple.
Take the most common 2-D editing technique, the fast cut. In 2-D, a fast cut simply changes what picture is in front of you, the viewer. You see a close-up, perhaps, then a wide-angle shot. But you don’t feel as if you’ve actually moved; you’re just looking at a different picture. In 3-D, a fast cut seems to relocate you to a completely different place. In one shot, you may feel as if you are looking across a wide expanse; in another, you may feel as if you’re close enough to touch the player. Everything about your perspective (whether the image appears to jut out from the screen, recede into it, or sit near the surface of the TV) changes with each cut, and that can take some time to adjust to. If you don’t have enough time, you feel disoriented. So directors of a 3-D broadcast tend to cut between shots less than a fifth as often as they do for 2-D.
The good news for directors is that they need about half as many camera positions. In 2-D, all those different camera angles make up for the inherent loss of excitement that comes from the loss of immersion. In 3-D, you have no need to compensate. It turns out, though, that producers still need the same number of cameras, because you need two cameras for each 3-D shot.
Some of the graphic enhancements sports viewers have grown accustomed to are hard to pull off in 3-D. Consider the virtual line of scrimmage in American football—the computer-created line marking the distance reached by the football in the previous play. Today’s broadcasts superimpose the line on the picture, making it look as though it’s been chalked right onto the grass of the playing field. It takes about 5 gigaflops to determine how to display it in a 2-D scene, but it takes a thousand times as many computations—about 5 teraflops—to display it in 3-D. That’s because painting a virtual line in 2-D requires simply finding the two end points of the line in the video image and then drawing the line between them in perspective. The system determines what is grass and what is player and then creates a line that covers the grass but falls behind the players. It doesn’t need to follow the curves of the grass precisely, because from the viewer’s perspective, the entire image is flattened, and as long as the line blocks the grass and not the players, it will appear painted on the grass. In 3-D, however, the image is not flattened, so you can’t just obscure the grass. The system must track the grass geometrically, following its every curve, or the line might seem to be floating above the ground where people could trip over it. That precise tracking takes hundreds of x-y coordinates, not just two.
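To make the 2-D case concrete, here is a minimal sketch in Python. It is not the code any broadcaster runs: the frame is a tiny grid of color pixels, the grass test is a toy color check, and the names `is_grass` and `draw_virtual_line` are made up for illustration. The line is painted only where the underlying pixel looks like grass, so a player standing on it appears to be in front of it. The 3-D case, which has to track the field’s geometry, is not attempted here.

```python
# Toy 2-D virtual-line keyer. Pixels are (r, g, b) tuples; all names are illustrative.

def is_grass(pixel):
    """Crude chroma test (an assumption): grass if green clearly dominates."""
    r, g, b = pixel
    return g > 100 and g > r + 40 and g > b + 40

def draw_virtual_line(frame, start, end, color=(255, 255, 0)):
    """Paint a line from start to end, given as (x, y), but only over grass pixels."""
    (x0, y0), (x1, y1) = start, end
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    out = [row[:] for row in frame]           # copy so the input frame is untouched
    for i in range(steps + 1):
        t = i / steps
        x = round(x0 + t * (x1 - x0))
        y = round(y0 + t * (y1 - y0))
        if is_grass(out[y][x]):
            out[y][x] = color                 # grass: the line is visible here
        # anything that is not grass (a player, say) hides the line
    return out

GRASS, PLAYER = (30, 160, 40), (200, 30, 30)
frame = [[GRASS] * 5 for _ in range(5)]
frame[2][2] = PLAYER                          # a player standing on the line
result = draw_virtual_line(frame, (2, 0), (2, 4))
```

In this example the line runs down the middle column, and the one pixel occupied by the player is left alone.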
Just as we need fewer camera positions, we don’t need all the fast cuts or as many graphics, because in the 2-D world they exist to compensate for the inherent loss of visual richness and energy compared with watching an event live. The 3-D coverage itself provides much of that extra visual richness and energy. Admittedly, sometimes we cut back on complexity simply because we have yet to create good 3-D equivalents of 2-D techniques. In most cases, however, it’s because we simply don’t need them.
Giving up on fast cuts and trimming down the graphics aren’t the only changes producers are making in switching to 3-D. In general, camera operators shoot 2-D sports from high up in the stadium, but 3-D is more compelling when shot from the ground, simulating an on-site spectator’s point of view. At that level, it’s hard to find places to put cameras, and stationary objects as well as moving people tend to block the view. But whereas a 2-D camera views a scene shot from above as a simple rectangle, with a height and width that don’t change when the camera pans, in 3-D the addition of depth means that the camera captures a prism. Even though the TV screen displays this prism as a rectangle, from the viewer’s perspective all three dimensions of the prism change relative lengths as the camera moves. Trying to mentally adjust to such changes disturbs and disorients viewers, and shooting from a lower angle minimizes these changes.
These adjustments—slower cuts, fewer graphics, and a lower viewing angle—are straightforward and relatively simple to do. But that’s not all it takes to make a 3-D broadcast work.
The images in a 3-D broadcast must be of higher technical quality than those in 2-D productions, where a bit of sloppiness won’t be noticed. For instance, if the color of the grass as it appears on the screen doesn’t quite match reality, viewers will overlook the defect because they have no handy reference for comparison. But in 3-D, every shot has a reference: the other eye. Color has to match. Focus has to match. Zoom, camera angle, signal quality, and graphics—every single thing that’s done has to be exactly right, which it often isn’t in 2-D. Take the 2012 Super Bowl (the NFL championship game). One camera had a smudge on the lens. Viewers noticed it, but it wasn’t a showstopper; in 3-D, that camera would have been unusable.
And it’s not just a matter of annoying the viewer. A bad 3-D shot can actually be painful. There are two types of technical mistakes that can make you physically ill. First, cameras can be misaligned—for example, the image sent to the left eye is 10 pixels higher than the image going to the right eye. When you’re watching a broadcast and this happens, one eye has to point upward more than the other, and in many cases, the strain that causes will give you a headache. Then there are mismatched depth cues. Depth perception comes from a number of different aspects of an image. Parallax cues, for example, let the brain extract depth information from the different viewing angle perceived by each eye. Occlusion cues let the brain calculate depth based on which objects overlap other objects. If these two kinds of cues don’t match—for example, if parallax cues place the ball in front of the screen while occlusion cues place it behind—you’ll tend to feel nauseated. In both cases, the egregiousness of the error determines the severity of the headache or nausea.
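The vertical misalignment described above can be measured, at least in principle, by comparing the two eye views row by row. The Python sketch below is one simple way to do that, assuming grayscale images stored as lists of rows. The function names and the brute-force search are my own illustration, not how any particular image processor works.

```python
# Estimate how many rows the right-eye image is shifted relative to the left.

def row_profile(image):
    """Mean brightness of each row of a grayscale image."""
    return [sum(row) / len(row) for row in image]

def estimate_vertical_offset(left, right, max_shift=10):
    """Return the shift s (in rows) for which right[y + s] best matches left[y]."""
    lp, rp = row_profile(left), row_profile(right)
    best_shift, best_err = 0, float("inf")
    for s in range(-max_shift, max_shift + 1):
        pairs = [(lp[y], rp[y + s]) for y in range(len(lp)) if 0 <= y + s < len(rp)]
        if len(pairs) < len(lp) // 2:
            continue                          # too little overlap to trust
        err = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
        if err < best_err:
            best_shift, best_err = s, err
    return best_shift

# Synthetic test scene: every row has a distinct brightness.
scene = [[(y * 37) % 101] * 8 for y in range(30)]
left = scene[5:25]
right = scene[2:22]                           # same content, 3 rows lower in the frame
offset = estimate_vertical_offset(left, right)
```

A real system would work on noisy video with genuine parallax between the views, so it would need something far more robust than this exhaustive search.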
Getting a perfect shot every time at a live sporting event is tough. What makes it even tougher are zoom lenses, used extensively in sports coverage today. A zoom lens is made up of a large number of moving glass elements. No two zoom lenses will match at every point in the zoom range, but 3-D requires that they come much closer to perfect than before. That’s why production teams carefully test lenses to find the best match.
Then there’s the question of where to put graphics in 3-D space. In the 2-D world, most graphics live in an artificial space. You would say they’re “in front” of the other objects on the screen, but they aren’t really. They’re at the same depth, that is, flat; they simply block other objects in the way a sticker on a book cover blocks part of the photo.
Take something as simple as a scoreboard. In a 2-D broadcast, producers generally place the graphic score block in a consistent position on the screen—for example, the upper third, regardless of the underlying content. It just stays there, blocking part of every scene. The same technique does not work in 3-D; everything must be placed at some depth. As a result, producers have to consider something we call the depth budget.
To view things moving around in the third dimension, your eyes have to look straight ahead for some objects and cross for others; this causes fatigue. Situations that force the eyes to diverge are even more uncomfortable, so producers try to avoid those altogether. Producers use depth budgets to limit the eye motion to a comfortable range. What is comfortable is subjective and can evolve; that’s why most 3-D shows use less depth in the beginning and add more later after the audiences’ eyes have “warmed up.” Viewers can’t tolerate shots with lots of depth disparity for long stretches of time. Good 3-D shows therefore include “rest stops,” shots with little depth to give the viewer’s eyes a rest.
To keep viewers from straining their eyes, producers of 3-D content must consider the limit to the depth that a viewer can comfortably perceive in a 3-D television scene. This limit is set by several factors, but the most important are the size of the TV screen and the viewing distance. For objects in the foreground, the closer an object appears to a viewer, the greater the difference in the view presented to the left and right eye. This difference is measured as a linear distance—the physical difference between the left- and right-eye views—and that distance is in turn expressed as a percentage of the width of the television screen. Television producers consider a reasonable limit for the living room, where the typical viewing distance is about 1.8 meters, to be about 4 percent of the screen width for objects perceived as being very close to the viewer and about 2 percent of the screen width for objects receding into the background. For foreground objects on a 55-inch diagonal screen, which is about 1.2 meters wide, that translates to about 5 centimeters between the pixels seen by each eye.
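The arithmetic behind those figures is easy to check. The short Python sketch below assumes a 16:9 screen and, for the pixel figures, a picture 1920 pixels wide; neither assumption comes from the limits themselves, and the function names are only illustrative.

```python
import math

def screen_width_m(diagonal_in, aspect=(16, 9)):
    """Width in meters of a screen with the given diagonal (inches) and aspect ratio."""
    w, h = aspect
    return diagonal_in * 0.0254 * w / math.hypot(w, h)

def disparity_limits(diagonal_in, near_pct=4.0, far_pct=2.0, h_pixels=1920):
    """Convert percent-of-screen-width limits into centimeters and pixels."""
    width_cm = screen_width_m(diagonal_in) * 100
    return {
        "near_cm": width_cm * near_pct / 100,   # objects in front of the screen
        "far_cm": width_cm * far_pct / 100,     # objects behind the screen
        "near_px": h_pixels * near_pct / 100,
        "far_px": h_pixels * far_pct / 100,
    }

limits = disparity_limits(55)
```

For a 55-inch screen this gives a width of about 1.22 meters and a foreground limit of about 4.9 centimeters, in line with the rounded numbers above. At an assumed 1920 pixels across, that is roughly 77 pixels of separation.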
Even within the depth limits, it’s tough for viewers to focus on near and far objects simultaneously. Here the depth budget comes into play: When you have something deep in the background, as you would in the kickoff of a football game, you are limited in how far foreground objects extend toward the viewer. So in scenes like these, directors or camera operators must frame shots without objects in the extreme foreground.
Now let’s get back to the issue of graphics. Unlike a 2-D image, where the directors can place graphics on the top or bottom of the screen and rarely block important action, a 3-D image has no position where the graphic is guaranteed not to collide with the scene. While the safest depth is in front of the screen, the farther in front of the screen an object appears, the more of the depth budget it uses.
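One way to picture that trade-off is as a small placement rule: put the graphic just in front of the nearest thing in the scene, and refuse if doing so would exceed the foreground limit. The Python sketch below assumes disparities expressed as a percentage of screen width, with negative values meaning “in front of the screen.” That sign convention, the margin, and the function name are all my own choices for illustration.

```python
NEAR_LIMIT = -4.0   # percent of screen width, in front of the screen (limit from the text)
FAR_LIMIT = 2.0     # percent of screen width, behind the screen (limit from the text)

def place_graphic(scene_disparities, margin=0.25):
    """Choose a disparity for a graphic just in front of the nearest scene object.

    Returns None when no comfortable position exists, that is, when the
    graphic would have to sit closer to the viewer than NEAR_LIMIT allows.
    """
    nearest = min(scene_disparities)
    candidate = min(nearest, 0.0) - margin    # never place it behind the screen plane
    if candidate < NEAR_LIMIT:
        return None
    return candidate

wide_shot = place_graphic([1.5, 0.2, -1.0])    # nearest object slightly in front
tight_shot = place_graphic([1.8, -3.9])        # a player almost at the near limit
```

In the wide shot the graphic lands at -1.25 percent. In the tight shot there is no room left in front of the player, so the function returns `None` and a producer would have to move or drop the graphic.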
Another technique 2-D producers use for graphic overlays is partial transparency: Making graphics see-through helps integrate them into the live scene. In 3-D, this just doesn’t work. Focus, intensity, brightness, and color all provide depth information to the viewer. Partial transparency modifies each of these elements, which can break the 3-D illusion.
Maintaining the illusion of 3-D isn’t tricky just when producers are mixing graphics into a scene. They have to manage the live footage carefully as well.
If a football player is, say, in front of the convergence point of the two eye views being recorded by the cameras on the field, he should appear to be in front of the screen. And if he’s in the middle of the shot, indeed he will. But if he’s at the edge of the scene for you at home, he will appear to touch the edge of the TV itself. If that happens, your mind will no longer allow you to see him in front of the screen. If he’s touching the side of the TV, then you will see him aligned in depth with the TV frame, breaking the illusion of 3-D and startling you.
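Spotting this kind of edge problem is a matter of bookkeeping once you know where each object is and how far in front of the screen it sits. The Python sketch below assumes each tracked object comes with a horizontal extent in pixels and a disparity (negative meaning in front of the screen). The `TrackedObject` class, the 1920-pixel frame, and the 8-pixel border are hypothetical choices, not a description of any real tracker.

```python
from dataclasses import dataclass

@dataclass
class TrackedObject:
    name: str
    x_min: int            # leftmost pixel column the object covers
    x_max: int            # rightmost pixel column the object covers
    disparity_pct: float  # negative = in front of the screen (assumed convention)

def edge_violations(objects, frame_width=1920, border=8):
    """Names of objects that float in front of the screen while touching a side edge."""
    flagged = []
    for obj in objects:
        in_front = obj.disparity_pct < 0
        touches_edge = obj.x_min <= border or obj.x_max >= frame_width - 1 - border
        if in_front and touches_edge:
            flagged.append(obj.name)
    return flagged

shot = [
    TrackedObject("receiver", 0, 150, -1.5),      # in front and cut off by the left edge
    TrackedObject("kicker", 900, 1000, -2.0),     # in front but mid-frame: fine
    TrackedObject("crowd", 1800, 1919, 1.0),      # at the edge but behind the screen: fine
]
problems = edge_violations(shot)
```

Only the receiver is flagged here, since he is the only object that is both in front of the screen and touching the frame.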
Alignment, managing the depth budget, watching out for edge problems, and placing graphics carefully are basic 3-D production issues. They are essentially mathematical problems that can be handled by software, like that developed by my company, 3ality Technica. People carefully monitoring and adjusting the graphics and camera feeds can make these adjustments manually. That’s tough to do in live television, however, so when not using automated tools, directors typically cut away from problematic shots as quickly as they possibly can.
But not all the problems of 3-D are mathematical. It gets much more complicated when producers combine images from many sources. Remember, each shot has a different trapezoidal 3-D geometry, and each one uses its depth budget differently. A commonly used composition in football coverage is a head-to-chest shot of two announcers, followed by a cut to a high wide shot of the entire field. Each of these individual shots can easily be done well in 3-D. However, the former generally does not have much spatial volume, while the latter feels very large. In the real world, such a transition doesn’t happen instantly. You may be talking to someone next to you, but you don’t close your eyes, spin, and open them to see the Grand Canyon. If you did, you might experience just what you would if you saw such a fast transition on a 3-D screen—instant vertigo.
So the more different the geometries of two shots are, the longer the transition between them must be. For example, a producer might insert a midrange shot between that shot of two announcers and the view of the field. Or a computer system can mathematically analyze the two geometries, find (in essence) a common denominator, and then re-create each shot in the new geometry. As you might imagine, performing these operations live is computationally expensive. However, not doing so results in hard-to-watch 3-D.
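As a rough illustration of the first option, the Python sketch below decides how many bridging shots to insert between two shots. It reduces each shot’s geometry to a single pair of numbers, the nearest and farthest disparity as a percentage of screen width, which is a drastic simplification. The 1.5-point step limit and the function names are invented for the example.

```python
import math

def depth_volume(shot):
    """Depth range used by a shot, given as (near, far) disparity in percent of width."""
    near, far = shot
    return far - near

def bridge_shots(shot_a, shot_b, max_step=1.5):
    """Intermediate (near, far) ranges so no single cut changes depth volume by more than max_step."""
    change = abs(depth_volume(shot_b) - depth_volume(shot_a))
    steps = max(1, math.ceil(change / max_step))
    return [
        tuple(a + (b - a) * i / steps for a, b in zip(shot_a, shot_b))
        for i in range(1, steps)
    ]

announcers = (-0.5, 0.5)     # shallow head-to-chest shot (illustrative numbers)
wide_field = (-1.0, 2.0)     # high wide shot of the whole field (illustrative numbers)
bridges = bridge_shots(announcers, wide_field)
```

With these made-up numbers the function asks for a single midrange shot spanning -0.75 to 1.25 percent. Two shots with matching depth ranges need no bridge at all.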
Some 3-D productions start off with a “we can fix it in post” mentality—that is, in postproduction, after the cameras capture the footage and send it to the editors. Fixing problems during production hasn’t always been an option in live sports coverage, but today we do have automated systems that can fix some things nearly in real time. For example, most camera systems can’t be aligned dynamically in the field, so getting the two stereoscopic images from a camera in perfect alignment doesn’t always happen. But an image processor can quickly and automatically correct the most common alignment problems, like vertical shift, focus mismatch, or color mismatch. In the 2-D world, production teams do little postprocessing beyond color correction.
Unfortunately, although automated systems can correct a little vertical alignment mismatch without degrading the image quality much, correcting a focus mismatch is a bigger problem. In general, when matching two images, systems usually defocus the sharper image to match the softer image. You might think that with a stereoscopic camera—in which two almost identical images of the same scene are captured at the same time—information from the sharper image could be used to sharpen the softer image. However, it is precisely the subtle differences between the two images that create the sense of depth. Matching the two images in the wrong way destroys the depth information. With 3-D camera systems, it is far more effective in terms of cost, quality, and time to perfectly match cameras from the beginning and strive for perfect images.
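The “defocus the sharper image” approach can be sketched in a few lines of Python. The version below measures sharpness as the mean squared difference between neighboring pixels and applies a small horizontal blur to the sharper view until it is no sharper than the other. Both the focus measure and the blur are my simplifications for illustration; real processors use better ones.

```python
def sharpness(image):
    """Mean squared difference between horizontal neighbors (a crude focus measure)."""
    diffs = [(row[x + 1] - row[x]) ** 2 for row in image for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def box_blur(image):
    """One pass of a 3-tap horizontal box blur, with edge pixels clamped."""
    out = []
    for row in image:
        n = len(row)
        out.append([(row[max(x - 1, 0)] + row[x] + row[min(x + 1, n - 1)]) / 3
                    for x in range(n)])
    return out

def match_focus(left, right, max_passes=20):
    """Soften whichever view is sharper until it is no sharper than the other."""
    left_is_sharper = sharpness(left) > sharpness(right)
    sharp, soft = (left, right) if left_is_sharper else (right, left)
    target = sharpness(soft)
    for _ in range(max_passes):
        if sharpness(sharp) <= target:
            break
        sharp = box_blur(sharp)
    return (sharp, soft) if left_is_sharper else (soft, sharp)

edge = [[0, 0, 0, 0, 90, 90, 90, 90] for _ in range(4)]   # a crisp vertical edge
soft_view = box_blur(box_blur(edge))                       # the same edge, out of focus
new_left, new_right = match_focus(edge, soft_view)
```

Note what this does: the result matches, but only because the good image has been made worse, which is exactly why matching the cameras up front is the better strategy.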
In the early days of 3-D, producers threw people at all these technical problems. People at the cameras and in the control rooms managed the depth budget, alignment issues, graphics placement, and all the other complications of the 3-D broadcast world. These days, however, they’re replacing those people with automated systems. Such systems, though, also need more than just the video and audio streams that flow from the cameras to the control room and between the different systems in the control room. They need vast amounts of metadata, including time code, camera name, focus distance, relative vertical position, relative rotation, and color information. That’s a lot. The most advanced 3-D production systems available today allow for 256 channels of metadata. And, unlike video streams, metadata isn’t flowing in just one direction; the various production systems need to talk to one another.
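To give a feel for what one of those metadata records might look like, here is a hypothetical Python sketch. The field list follows the items named above, but the field names, units, JSON encoding, and the idea of counting one field per channel are assumptions made for the example, not the format of any actual production system.

```python
import json
from dataclasses import dataclass, asdict

MAX_CHANNELS = 256   # channel count cited in the text

@dataclass
class RigMetadata:
    time_code: str                # e.g. "01:23:45:12" (hours:minutes:seconds:frames)
    camera_name: str
    focus_distance_m: float
    relative_vertical_px: float   # vertical offset between the two eye views
    relative_rotation_deg: float  # rotation of one camera relative to the other
    color_gain_rgb: tuple         # per-channel gain used to match colors

def encode(meta):
    """Serialize a record to bytes so it can travel over a packet network."""
    payload = asdict(meta)
    if len(payload) > MAX_CHANNELS:
        raise ValueError("too many metadata channels")
    return json.dumps(payload).encode("utf-8")

def decode(data):
    """Rebuild a record from bytes produced by encode()."""
    fields = json.loads(data.decode("utf-8"))
    fields["color_gain_rgb"] = tuple(fields["color_gain_rgb"])  # JSON has no tuples
    return RigMetadata(**fields)

sample = RigMetadata("01:23:45:12", "CAM-07", 32.5, -1.5, 0.02, (1.0, 0.98, 1.01))
round_trip = decode(encode(sample))
```

Because any system can both encode and decode such a record, the same structure works in either direction, which is the point of the two-way communication described above.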
This requirement for two-way communications is pushing the media production world toward the use of complex networks that route packet-switched data. This change also means production teams need a lot more computing power. Right now, a pair of cameras on the field requires about 2 teraflops of processing power in the control room. That will likely double in the next year or so as computer processing replaces mechanical correction by operators. Then it’ll double again in the following two years as automated processing capabilities increase.
Network bandwidth will also need to grow. Right now, each 3-D pair of cameras sends its video and audio streams and metadata at 3 gigabits per second. That will soon double because producers are moving to shooting sports broadcasts in a progressively scanned format that records a full 60 frames per second (50 in countries that follow the European standard). Today, producers use an interlaced format that records only half the frame (every other line) at that rate. Processing needs will double again as more metadata accompanies the video streams. Today’s cameras typically have 2K resolution—a bit less than 2048 horizontal pixels—but down the line, cameras with 4K resolution and even 8K resolution will push demands for more bandwidth even further.
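The doubling from interlaced to progressive scanning can be checked with back-of-the-envelope arithmetic. The Python sketch below counts only the active picture and assumes 1920-by-1080 images sampled at 20 bits per pixel (10-bit 4:2:2), which are my assumptions rather than figures from any particular rig. It ignores audio, metadata, and blanking, so its totals come out below the link rate quoted above.

```python
def active_video_gbps(width, height, images_per_s, bits_per_pixel=20,
                      interlaced=False, cameras=2):
    """Active-picture data rate in gigabits per second for a set of cameras.

    With interlaced=True, each image is a field holding every other line.
    """
    lines_per_image = height // 2 if interlaced else height
    bits_per_s = width * lines_per_image * bits_per_pixel * images_per_s * cameras
    return bits_per_s / 1e9

interlaced_pair = active_video_gbps(1920, 1080, 60, interlaced=True)
progressive_pair = active_video_gbps(1920, 1080, 60, interlaced=False)
```

Under these assumptions a stereo pair carries about 2.5 gigabits per second of picture data when interlaced and about 5 when progressive, which is the factor of two the move to progressive scanning implies.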
Producing live 3-D broadcasts of sporting events has its challenges. But with the right equipment, knowledge, and experience, it should be neither more difficult nor more expensive to produce a 3-D show than a 2-D show. Over time, the equipment will get smaller, cheaper, and easier to use. A few years ago, a general-purpose 3-D camera system with two cameras and two lenses and some onboard signal processing capabilities weighed hundreds of kilograms and cost hundreds of thousands of dollars. Today, such a system weighs 15 kg and costs about US $100 000. In less than five years, an equally capable system will weigh less than 2.5 kg and cost a few thousand dollars.
The final, biggest challenge is not technical but psychological. The 3-D experience often isn’t what the audience expects. At 3ality Technica, one of the comments we heard most often from audience members after screenings of the concert movie U2 3-D was that they “didn’t know what to do.” Said one viewer, “I went to see a movie and ended up at a concert, but I couldn’t get up and dance or make noise.”
Sports broadcasts in 3-D face a similar problem: They make the remote viewing experience a lot more like the in-person viewing experience, which can seem odd to the viewer. The trick is to make the expectations, the viewing environment, and the content all support one another. When this happens, watching 3-D sports can be uniquely compelling. When it doesn’t, the experience can be just plain weird.
Broadcasters have done some experiments. Britain’s Sky Sports has had a lot of success showing soccer in pubs, in no small part because of the communal experience. That’s surprising, given that all viewers wear 3-D glasses, which would ordinarily make them feel like they’re each in a private bubble. To date, the best audience responses I have seen have been where venues created a hybrid environment by mixing in some aspects from the live experience with those of the “standard” remote viewing experience. At some remote football viewing venues, for example, seats are spaced more like those in a stadium, there is much more ambient light than you would expect in a theater, there are live cheerleaders to engage with the live audience—and of course, you can buy beer.
We can’t practically put live cheerleaders and bartenders in every living room. But we can figure out how to make the 3-D broadcast version of a live game give you everything you’ve learned to expect from 2-D broadcasts. The real challenge is, as it has always been, to engage the audience. In that sense, the challenge never changes.
About the Author
Howard Postley is the chief technology officer of 3ality Technica, a company that designs products for 3-D sports broadcasting. This 35-year veteran of the computer, communication, and media industries is an avid beach volleyball player and sailboat racer, but he doesn’t play American football—and he doubts any football fans would want to watch him in 3-D if he tried it.