Standards Conversion Part 1

In the post “Video Frames – Introduction”, we looked at the two dominant frame rates used throughout the world today – 29.97 and 25 frames per second. Both rates were defined by engineers in the USA and Europe at the dawn of television in the 1930s, and the need to maintain backwards compatibility has ensured they are still dominant today.

If we were re-designing television without the need for backwards compatibility, we would do things very differently. However, we are saddled with 29.97 and 25, and despite the many opportunities to change the frame rates, for reasons beyond engineering logic we are stuck with them for the foreseeable future.

To enable European viewers to watch American programming, and vice versa, each format must be converted to the one the viewer’s television works in. There are some multi-standard televisions available that can detect and switch between 29.97 and 25, but mainstream televisions work in only one format. Broadcast workflows generally work in one format too, and set-top boxes suffer the same restriction.

Image size is also different between the two formats: America uses a system based on 525 lines, and Europe 625 lines. The horizontal line periods differ too – America uses 63.6 microseconds, and Europe 64 microseconds. The differences may be subtle, but they have huge effects and ramifications for conversion.

Horizontal and vertical scaling is a relatively straightforward process for modern computers, and up- and down-conversion algorithms are easily designed and implemented.
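As a rough illustration, the sketch below resizes a frame using simple bilinear interpolation (NumPy assumed, greyscale frame for brevity). Real converters use higher-quality multi-tap polyphase filters, but the principle – mapping each output pixel back to a fractional source position – is the same.

```python
import numpy as np

def bilinear_resize(frame: np.ndarray, new_h: int, new_w: int) -> np.ndarray:
    """Resize a greyscale frame with simple bilinear interpolation."""
    h, w = frame.shape
    # Map each output pixel back to a fractional source coordinate.
    ys = np.linspace(0, h - 1, new_h)
    xs = np.linspace(0, w - 1, new_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = frame[np.ix_(y0, x0)] * (1 - wx) + frame[np.ix_(y0, x1)] * wx
    bottom = frame[np.ix_(y1, x0)] * (1 - wx) + frame[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bottom * wy

# Example: stretch a 480-line picture to 576 lines.
frame = np.random.rand(480, 720)
scaled = bilinear_resize(frame, 576, 720)
```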

Converting between frame, and hence field, rates is an enormous task, and the quality of conversion varies depending on the vendor used. The main challenge to overcome is that there is no easy integer relationship between the two rates: 29.97 is really 30000/1001 frames per second, so the two rasters only come back into step every 40.04 seconds – 1200 US frames against 1001 European frames.
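That arithmetic is easy to check with Python’s exact fractions:

```python
import math
from fractions import Fraction

ntsc = Fraction(30000, 1001)  # the exact rate behind "29.97"
pal = Fraction(25)

def lcm_fraction(p: Fraction, q: Fraction) -> Fraction:
    # lcm(a/b, c/d) = lcm(a, c) / gcd(b, d)
    return Fraction(math.lcm(p.numerator, q.numerator),
                    math.gcd(p.denominator, q.denominator))

# The rasters realign after the least common multiple of the frame periods.
common = lcm_fraction(1 / ntsc, 1 / pal)
print(float(common))   # 40.04 seconds
print(common * ntsc)   # 1200 US frames
print(common * pal)    # 1001 European frames
```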

If the images had no motion, the conversion would be easier. The challenges start to emerge when we see motion in the broadcast. Our human visual system is very good at detecting jitter and discontinuities in motion – a throwback to our ancestors, who needed to detect predators lurking in the dark. Consequently, when converting between rates we must determine the motion and content between frame samples.

Motion estimation is one solution to determining the temporal position and content of frames. Algorithms analyze successive images to identify the motion of objects, and then determine where each object should lie in the frames of the format being converted to. Intuitively, the more frames we analyze, the more accurate our motion estimation, and hence the construction of the new frame.
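A minimal sketch of the core idea is full-search block matching: for each block of the current frame, find the displacement in the previous frame that minimizes the sum of absolute differences (SAD). Production converters use far more sophisticated hierarchical and phase-correlation methods, but the principle is the same.

```python
import numpy as np

def best_motion_vector(prev: np.ndarray, curr: np.ndarray,
                       y: int, x: int, block: int = 16, search: int = 8):
    """Find the (dy, dx) displacement that best explains where the block
    at (y, x) in `curr` came from in `prev`, by minimizing the SAD.
    Frames are float greyscale arrays, avoiding unsigned wraparound."""
    ref = curr[y:y + block, x:x + block]
    best_sad, best_vec = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + block > prev.shape[0] \
                    or xx + block > prev.shape[1]:
                continue  # candidate block falls outside the frame
            sad = np.abs(prev[yy:yy + block, xx:xx + block] - ref).sum()
            if sad < best_sad:
                best_sad, best_vec = sad, (dy, dx)
    return best_vec

prev = np.random.rand(64, 64)
curr = np.roll(prev, (2, -3), axis=(0, 1))     # picture moves 2 lines down, 3 pixels left
print(best_motion_vector(prev, curr, 24, 24))  # (-2, 3): where the block came from
```

Given a vector for every block, a frame at the target rate can be built by shifting picture content part-way along each vector – motion-compensated interpolation.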

Real challenges in computing power and input/output bandwidth start to emerge as we increase the number of frames being analyzed. At high definition rates, a video frame could easily be 60 Mbits, and one second would be 1.8 Gbits. Processing many different images in real-time causes havoc with computing systems, not to mention the latency and potential lip-sync errors introduced.
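Those figures are easy to reproduce with back-of-envelope arithmetic. The sketch below assumes 10-bit 4:4:4 sampling of the active 1080-line picture (30 bits per pixel), which lands close to the numbers quoted; 4:2:2 sampling roughly halves the chroma data.

```python
def uncompressed_rate(width: int, height: int, bits_per_pixel: int, fps: float):
    """Back-of-envelope uncompressed rates for the active picture only."""
    frame_bits = width * height * bits_per_pixel
    return frame_bits, frame_bits * fps

# 1920x1080 sampled 4:4:4 at 10 bits -> 30 bits per pixel
frame_bits, rate = uncompressed_rate(1920, 1080, 30, 29.97)
print(f"{frame_bits / 1e6:.1f} Mbits per frame")  # ~62 Mbits
print(f"{rate / 1e9:.2f} Gbits per second")       # ~1.86 Gbits/s
```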

Converting between the two dominant frame rates of 29.97 and 25 frames per second is a difficult and complicated task, the success of which is usually related to the cost of the solution.

Timing – Introduction

Timing in media systems is critical and needed to keep pictures stable and sound glitch free.

The Human Visual System (HVS) describes the interaction of the eye and brain that provides our experience of vision, and a similar interaction between the ear and brain provides hearing.

Common to video and audio is the need to convert digital signals to analogue so we can hear and see them. A television screen is an analogue device that provides images for the eye, and a loudspeaker is an analogue device that provides sound for our ears.

Humans are very good at detecting discontinuities in the audio-visual domain. For example, a small noise in the dead of night can easily wake us, and a strange movement in our peripheral vision alerts us to a potential danger. These are all throwbacks to our caveman days, when danger lurked in the shadows and predators would consider us a healthy meal.

Analogue video has timing signals built into it in the form of frame, field and line syncs, and the color subcarrier frequency. Sync information was originally required to keep the electron beam of the cathode ray tube television synchronous with the camera.

As we moved to uncompressed digital, the sync information still existed but found other uses. Digitized audio was often embedded in the redundant sync areas of the picture, allowing audio to be sent on the same cable as the video – a system widely used by broadcasters.

Only when we moved to compressed digital feeds was synchronizing information fully optimized. Some synchronization information still had to be carried, but the amount of data dedicated to it was significantly reduced.

Pre-IP systems used dedicated cables with guaranteed bandwidth, so timing and latency weren’t an issue. However, when broadcasting moved to IP, new challenges emerged.

By transferring video and audio over IP we are essentially sending a synchronous system over an asynchronous network. Timing information broadcasters had taken for granted was removed, and new ways of synchronizing video frames and audio samples had to be found.

Furthermore, the timing relationship between video and audio must be maintained to keep lip-sync – the phenomenon where spoken words can be heard before or after the lips move. If lip-sync errors of more than three or four frames of video occur, viewers find it difficult to watch the program.

The Real-time Transport Protocol (RTP) was originally used to provide basic synchronization, but this was insufficient and new methods were developed.

The Society of Motion Picture and Television Engineers (SMPTE), responsible for creating many broadcast video, audio and metadata specifications, developed the SMPTE 2022 specification. In effect this merely packetized the digitized video and audio channels; although it worked, it was inefficient, as the redundant line, field and frame sync pulses can be represented far more compactly in digital systems.

SMPTE 2110 provides true IP packetization for video, audio and metadata. To guarantee timing relationships are maintained, 2110 uses the Precision Time Protocol (PTP, IEEE 1588). Each video frame and group of audio samples is given a unique time value derived from the sender’s master clock. Any system receiving the IP packets can re-align and synchronize them to provide a time-perfect representation of the original audio or video, displaying stable pictures, glitch-free audio and error-free lip-sync.
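As a simplified sketch of the mechanism: a 2110 sender derives each RTP timestamp from PTP time, using a 90 kHz media clock for video and typically 48 kHz for audio, both counted from the same PTP epoch (details such as leap-second offsets are ignored here).

```python
VIDEO_CLOCK = 90_000   # Hz, RTP media clock used for 2110 video
AUDIO_CLOCK = 48_000   # Hz, RTP media clock typically used for 2110 audio

def rtp_timestamp(ptp_time_ns: int, media_clock_hz: int) -> int:
    """Derive the 32-bit RTP timestamp from PTP time in nanoseconds
    since the PTP epoch: count media clock ticks from the epoch and
    keep the count modulo 2**32."""
    ticks = ptp_time_ns * media_clock_hz // 1_000_000_000
    return ticks % (1 << 32)

# A video frame and its accompanying audio stamped from the same PTP
# instant can be re-aligned at the receiver, preserving lip-sync.
now_ns = 1_700_000_000_000_000_000  # example PTP time
print(rtp_timestamp(now_ns, VIDEO_CLOCK))
print(rtp_timestamp(now_ns, AUDIO_CLOCK))
```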

Video Lines – Introduction

There are no moving pictures in television, just individual frames played back at sufficient speed to give the perception of motion. The post “Video Frames – Introduction” demonstrated how video is divided into individual frames and fields, and the frequencies used; in this post we will look at how and why fields are split into lines.

Two receptors dominate our vision: rods and cones. Each eye contains approximately six to seven million cones and around 120 million rods.

Cones are clustered around the center of the retina, have good acuity and can see color. Rods, greater in number and dispersed around the periphery, are more sensitive to low-level light but only detect black-and-white images.

Cones are of greater interest to us – viewers normally look straight at the color viewing screen and are therefore using the center of the retina, which excites the cones more and allows us to see greater detail and color.

Rods cannot be ignored – they can detect flicker if the refresh rate of the television or monitor is not high enough, hence the reason we use interlace to increase the field rate.

Tests have shown that, when standing 20 feet (6.1 meters) from a chart, the average human eye is just about able to resolve two lines drawn 1/16 inch (1.6 mm) apart. This varies enormously from person to person due to the variance in our eyesight, but the value has been found to be a good average.
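That figure is the classic one-arc-minute acuity limit in disguise, as a quick calculation shows:

```python
import math

distance_mm = 20 * 12 * 25.4  # 20 feet in millimeters (~6096 mm)
separation_mm = 25.4 / 16     # 1/16 inch in millimeters (~1.59 mm)

# Angle subtended at the eye by the two lines, in minutes of arc
# (small-angle approximation).
angle_rad = separation_mm / distance_mm
angle_arcmin = math.degrees(angle_rad) * 60
print(f"{angle_arcmin:.2f} arc-minutes")  # ~0.90, close to the textbook
                                          # one-arc-minute acuity figure
```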

Before digital monitors and televisions, cathode ray tubes were used to display images. They were heavy, cumbersome glass devices, with magnetic coils around them to deflect the electron beam and trace out horizontal lines. Each line was scanned below the previous one, building the image as the scan moved down the screen.

Early broadcasts used 50, 100 or 405 lines, but the standard definition systems of the 1950s and 1960s provided 525 lines for the USA and 625 for the UK and Europe.

Modern LCD, plasma and LED screens use a matrix pattern of pixels, but these are still based on the horizontal line method developed for CRTs.

Cameras operate in an analogous way to televisions, but in reverse. During the analogue CRT era, cameras also used tubes, but their faceplates were light sensitive. An electron beam scanned across the inside of the faceplate, and the current that flowed was proportional to the light falling on each point; over a complete frame this produced the video signal. Electromagnetic coils around the tube provided the beam scan, and a picture of 525 or 625 lines was created.

Modern cameras, using CMOS and CCD image sensors, employ the same matrix pattern as LCD screens, and are still based on the line system of the tube cameras.

Increasing the number of lines increases resolution, but only up to a point: if the number of lines is increased too far, the eye can no longer resolve the extra detail. However, if we increase the physical size of the television, or move closer to it, we will see the increased resolution again. The same theory applies to modern LCD, plasma and LED screens, as the calculation below shows.
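Using the one-arc-minute figure from earlier, we can estimate the distance at which a screen’s line structure just disappears – a rough sketch, assuming a hypothetical one-meter-tall 1080-line picture:

```python
import math

lines = 1080
screen_height_m = 1.0  # hypothetical screen height
line_pitch_m = screen_height_m / lines

# Distance at which one line pitch subtends one minute of arc.
one_arcmin_rad = math.radians(1 / 60)
distance_m = line_pitch_m / math.tan(one_arcmin_rad)
print(f"{distance_m:.1f} m")  # ~3.2 m
```

Sit nearer than this and the line structure (or the benefit of extra lines) becomes visible; sit further away and the additional resolution is wasted.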

As technology moves to UHD 4K and 8K we will need bigger screens, and to take full advantage of their resolution we will have to sit closer to them in our homes. But not too close, as the increased brightness of a large screen can strain our eyes.

Video Frames – Introduction

Frames are fundamental to the operation of video – there are no moving pictures in television, just individual frames that are played back with sufficient speed to give the perception of motion.

The frequency at which frames are played is fixed. In the UK and Europe the playback rate is 25 frames per second, and in the USA and countries using the NTSC system it’s 29.97 frames per second.

Video has its roots in film, and in the early days of the silver screen, researchers discovered that 24 frames per second was the slowest a film could be played whilst maintaining fluidity of motion.

Engineers discovered that playing a film back at 24 frames per second caused a visually unacceptable flicker, and to fix this they flashed the projection bulb twice for every frame displayed, increasing the flash rate to 48 per second. The result was fluidity of motion without flicker.

Television engineers needed to replicate the film system, as playing video at 25 or 29.97 frames per second caused a similar unacceptable flicker. But doubling the frame rate would have doubled the frequency bandwidth required, resulting in fewer channels being broadcast, and the electronics of the time could not easily work at the higher frequencies.

Instead of doubling the frame rate, television engineers invented interlace – the picture rate is doubled, but each picture carries half the number of lines. A video line is similar to a row of pixels and is described in a later article. Each frame is split into two fields: field one represents the odd-numbered lines, and field two the even-numbered lines.
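In sampled terms the split is trivial – a toy sketch, assuming the frame is held as a NumPy array with one row per line (real 525- and 625-line systems add subtleties such as half-lines and field dominance):

```python
import numpy as np

def split_into_fields(frame: np.ndarray):
    """Split a frame into its two interlaced fields: field one carries
    the odd-numbered picture lines, field two the even-numbered lines
    (1-based line numbering, as broadcast convention uses)."""
    field1 = frame[0::2, :]  # lines 1, 3, 5, ...
    field2 = frame[1::2, :]  # lines 2, 4, 6, ...
    return field1, field2

frame = np.arange(12).reshape(6, 2)  # toy six-line 'frame'
f1, f2 = split_into_fields(frame)    # two three-line fields
```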

When the fields are played back in a television set they are interleaved, and by averaging over two fields we see one complete frame. Consequently, the field rate in the UK and Europe is 50 fields per second, or 25 frames per second, and the field rate in the USA is 59.94 fields per second, or 29.97 frames per second.

The field rates were originally chosen so that camera scanning systems could be synchronized to the AC mains frequency. Before color was invented, the USA used 60 fields per second and 30 frames per second; the advent of color caused an interference which manifested itself as flicker, a problem described in a later article. Without synchronizing the cameras to the mains frequency, a strobing interference between the studio lights and the video output of the camera could be seen.

Modern formats use a system called progressive, represented as “P”, as in “1080P” or “720P”; this removes interlace by doubling the frame rate. A USA system that broadcast at 59.94 fields per second, or 29.97 frames per second, can now transmit at 59.94 frames per second. This has been shown to improve the fluidity of motion even more, with no perceivable flicker. Even higher frame rates are being introduced, improving fluidity of motion further still.

Compression and digital transmission allow much higher frame rates to be used than in the old analogue days of terrestrial broadcast.

The 1930s saw the birth of television, and a great deal of research was undertaken during this time. As broadcasting needs to remain forever backwards compatible, the decisions made then are still with us now.