Timing in media systems is critical: it keeps pictures stable and sound free of glitches.
The Human Visual System (HVS) describes the interaction of the eye and brain that provides our experience of vision, and a similar interaction between the ear and brain provides hearing.
Common to video and audio is the need to convert digital signals to analogue so we can see and hear them. A television screen is an analogue device that provides images for the eye, and a loudspeaker is an analogue device that provides sound for our ears.
Humans are very good at detecting discontinuities in the audio-visual domain. For example, a small noise in the dead of night can easily wake us, or a strange movement in our peripheral vision alerts us to a potential danger. These are all throw-backs to our caveman days, when danger lurked in the shadows and predators would consider us a healthy meal.
Analogue video has timing signals embedded within it in the form of frame, field and line syncs, and the color subcarrier frequency. Sync information was originally required to keep the electron beam in the cathode ray tube television synchronized with the camera.
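As a simple illustration of how tightly that timing hierarchy is defined, the short sketch below derives the nominal line and field sync rates of a 625-line, 25 frames-per-second system from its frame structure. The figures used are the standard nominal values; the snippet itself is purely illustrative.

```python
# Illustrative only: the nominal timing hierarchy of a 625-line analogue
# video signal, showing how line and field sync rates follow directly
# from the frame structure.

FRAMES_PER_SECOND = 25      # nominal frame rate of a 625-line system
LINES_PER_FRAME = 625       # total lines, including blanking
FIELDS_PER_FRAME = 2        # interlaced scanning: two fields per frame

line_rate_hz = FRAMES_PER_SECOND * LINES_PER_FRAME      # 15,625 Hz
field_rate_hz = FRAMES_PER_SECOND * FIELDS_PER_FRAME    # 50 Hz

print(f"Line sync rate:  {line_rate_hz} Hz")
print(f"Field sync rate: {field_rate_hz} Hz")
```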
As we moved to uncompressed digital, the sync information still existed but found other uses. Digitized audio was often embedded in the redundant sync area of the picture so that audio could be sent on the same cable as the video, a practice widely used by broadcasters.
Only when we moved to compressed digital feeds was the synchronizing information fully optimized. Some synchronization information still had to be carried, but the amount of data dedicated to it was significantly reduced.
Pre-IP systems used dedicated cables with guaranteed bandwidth, so timing and latency were not an issue. However, when broadcasting moved to IP, new challenges emerged.
By transferring video and audio over IP we are essentially sending a synchronous system over an asynchronous network. Timing information that broadcasters took for granted is removed, and new ways of synchronizing video frames and audio samples had to be found.
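To give a feel for the problem, the sketch below (hypothetical class and parameter names, heavily simplified) shows the kind of receiver-side re-timing an IP media receiver has to perform: packets that arrive out of order and with variable delay are reordered by sequence number and held in a buffer until a fixed play-out delay has elapsed, restoring a steady stream from an unsteady network.

```python
import heapq

# Hypothetical sketch of a receiver-side jitter buffer: packets arriving
# out of order and with variable delay are reordered by sequence number
# and only released once a fixed buffering delay has elapsed.

class JitterBuffer:
    def __init__(self, playout_delay=0.040):
        self.playout_delay = playout_delay   # assumed 40 ms of protection
        self._heap = []                      # min-heap keyed on sequence number

    def receive(self, seq, arrival_time, payload):
        """Store an incoming packet; the heap keeps them in sequence order."""
        heapq.heappush(self._heap, (seq, arrival_time, payload))

    def playout(self, now):
        """Return packets, in sequence order, that have been buffered long enough."""
        ready = []
        while self._heap and now - self._heap[0][1] >= self.playout_delay:
            seq, _, payload = heapq.heappop(self._heap)
            ready.append((seq, payload))
        return ready
```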
Furthermore, the timing relationship between video and audio must be maintained to preserve lip-sync; lip-sync error is the phenomenon where spoken words are heard before or after the lips move. If lip-sync errors of more than three or four frames of video occur, viewers find it difficult to watch the program.
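As a rough calculation, assuming a 25 frames-per-second system, three to four frames of error corresponds to well over a tenth of a second of offset between sound and picture:

```python
# Rough arithmetic only: converting a lip-sync error expressed in video
# frames into milliseconds, assuming a 25 frames-per-second system.

frame_rate = 25.0                      # frames per second (assumption)
frame_duration_ms = 1000.0 / frame_rate

for frames_of_error in (1, 2, 3, 4):
    offset_ms = frames_of_error * frame_duration_ms
    print(f"{frames_of_error} frame(s) of error = {offset_ms:.0f} ms")
# 3 frames -> 120 ms, 4 frames -> 160 ms
```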
The Real-time Transport Protocol (RTP) was originally used to provide basic synchronization, but this was insufficient and new methods were developed.
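The synchronization RTP does provide comes from two fields in its fixed 12-byte header: the 16-bit sequence number and the 32-bit media timestamp. The sketch below parses them from a raw packet; it is illustrative only and covers just the fixed header defined in RFC 3550.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550) and return the fields
    a receiver uses for ordering and timing."""
    if len(packet) < 12:
        raise ValueError("packet too short for an RTP header")

    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,            # should be 2
        "payload_type": b1 & 0x7F,     # identifies the media format
        "sequence_number": seq,        # detects loss and reordering
        "timestamp": timestamp,        # media-clock sampling instant
        "ssrc": ssrc,                  # identifies the stream source
    }
```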
The Society of Motion Picture and Television Engineers (SMPTE), responsible for creating many broadcast video, audio and metadata specifications, developed the SMPTE 2022 specification. In effect, this merely packetized the digitized video and audio channels; although it worked, it was inefficient because the redundant line, field and frame sync pulses were still carried, even though they can be represented far more efficiently in digital systems.
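A rough calculation shows why carrying the whole signal is costly. The 1376-byte media payload used below is the figure commonly quoted for SMPTE 2022-6 packets, but treat the numbers as illustrative: transporting a complete 1.485 Gb/s HD-SDI signal, blanking and sync pulses included, requires well over a hundred thousand packets every second.

```python
# Illustrative only: roughly how many RTP packets per second are needed to
# carry a complete HD-SDI signal, blanking and sync included.
# The 1376-byte media payload is an assumption based on the commonly
# quoted SMPTE 2022-6 datagram size.

HD_SDI_BIT_RATE = 1.485e9          # bits per second, nominal HD-SDI
MEDIA_PAYLOAD_BYTES = 1376         # assumed payload per RTP packet

packets_per_second = HD_SDI_BIT_RATE / (MEDIA_PAYLOAD_BYTES * 8)
print(f"About {packets_per_second:,.0f} packets per second")   # roughly 135,000
```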
SMPTE 2110 provides true IP packetization for video, audio and metadata. To guarantee that timing relationships are maintained, 2110 uses the Precision Time Protocol (PTP, IEEE 1588). Each video frame and group of audio samples is given a unique time value derived from the PTP master clock shared across the network. Any system receiving the IP packets can re-align and synchronize them to provide a time-perfect representation of the original audio or video, thus displaying stable pictures, glitch-free audio and error-free lip-sync.
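A simplified sketch of the idea follows. The 90 kHz video and 48 kHz audio media clocks are the rates used in the 2110 suite, but the code itself is illustrative, not a standards implementation: because sender and receiver share the same PTP time, both can derive the same timestamp for the instant a frame or group of samples was captured, and the receiver can re-align packets accordingly.

```python
from fractions import Fraction

# Simplified sketch of deriving RTP timestamps from PTP (IEEE 1588) time,
# as SMPTE ST 2110 does: the PTP time since the epoch is multiplied by the
# media clock rate and reduced modulo 2^32. Every device sharing PTP time
# arrives at the same timestamp for the same capture instant.

VIDEO_CLOCK_HZ = 90_000    # media clock for 2110 video streams
AUDIO_CLOCK_HZ = 48_000    # media clock for 2110 audio streams

def rtp_timestamp(ptp_seconds: Fraction, media_clock_hz: int) -> int:
    """Map a PTP time (seconds since the PTP epoch) to a 32-bit RTP timestamp."""
    ticks = int(ptp_seconds * media_clock_hz)
    return ticks % (1 << 32)

# Example: the same (hypothetical) PTP instant stamped against both clocks.
now = Fraction(1_700_000_000_123_456_789, 1_000_000_000)
print(hex(rtp_timestamp(now, VIDEO_CLOCK_HZ)))
print(hex(rtp_timestamp(now, AUDIO_CLOCK_HZ)))
```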