How RTP and NTP Timestamps Enable Precise Audio‑Video Sync in WebRTC
This article explains the structure and generation of RTP timestamps for audio and video, the role of NTP timestamps as a common time base, how RTP and NTP are correlated through Sender Reports and linear regression, and the calculations used to achieve accurate audio‑video synchronization and target delay management in WebRTC.
RTP Timestamp Basics
RTP timestamps define the sampling instant of media payload data using a monotonically increasing clock whose resolution is set by the media clock rate. Audio typically uses an 8 kHz, 16 kHz or 48 kHz clock matching its sampling rate, while video uses a fixed 90 kHz clock regardless of frame rate (e.g., 24 fps or 30 fps). Each audio packet or video frame receives a timestamp that is placed in the RTP header; the unit of the timestamp therefore depends on the stream’s clock rate.
The RTP header consists of fields V (version, 2 bits), P (padding, 1 bit), X (extension, 1 bit), CC (CSRC count, 4 bits), M (marker, 1 bit), PT (payload type, 7 bits), sequence number (16 bits), timestamp (32 bits), SSRC (32 bits) and optional CSRC identifiers (0‑15 × 32 bits). The timestamp reflects the sampling instant of the first octet of the RTP payload.
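The fixed 12-byte header layout above can be sketched with a minimal parser; the example packet values (payload type 111, SSRC 0x1234) are illustrative, not from the article:

```python
import struct

def parse_rtp_header(data: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550)."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": b0 >> 6,          # V: 2 bits
        "padding": (b0 >> 5) & 1,    # P: 1 bit
        "extension": (b0 >> 4) & 1,  # X: 1 bit
        "csrc_count": b0 & 0x0F,     # CC: 4 bits
        "marker": b1 >> 7,           # M: 1 bit
        "payload_type": b1 & 0x7F,   # PT: 7 bits
        "sequence": seq,             # 16 bits
        "timestamp": ts,             # 32 bits
        "ssrc": ssrc,                # 32 bits
    }

# Example: version 2, no padding/extension/CSRCs, PT 111, seq 1,
# timestamp 960, SSRC 0x1234 (all values hypothetical)
hdr = struct.pack("!BBHII", 0x80, 111, 1, 960, 0x1234)
fields = parse_rtp_header(hdr)
```

The timestamp field read here is the 32-bit value whose generation the next sections describe.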
Audio Timestamp Generation
For a 48 kHz audio stream, each 20 ms packet contains 960 samples, so the RTP timestamp increments by 960 per packet (20 ms × 48 kHz / 1000). A random initial offset is added when the first audio frame is packed into an RTP packet, and subsequent timestamps increment from that offset.
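The increment rule can be sketched as follows; the random offset value is illustrative and the 32-bit wrap mirrors the RTP timestamp field width:

```python
SAMPLE_RATE_HZ = 48_000
FRAME_MS = 20

def samples_per_frame(rate_hz: int, frame_ms: int) -> int:
    """Number of samples (RTP timestamp ticks) per audio packet."""
    return rate_hz * frame_ms // 1000

def audio_rtp_timestamps(random_offset: int, n_packets: int) -> list:
    """Timestamps for consecutive packets: offset + i * step, mod 2^32."""
    step = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
    return [(random_offset + i * step) & 0xFFFFFFFF for i in range(n_packets)]
```

For 48 kHz / 20 ms audio the step is 960, matching the figure quoted above.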
Video Timestamp Generation
Video timestamps use a 90 kHz clock (unit = 1/90 000 s) regardless of frame rate. The timestamp is derived from the system clock at the moment of frame capture, converted to NTP time in milliseconds, then to RTP units (RTP = NTP_ms × 90). After encoding, a random offset is added before the frame is placed into an RTP packet.
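A minimal sketch of the millisecond-to-RTP conversion, assuming capture time is already expressed as NTP milliseconds (function name and offset are illustrative):

```python
VIDEO_CLOCK_HZ = 90_000

def video_rtp_timestamp(capture_time_ms: int, random_offset: int = 0) -> int:
    """Convert a capture time in milliseconds to a 90 kHz RTP timestamp.

    90 000 ticks per second = 90 ticks per millisecond.
    """
    return (random_offset + capture_time_ms * (VIDEO_CLOCK_HZ // 1000)) & 0xFFFFFFFF
```

One second of capture time thus advances the RTP timestamp by 90 000 ticks.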
NTP Timestamp and Its Relationship to RTP
NTP timestamps count seconds since 1900‑01‑01 00:00:00 and serve as a universal time base for audio‑video synchronization. RTCP Sender Report (SR) packets contain both an RTP timestamp and the corresponding NTP timestamp, allowing receivers to map RTP timestamps to absolute time.
The conversion formula is linear: T_ntp = k·T_rtp + b, where k and b are derived from at least two SR packets. This linear relationship enables the receiver to compute NTP time for any RTP packet.
Audio‑Video Synchronization Principles
Synchronization ensures that audio and video playback progress at the same rate. Three common methods are: (1) sync audio to video, (2) sync video to audio, and (3) sync both to system time. In WebRTC, independent audio and video streams are aligned by converting RTP timestamps to NTP timestamps using the SR‑derived linear mapping.
Linear Regression for RTP‑NTP Mapping
After collecting multiple SR samples, the receiver fits a line to estimate the rate (a_rate) and offset (a_offset). With two SR samples (ntp1, rtp1) and (ntp2, rtp2):

a_rate = (rtp2 − rtp1) / (ntp2 − ntp1)
a_offset = rtp2 − a_rate × ntp2

This gives rtp = a_rate × ntp + a_offset; inverting the line converts any RTP timestamp to NTP time, and vice versa.
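The two-point fit can be sketched directly from the formulas above; the SR sample values in the test are hypothetical (a 90 kHz video clock with zero offset):

```python
def fit_rtp_ntp(sr1: tuple, sr2: tuple) -> tuple:
    """Estimate the linear RTP/NTP mapping from two Sender Report samples.

    Each sample is an (ntp_ms, rtp_timestamp) pair.
    Returns (a_rate, a_offset) such that rtp = a_rate * ntp + a_offset.
    """
    ntp1, rtp1 = sr1
    ntp2, rtp2 = sr2
    a_rate = (rtp2 - rtp1) / (ntp2 - ntp1)   # RTP ticks per NTP millisecond
    a_offset = rtp2 - a_rate * ntp2
    return a_rate, a_offset

def rtp_to_ntp_ms(rtp: float, a_rate: float, a_offset: float) -> float:
    """Invert the fitted line to recover NTP time for an RTP timestamp."""
    return (rtp - a_offset) / a_rate
```

In practice more than two samples are collected and the fit is refreshed as new SR packets arrive.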
Sender Report Structure
The SR packet (PT = 200) includes NTP timestamp (64 bits: 32 bits seconds, 32 bits fractions), RTP timestamp, packet count, and octet count. The NTP fraction is calculated as
fraction = round(microseconds × 2³² / 1 000 000).
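The fraction formula, plus the 1900-epoch conversion mentioned earlier, can be sketched as follows (the Unix-to-NTP epoch offset of 2 208 988 800 seconds is a well-known constant):

```python
NTP_EPOCH_OFFSET_S = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

def ntp_fraction(microseconds: int) -> int:
    """Map microseconds within a second to the 32-bit NTP fraction field."""
    return round(microseconds * (1 << 32) / 1_000_000)

def to_ntp_64(unix_time_s: float) -> tuple:
    """Split a Unix time into the SR's (seconds, fraction) NTP fields."""
    seconds = int(unix_time_s) + NTP_EPOCH_OFFSET_S
    frac = ntp_fraction(round((unix_time_s % 1) * 1_000_000))
    return seconds, frac
```

Half a second (500 000 µs) maps to exactly 2^31, the midpoint of the 32-bit fraction range.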
Relative Delay, Target Delay, and Overall Synchronization
Relative delay is the sum of network, encoding, packetization, transmission, buffering, decoding, and rendering delays for each media path. The system computes the most recent audio and video packet timestamps, converts them to NTP time, and applies the formula:
relative_delay = (T_video_recv − T_audio_recv) − (T_video_send − T_audio_send)

Target delay for audio is the expected interval between consecutive audio packets; for video it is the sum of network (jitter buffer), decode, and render delays. The final target delay is the maximum of the expected target delay and the minimum playback delay, constrained by buffer limits.
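The relative-delay formula above is a direct difference of differences; a minimal sketch with all times in NTP milliseconds:

```python
def relative_delay_ms(video_recv: float, audio_recv: float,
                      video_send: float, audio_send: float) -> float:
    """Relative audio/video delay: positive means video lags audio."""
    return (video_recv - audio_recv) - (video_send - audio_send)
```

For example, if audio and video frames are sent at the same instant but video arrives 30 ms later than audio, the relative delay is 30 ms.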
Video Target Delay Calculation
Video target delay = network jitter buffer delay + 95th‑percentile decode time + default render delay (10 ms). Decode time is estimated from the 95th percentile of the last 10 000 decode measurements.
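The three-term sum can be sketched as follows; the nearest-rank percentile here is one simple way to estimate the 95th percentile, not necessarily the exact estimator WebRTC uses:

```python
def percentile_95(samples_ms: list) -> float:
    """95th percentile of decode times via a simple nearest-rank rule."""
    s = sorted(samples_ms)
    idx = min(len(s) - 1, int(0.95 * len(s)))
    return s[idx]

def video_target_delay_ms(jitter_delay_ms: float,
                          decode_samples_ms: list,
                          render_delay_ms: float = 10) -> float:
    """Jitter buffer delay + 95th-percentile decode time + render delay."""
    return jitter_delay_ms + percentile_95(decode_samples_ms) + render_delay_ms
```

With uniformly spread decode samples of 1–100 ms, a 40 ms jitter estimate, and the 10 ms default render delay, the target lands just under 150 ms.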
Audio Target Delay Calculation
Audio target delay = expected packet interval (derived from recent packet arrival times) limited to 75 % of the maximum buffer size.
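The 75 % cap can be sketched as a simple clamp; how the expected interval itself is estimated from arrival times is omitted here:

```python
def audio_target_delay_ms(expected_interval_ms: float,
                          max_buffer_ms: float) -> float:
    """Expected packet interval, capped at 75% of the maximum buffer size."""
    return min(expected_interval_ms, 0.75 * max_buffer_ms)
```

A 120 ms expected interval against a 100 ms buffer is clamped to 75 ms, while a 40 ms interval passes through unchanged.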
Putting It All Together
By continuously updating the linear RTP‑NTP mapping from SR packets, measuring jitter, decode time, and render delay, and adjusting target buffers, WebRTC achieves tight audio‑video synchronization while maintaining low latency and smooth playback.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials