How RTP and NTP Timestamps Enable Precise Audio‑Video Sync in WebRTC
This article explains the structure and generation of RTP timestamps for audio and video, the role of NTP timestamps as a common time base, how RTP and NTP are correlated through Sender Reports and linear regression, and the calculations used to achieve accurate audio‑video synchronization and target delay management in WebRTC.
RTP Timestamp Basics
RTP timestamps define the sampling instant of media payload data using a monotonically increasing clock whose resolution is set by the media clock rate. Audio typically uses an 8 kHz, 16 kHz or 48 kHz clock matching its sampling rate, while video uses a fixed 90 kHz clock regardless of frame rate (e.g., 24 fps or 30 fps). Each audio packet or video frame receives a timestamp that is placed in the RTP header; the unit of the timestamp therefore depends on the stream’s clock rate.
The RTP header consists of fields V (version, 2 bits), P (padding, 1 bit), X (extension, 1 bit), CC (CSRC count, 4 bits), M (marker, 1 bit), PT (payload type, 7 bits), sequence number (16 bits), timestamp (32 bits), SSRC (32 bits) and optional CSRC identifiers (0‑15 × 32 bits). The timestamp reflects the sampling instant of the first octet of the RTP payload.
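The fixed 12-byte header layout above can be sketched with a minimal parser; the example packet values (payload type 111, SSRC 0x1234) are illustrative, not from the article:

```python
import struct

def parse_rtp_header(data: bytes) -> dict:
    """Parse the fixed 12-byte RTP header (RFC 3550)."""
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", data[:12])
    return {
        "version": b0 >> 6,          # V: 2 bits
        "padding": (b0 >> 5) & 1,    # P: 1 bit
        "extension": (b0 >> 4) & 1,  # X: 1 bit
        "csrc_count": b0 & 0x0F,     # CC: 4 bits
        "marker": b1 >> 7,           # M: 1 bit
        "payload_type": b1 & 0x7F,   # PT: 7 bits
        "sequence": seq,             # 16 bits
        "timestamp": ts,             # 32 bits
        "ssrc": ssrc,                # 32 bits
    }

# Example: version 2, no padding/extension/CSRCs, PT 111, seq 1,
# timestamp 960, SSRC 0x1234 (all values hypothetical)
hdr = struct.pack("!BBHII", 0x80, 111, 1, 960, 0x1234)
fields = parse_rtp_header(hdr)
```

The timestamp field read here is the 32-bit value whose generation the next sections describe.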
Audio Timestamp Generation
For a 48 kHz audio stream, each 20 ms packet contains 960 samples, so the RTP timestamp increments by 960 per packet (20 ms × 48 kHz / 1000). A random initial offset is added when the first audio frame is packed into an RTP packet, and subsequent timestamps increment from that offset.
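The increment rule can be sketched as follows; the random offset value is illustrative and the 32-bit wrap mirrors the RTP timestamp field width:

```python
SAMPLE_RATE_HZ = 48_000
FRAME_MS = 20

def samples_per_frame(rate_hz: int, frame_ms: int) -> int:
    """Number of samples (RTP timestamp ticks) per audio packet."""
    return rate_hz * frame_ms // 1000

def audio_rtp_timestamps(random_offset: int, n_packets: int) -> list:
    """Timestamps for consecutive packets: offset + i * step, mod 2^32."""
    step = samples_per_frame(SAMPLE_RATE_HZ, FRAME_MS)
    return [(random_offset + i * step) & 0xFFFFFFFF for i in range(n_packets)]
```

For 48 kHz / 20 ms audio the step is 960, matching the figure quoted above.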
Video Timestamp Generation
Video timestamps use a 90 kHz clock (unit = 1/90 000 s) regardless of frame rate. The timestamp is derived from the system clock at the moment of frame capture, converted to NTP time in milliseconds, then to RTP units (RTP = NTP_ms × 90). After encoding, a random offset is added before the frame is placed into an RTP packet.
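A minimal sketch of the millisecond-to-RTP conversion, assuming capture time is already expressed as NTP milliseconds (function name and offset are illustrative):

```python
VIDEO_CLOCK_HZ = 90_000

def video_rtp_timestamp(capture_time_ms: int, random_offset: int = 0) -> int:
    """Convert a capture time in milliseconds to a 90 kHz RTP timestamp.

    90 000 ticks per second = 90 ticks per millisecond.
    """
    return (random_offset + capture_time_ms * (VIDEO_CLOCK_HZ // 1000)) & 0xFFFFFFFF
```

One second of capture time thus advances the RTP timestamp by 90 000 ticks.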
NTP Timestamp and Its Relationship to RTP
NTP timestamps count seconds since 1900‑01‑01 00:00:00 and serve as a universal time base for audio‑video synchronization. RTCP Sender Report (SR) packets contain both an RTP timestamp and the corresponding NTP timestamp, allowing receivers to map RTP timestamps to absolute time.
The conversion formula is linear: T_ntp = k·T_rtp + b, where k and b are derived from at least two SR packets. This linear relationship enables the receiver to compute NTP time for any RTP packet.
Audio‑Video Synchronization Principles
Synchronization ensures that audio and video playback progress at the same rate. Three common methods are: (1) sync audio to video, (2) sync video to audio, and (3) sync both to system time. In WebRTC, independent audio and video streams are aligned by converting RTP timestamps to NTP timestamps using the SR‑derived linear mapping.
Linear Regression for RTP‑NTP Mapping
After collecting multiple SR samples, the receiver fits a line to estimate the rate (a_rate) and offset (a_offset). With two SR samples (ntp1, rtp1) and (ntp2, rtp2):

a_rate = (rtp2 − rtp1) / (ntp2 − ntp1)
a_offset = rtp2 − a_rate × ntp2

This gives rtp = a_rate × ntp + a_offset; inverting the line converts any RTP timestamp to NTP time, and vice versa.
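The two-point fit can be sketched directly from the formulas above; the SR sample values in the test are hypothetical (a 90 kHz video clock with zero offset):

```python
def fit_rtp_ntp(sr1: tuple, sr2: tuple) -> tuple:
    """Estimate the linear RTP/NTP mapping from two Sender Report samples.

    Each sample is an (ntp_ms, rtp_timestamp) pair.
    Returns (a_rate, a_offset) such that rtp = a_rate * ntp + a_offset.
    """
    ntp1, rtp1 = sr1
    ntp2, rtp2 = sr2
    a_rate = (rtp2 - rtp1) / (ntp2 - ntp1)   # RTP ticks per NTP millisecond
    a_offset = rtp2 - a_rate * ntp2
    return a_rate, a_offset

def rtp_to_ntp_ms(rtp: float, a_rate: float, a_offset: float) -> float:
    """Invert the fitted line to recover NTP time for an RTP timestamp."""
    return (rtp - a_offset) / a_rate
```

In practice more than two samples are collected and the fit is refreshed as new SR packets arrive.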
Sender Report Structure
The SR packet (PT = 200) includes NTP timestamp (64 bits: 32 bits seconds, 32 bits fractions), RTP timestamp, packet count, and octet count. The NTP fraction is calculated as
fraction = round(microseconds × 2³² / 1 000 000).
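The fraction formula, plus the 1900-epoch conversion mentioned earlier, can be sketched as follows (the Unix-to-NTP epoch offset of 2 208 988 800 seconds is a well-known constant):

```python
NTP_EPOCH_OFFSET_S = 2_208_988_800  # seconds from 1900-01-01 to 1970-01-01

def ntp_fraction(microseconds: int) -> int:
    """Map microseconds within a second to the 32-bit NTP fraction field."""
    return round(microseconds * (1 << 32) / 1_000_000)

def to_ntp_64(unix_time_s: float) -> tuple:
    """Split a Unix time into the SR's (seconds, fraction) NTP fields."""
    seconds = int(unix_time_s) + NTP_EPOCH_OFFSET_S
    frac = ntp_fraction(round((unix_time_s % 1) * 1_000_000))
    return seconds, frac
```

Half a second (500 000 µs) maps to exactly 2^31, the midpoint of the 32-bit fraction range.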
Relative Delay, Target Delay, and Overall Synchronization
Relative delay is the sum of network, encoding, packetization, transmission, buffering, decoding, and rendering delays for each media path. The system computes the most recent audio and video packet timestamps, converts them to NTP time, and applies the formula:
relative_delay = (T_video_recv − T_audio_recv) − (T_video_send − T_audio_send)

Target delay for audio is the expected interval between consecutive audio packets; for video it is the sum of network (jitter buffer), decode, and render delays. The final target delay is the maximum of the expected target delay and the minimum playback delay, constrained by buffer limits.
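The relative-delay formula above is a direct difference of differences; a minimal sketch with all times in NTP milliseconds:

```python
def relative_delay_ms(video_recv: float, audio_recv: float,
                      video_send: float, audio_send: float) -> float:
    """Relative audio/video delay: positive means video lags audio."""
    return (video_recv - audio_recv) - (video_send - audio_send)
```

For example, if audio and video frames are sent at the same instant but video arrives 30 ms later than audio, the relative delay is 30 ms.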
Video Target Delay Calculation
Video target delay = network jitter buffer delay + 95th‑percentile decode time + default render delay (10 ms). Decode time is estimated from the 95th percentile of the last 10 000 decode measurements.
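The three-term sum can be sketched as follows; the nearest-rank percentile here is one simple way to estimate the 95th percentile, not necessarily the exact estimator WebRTC uses:

```python
def percentile_95(samples_ms: list) -> float:
    """95th percentile of decode times via a simple nearest-rank rule."""
    s = sorted(samples_ms)
    idx = min(len(s) - 1, int(0.95 * len(s)))
    return s[idx]

def video_target_delay_ms(jitter_delay_ms: float,
                          decode_samples_ms: list,
                          render_delay_ms: float = 10) -> float:
    """Jitter buffer delay + 95th-percentile decode time + render delay."""
    return jitter_delay_ms + percentile_95(decode_samples_ms) + render_delay_ms
```

With uniformly spread decode samples of 1–100 ms, a 40 ms jitter estimate, and the 10 ms default render delay, the target lands just under 150 ms.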
Audio Target Delay Calculation
Audio target delay = expected packet interval (derived from recent packet arrival times) limited to 75 % of the maximum buffer size.
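The 75 % cap can be sketched as a simple clamp; how the expected interval itself is estimated from arrival times is omitted here:

```python
def audio_target_delay_ms(expected_interval_ms: float,
                          max_buffer_ms: float) -> float:
    """Expected packet interval, capped at 75% of the maximum buffer size."""
    return min(expected_interval_ms, 0.75 * max_buffer_ms)
```

A 120 ms expected interval against a 100 ms buffer is clamped to 75 ms, while a 40 ms interval passes through unchanged.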
Putting It All Together
By continuously updating the linear RTP‑NTP mapping from SR packets, measuring jitter, decode time, and render delay, and adjusting target buffers, WebRTC achieves tight audio‑video synchronization while maintaining low latency and smooth playback.
OPPO Kernel Craftsman
Sharing Linux kernel-related cutting-edge technology, technical articles, technical news, and curated tutorials