How FLV and RTP Interact in Douyu’s Low‑Latency WebRTC Streaming
This article explains the end‑to‑end workflow of Douyu’s fast live streaming system, detailing how FLV tags are converted to RTP packets and back, covering WebRTC’s SDP/ICE/DTLS handshake, FLV and RTP header structures, payload formats for audio (OPUS) and video (H.264), and the server‑side processing pipeline.
1. Simplest Streaming System
Traditional live streaming uses FLV as the transport protocol. In Douyu’s fast live streaming, a custom WebRTC protocol is built on top of the traditional system to achieve lower end‑to‑end latency.
2. Fast Live Pull Process
WebRTC is a complex protocol suite that encapsulates audio/video streams in RTP. The pull process includes:
SDP exchange
ICE negotiation
DTLS handshake
Conversion between FLV tags and RTP packets
RTP/SRTP encryption/decryption
Packet send/receive
The article focuses on the FLV‑to‑RTP conversion.
3. FLV Protocol Overview
FLV (Flash Video) consists of a header followed by a series of tags. Each tag has a header and data, and tags can be audio, video, or script.
3.1 FLV Header
The 9‑byte header contains the signature "FLV", version 1, flags for audio and video, and a data offset of 0x9. The following PreviousTagSize0 is always 0.
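Decoding the 9-byte header takes only a few lines; this is an illustrative Python sketch (not Douyu's code), with the field layout following the FLV specification:

```python
import struct

def parse_flv_header(data: bytes):
    """Decode the 9-byte FLV file header."""
    assert data[:3] == b"FLV", "bad signature"
    version = data[3]
    flags = data[4]
    has_audio = bool(flags & 0x04)   # TypeFlagsAudio
    has_video = bool(flags & 0x01)   # TypeFlagsVideo
    data_offset = struct.unpack(">I", data[5:9])[0]  # 0x9 for version 1
    return version, has_audio, has_video, data_offset
```

For a typical audio+video stream, `parse_flv_header(b"FLV\x01\x05\x00\x00\x00\x09")` returns `(1, True, True, 9)`.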
3.2 FLV File Body
Each tag is preceded by an 11‑byte tag header and followed by a PreviousTagSize field.
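The 11-byte tag header can be decoded the same way; a minimal sketch following the FLV tag layout (illustrative, not production code):

```python
def parse_flv_tag_header(b: bytes):
    """Decode the 11-byte FLV tag header that precedes each tag body."""
    tag_type = b[0] & 0x1F                     # 8 = audio, 9 = video, 18 = script
    data_size = int.from_bytes(b[1:4], "big")  # size of the tag body in bytes
    # 24-bit timestamp plus an extended byte holding the high 8 bits
    timestamp = int.from_bytes(b[4:7], "big") | (b[7] << 24)
    stream_id = int.from_bytes(b[8:11], "big") # always 0
    return tag_type, data_size, timestamp, stream_id
```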
3.2.1 Audio Tags
Audio tag structure includes a tag header and AAC ADTS payload.
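(In FLV the AAC data is raw, without ADTS framing: the sequence header tag carries the AudioSpecificConfig, and subsequent tags carry raw AAC frames.)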
3.2.2 Video Tags
Video tag structure includes a tag header and H.264 NALU payload.
4. RTP Protocol Overview
RTP provides real‑time transport over UDP. An RTP packet consists of a fixed 12‑byte header and a payload.
4.1 RTP Header
The first 12 bytes contain fields: version (V), padding (P), extension (X), CSRC count (CC), marker (M), payload type (PT), sequence number, timestamp, SSRC, and optional CSRC list.
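The fixed-header fields above map directly onto bit operations; a sketch of the RFC 3550 layout in Python (illustrative only):

```python
import struct

def parse_rtp_header(pkt: bytes):
    """Decode the 12-byte fixed RTP header (RFC 3550)."""
    v  = pkt[0] >> 6          # version, always 2
    p  = (pkt[0] >> 5) & 1    # padding flag
    x  = (pkt[0] >> 4) & 1    # header extension present
    cc = pkt[0] & 0x0F        # number of CSRC entries
    m  = pkt[1] >> 7          # marker bit
    pt = pkt[1] & 0x7F        # payload type
    seq, ts, ssrc = struct.unpack(">HII", pkt[2:12])
    payload_offset = 12 + 4 * cc   # optional CSRC list follows the fixed header
    return {"v": v, "p": p, "x": x, "cc": cc, "m": m, "pt": pt,
            "seq": seq, "ts": ts, "ssrc": ssrc, "payload_offset": payload_offset}
```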
4.1.1 One‑byte Header Extension
Identified by the 16‑bit profile value 0xBEDE; the length field gives the number of 32‑bit words of extension data, and each element packs a 4‑bit ID and a 4‑bit length (actual length minus one) into a single byte.
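Walking the one-byte extension elements looks roughly like this (a sketch of the RFC 8285 one-byte layout, not Douyu's parser):

```python
def parse_one_byte_extensions(ext: bytes):
    """Parse an RFC 8285 one-byte RTP header extension block into {id: data}."""
    assert ext[:2] == b"\xbe\xde", "not a one-byte extension block"
    words = int.from_bytes(ext[2:4], "big")   # length in 32-bit words
    data, i, out = ext[4:4 + 4 * words], 0, {}
    while i < len(data):
        b = data[i]
        if b == 0:                            # padding byte
            i += 1
            continue
        ext_id, length = b >> 4, (b & 0x0F) + 1
        out[ext_id] = data[i + 1:i + 1 + length]
        i += 1 + length
    return out
```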
4.1.2 Two‑byte Header Extension
Identified by a 16‑bit profile value whose upper 12 bits are the fixed pattern 0x100 and whose low 4 bits are app bits; the length field again counts 32‑bit words, but each element carries an 8‑bit ID followed by an 8‑bit length.
4.2 RTP Payload Types
Payload type (PT) is negotiated via SDP, using the dynamic range (96–127). In Douyu's streams, 125 identifies H.264 video and 111 identifies OPUS audio.
4.2.1 RTP Payload for OPUS
OPUS frames (a few hundred bytes) are placed directly in the RTP payload.
4.2.2 RTP Payload for H.264
Defined by RFC 6184. Payload can be:
Single NALU (type 1‑23)
Aggregation packets (types 24‑27; Douyu uses STAP‑A, type 24)
Fragmentation units (types 28‑29; Douyu uses FU‑A, type 28)
4.2.2.1 Single NALU Packet
The NALU header is the first byte of the payload, followed by the NALU data.
4.2.2.2 Aggregation Packet (STAP‑A)
Multiple small NALUs are packed together: STAP‑A header, then for each NALU a 2‑byte size field and the NALU data.
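Building a STAP-A payload is a matter of prepending the header byte and length-prefixing each NALU; a sketch per RFC 6184 (illustrative only):

```python
def build_stap_a(nalus, nri=3):
    """Pack several small NALUs into one STAP-A payload (RFC 6184)."""
    payload = bytearray([(nri << 5) | 24])        # F=0, NRI, type 24 = STAP-A
    for nalu in nalus:
        payload += len(nalu).to_bytes(2, "big")   # 16-bit NALU size
        payload += nalu
    return bytes(payload)
```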
4.2.2.3 Fragmentation Unit (FU‑A)
Large NALUs are split into fragments. Each fragment has a FU‑A indicator, a FU‑A header (S, E, R bits and original NALU type), and a fragment of the NALU payload.
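Reversing the fragmentation means restoring the NALU header from the FU indicator (F/NRI bits) and FU header (type bits), then concatenating the fragment bodies; a sketch, assuming in-order, loss-free fragments:

```python
def reassemble_fu_a(fragments):
    """Rebuild the original NALU from an ordered list of FU-A payloads."""
    indicator, fu_header = fragments[0][0], fragments[0][1]
    assert fu_header & 0x80, "first fragment must carry the S bit"
    assert fragments[-1][1] & 0x40, "last fragment must carry the E bit"
    # Original NALU header: F/NRI from the FU indicator, type from the FU header
    nalu_header = (indicator & 0xE0) | (fu_header & 0x1F)
    return bytes([nalu_header]) + b"".join(f[2:] for f in fragments)
```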
5. FLV Tag ↔ RTP Packet Conversion
5.1 FLV Tag Structure
Four tag types are considered: video sequence header, video frame, audio sequence header, audio frame.
5.2 RTP Packet Structure
RTP header (12 bytes) followed by payload as described above.
5.3 Video Sequence Header → RTP
Extract SPS/PPS from the FLV sequence header, pack them into a STAP‑A NALU, prepend the RTP header.
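Extracting SPS/PPS means walking the AVCDecoderConfigurationRecord carried in the FLV sequence-header body; a sketch of that parse (illustrative, following the ISO 14496-15 record layout):

```python
def parse_avcc(record: bytes):
    """Extract SPS and PPS lists from an AVCDecoderConfigurationRecord."""
    # record[0..4]: version, profile, compat, level, lengthSizeMinusOne
    num_sps = record[5] & 0x1F
    i, sps_list = 6, []
    for _ in range(num_sps):
        n = int.from_bytes(record[i:i + 2], "big")
        sps_list.append(record[i + 2:i + 2 + n]); i += 2 + n
    num_pps = record[i]; i += 1
    pps_list = []
    for _ in range(num_pps):
        n = int.from_bytes(record[i:i + 2], "big")
        pps_list.append(record[i + 2:i + 2 + n]); i += 2 + n
    return sps_list, pps_list
```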
5.4 Video Frame → RTP
If the NALU size ≤ 1300 B, send as a Single NALU packet. If larger, split into 1300 B fragments, add FU‑A headers, and send each fragment with an RTP header.
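That packetization decision can be sketched as follows (illustrative Python, not Douyu's implementation; the 1300-byte budget here covers the FU indicator and header bytes):

```python
MAX_PAYLOAD = 1300

def packetize_h264(nalu: bytes, max_payload: int = MAX_PAYLOAD):
    """Return the list of RTP payloads for one NALU: Single NALU or FU-A fragments."""
    if len(nalu) <= max_payload:
        return [nalu]                        # Single NALU packet
    indicator = (nalu[0] & 0xE0) | 28        # F/NRI from the NALU, type 28 = FU-A
    base = nalu[0] & 0x1F                    # original NALU type
    body = nalu[1:]                          # header is reconstructed at the receiver
    step = max_payload - 2                   # leave room for indicator + FU header
    chunks = [body[i:i + step] for i in range(0, len(body), step)]
    out = []
    for i, c in enumerate(chunks):
        fu = base | (0x80 if i == 0 else 0) | (0x40 if i == len(chunks) - 1 else 0)
        out.append(bytes([indicator, fu]) + c)
    return out
```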
5.5 Audio Sequence Header → RTP
Parse the audio codec information (AAC) and store for later use.
5.6 Audio Frame → RTP
Convert AAC frames to OPUS: add a 7‑byte ADTS header, decode to PCM, encode to OPUS, then encapsulate each OPUS packet in an RTP payload.
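Before an AAC frame from FLV can be fed to a decoder, the 7-byte ADTS header must be built by hand; a minimal builder, assuming AAC-LC (profile 2) and no CRC (illustrative only):

```python
SAMPLE_RATE_INDEX = {96000: 0, 88200: 1, 64000: 2, 48000: 3, 44100: 4,
                     32000: 5, 24000: 6, 22050: 7, 16000: 8}

def adts_header(aac_frame_len, sample_rate=44100, channels=2, profile=2):
    """Build the 7-byte ADTS header for one raw AAC frame."""
    frame_len = aac_frame_len + 7            # 13-bit length includes the header
    sri = SAMPLE_RATE_INDEX[sample_rate]
    hdr = bytearray(7)
    hdr[0] = 0xFF
    hdr[1] = 0xF1                            # syncword cont., MPEG-4, layer 0, no CRC
    hdr[2] = ((profile - 1) << 6) | (sri << 2) | (channels >> 2)
    hdr[3] = ((channels & 0x3) << 6) | (frame_len >> 11)
    hdr[4] = (frame_len >> 3) & 0xFF
    hdr[5] = ((frame_len & 0x7) << 5) | 0x1F  # buffer fullness 0x7FF (VBR)
    hdr[6] = 0xFC                             # fullness low bits, 1 AAC frame
    return bytes(hdr)
```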
6. RTP Packet → FLV Tag
6.1 Generate FLV Header
Set TypeFlagsAudio/Video according to the stream content and append PreviousTagSize0.
6.2 RTP (OPUS) → FLV Audio Tag
Parse RTP header, identify PT = 111 (OPUS), decode OPUS to PCM, then pack PCM into an FLV audio tag.
6.3 RTP (H.264) → FLV Video Tag
Parse RTP header, identify PT = 125 (H.264), examine the first payload byte to determine packet type (Single NALU, STAP‑A, FU‑A), reconstruct the original NALU(s), and pack them into FLV video tags.
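The packet-type dispatch on the first payload byte can be sketched as (illustrative only):

```python
def h264_payload_kind(payload: bytes) -> str:
    """Classify an H.264 RTP payload by the type bits of its first byte."""
    nalu_type = payload[0] & 0x1F
    if 1 <= nalu_type <= 23:
        return "single"        # the byte is itself the NALU header
    if nalu_type == 24:
        return "stap-a"
    if nalu_type == 28:
        return "fu-a"
    return "other"             # STAP-B / MTAP / FU-B, unused here
```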
7. Server‑Side RTP Sending Process
After receiving FLV tags, the server parses them, converts them to RTP (or SRTP), adds encryption padding, and sends them to the client.
8. Server‑Side RTP → SRTP
RTP payloads are encrypted, with additional padding and key identifiers added before transmission.
9. Appendix – Relevant Code
9.1 RTP Packet Reception Flow
<code>-> RtpTransport::OnReadPacket
-> RtpTransport::OnRtpPacketReceived
-> RtpTransport::DemuxPacket
-> parsed_packet.Parse // parses RTP header and payload
-> RtpPacket::ParseBuffer
-> RtpDemuxer::OnRtpPacket
-> AudioReceiveStream::OnRtpPacket / RtpVideoStreamReceiver::OnRtpPacket</code>
9.2 Video RTP Packet Parsing Flow
<code>-> RtpVideoStreamReceiver::OnRtpPacket
-> RtpVideoStreamReceiver::ReceivePacket
-> RtpVideoStreamReceiver::OnReceivedPayloadData
-> video_coding::PacketBuffer::InsertPacket
-> RtpVideoStreamReceiver::OnAssembledFrame
-> RtpFrameReferenceFinder::ManageFrame
-> RtpFrameReferenceFinder::HandOffFrame
-> RtpVideoStreamReceiver::OnCompleteFrame
-> VideoReceiveStream::OnCompleteFrame
-> video_coding::FrameBuffer::InsertFrame
-> frames_.emplace(id, FrameInfo()) // insert into jitter buffer
-> VideoReceiveStream::Start
-> VideoReceiveStream::StartNextDecode
-> video_coding::FrameBuffer::NextFrame // retrieve NALU from jitter buffer
-> VideoReceiveStream::HandleEncodedFrame(std::unique_ptr<VideoEncodedFrame> frame)</code>
9.3 Audio RTP Packet Parsing Flow
<code>-> AudioReceiveStream::OnRtpPacket
-> ChannelReceiveInterface::OnRtpPacket
-> ChannelReceive::ReceivePacket
-> ChannelReceive::OnReceivedPayloadData
-> acm2::AcmReceiver::InsertPacket
-> NetEq::InsertPacket // insert into NetEq
-> AudioTransport::NeedMorePlayData
-> AudioTransportImpl::NeedMorePlayData
-> AudioMixerImpl::Mix
-> AudioMixerImpl::GetAudioFromSources
-> AudioReceiveStream::GetAudioFrameWithInfo
-> ChannelReceive::GetAudioFrameWithInfo
-> acm2::AcmReceiver::GetAudio
-> NetEq::GetAudio // retrieve decoded PCM frame</code>
Douyu Streaming
Official account of Douyu Streaming Development Department, sharing audio and video technology best practices.