
How FLV and RTP Interact in Douyu’s Low‑Latency WebRTC Streaming

This article walks through the end‑to‑end workflow of Douyu’s fast live streaming system: how FLV tags are converted to RTP packets and back, the WebRTC SDP exchange, ICE negotiation, and DTLS handshake, the FLV and RTP header structures, the payload formats for audio (OPUS) and video (H.264), and the server‑side processing pipeline.


1. Simplest Streaming System

Traditional live streaming delivers media as FLV (typically over HTTP). Douyu’s fast live streaming builds a custom WebRTC‑based path on top of this traditional system to achieve lower end‑to‑end latency.

2. Fast Live Pull Process

WebRTC is a complex protocol suite that encapsulates audio/video streams in RTP. The pull process includes:

SDP exchange

ICE negotiation

DTLS handshake

Conversion between FLV tags and RTP packets

RTP/SRTP encryption/decryption

Packet send/receive

This article focuses on the conversion between FLV tags and RTP packets.

3. FLV Protocol Overview

FLV (Flash Video) consists of a header followed by a series of tags. Each tag has a header and data, and tags can be audio, video, or script.

3.1 FLV Header

The 9‑byte header contains the signature "FLV", version 1, flags for audio and video, and a data offset of 0x9. The following PreviousTagSize0 is always 0.
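As a concrete sketch, the 9‑byte header can be parsed as follows (the struct and function names are illustrative, not Douyu’s actual code):

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>
#include <optional>

// Parsed form of the 9-byte FLV file header.
struct FlvHeader {
    uint8_t version;
    bool has_audio;
    bool has_video;
    uint32_t data_offset;  // always 9 for FLV version 1
};

// Parse the 9-byte header; returns std::nullopt if the signature is wrong.
std::optional<FlvHeader> ParseFlvHeader(const uint8_t* buf, size_t len) {
    if (len < 9 || std::memcmp(buf, "FLV", 3) != 0) return std::nullopt;
    FlvHeader h;
    h.version = buf[3];
    h.has_audio = (buf[4] & 0x04) != 0;  // TypeFlagsAudio
    h.has_video = (buf[4] & 0x01) != 0;  // TypeFlagsVideo
    // DataOffset is big-endian; 0x00000009 for version 1.
    h.data_offset = (uint32_t(buf[5]) << 24) | (uint32_t(buf[6]) << 16) |
                    (uint32_t(buf[7]) << 8) | buf[8];
    return h;
}
```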

3.2 FLV File Body

Each tag is preceded by an 11‑byte tag header and followed by a PreviousTagSize field.
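The 11‑byte tag header can be decoded like this (a minimal sketch; note the split 24‑bit timestamp plus an 8‑bit TimestampExtended high byte):

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>

struct FlvTagHeader {
    uint8_t type;        // 8 = audio, 9 = video, 18 = script
    uint32_t data_size;  // payload bytes that follow the 11-byte header
    uint32_t timestamp;  // milliseconds
};

std::optional<FlvTagHeader> ParseFlvTagHeader(const uint8_t* p, size_t len) {
    if (len < 11) return std::nullopt;
    FlvTagHeader t;
    t.type = p[0];
    t.data_size = (uint32_t(p[1]) << 16) | (uint32_t(p[2]) << 8) | p[3];
    // TimestampExtended (p[7]) supplies the high-order byte of the timestamp.
    t.timestamp = (uint32_t(p[7]) << 24) |
                  (uint32_t(p[4]) << 16) | (uint32_t(p[5]) << 8) | p[6];
    // p[8..10] is StreamID, always 0.
    return t;
}
```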

3.2.1 Audio Tags

An audio tag consists of the tag header, a one‑byte audio tag header (sound format, sample rate, sample size, channels, plus an AACPacketType byte for AAC), and the AAC payload. Note that FLV stores raw AAC frames without ADTS headers.

3.2.2 Video Tags

A video tag consists of the tag header, a one‑byte video tag header (frame type and codec ID), an AVCPacketType byte, a composition‑time offset, and the H.264 payload as length‑prefixed NALUs (AVCC format).

4. RTP Protocol Overview

RTP provides real‑time transport over UDP. An RTP packet consists of a fixed 12‑byte header and a payload.

4.1 RTP Header

The first 12 bytes contain fields: version (V), padding (P), extension (X), CSRC count (CC), marker (M), payload type (PT), sequence number, timestamp, SSRC, and optional CSRC list.
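A minimal parser for the fixed 12‑byte header makes the bit layout concrete (struct and function names are illustrative):

```cpp
#include <cstdint>
#include <cstddef>
#include <optional>

// Fixed 12-byte RTP header fields (RFC 3550).
struct RtpHeader {
    uint8_t version;
    bool padding, extension, marker;
    uint8_t csrc_count, payload_type;
    uint16_t sequence_number;
    uint32_t timestamp, ssrc;
};

std::optional<RtpHeader> ParseRtpHeader(const uint8_t* p, size_t len) {
    if (len < 12 || (p[0] >> 6) != 2) return std::nullopt;  // version must be 2
    RtpHeader h;
    h.version = p[0] >> 6;
    h.padding = (p[0] & 0x20) != 0;
    h.extension = (p[0] & 0x10) != 0;
    h.csrc_count = p[0] & 0x0F;   // optional CSRC list follows the fixed header
    h.marker = (p[1] & 0x80) != 0;
    h.payload_type = p[1] & 0x7F;
    h.sequence_number = uint16_t((p[2] << 8) | p[3]);
    h.timestamp = (uint32_t(p[4]) << 24) | (uint32_t(p[5]) << 16) |
                  (uint32_t(p[6]) << 8) | p[7];
    h.ssrc = (uint32_t(p[8]) << 24) | (uint32_t(p[9]) << 16) |
             (uint32_t(p[10]) << 8) | p[11];
    return h;
}
```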

4.1.1 One‑byte Header Extension

The extension block starts with the 16‑bit pattern 0xBEDE; the 16‑bit length field that follows gives the number of 32‑bit words of extension data. Each element begins with one byte holding a 4‑bit ID and a 4‑bit length (stored as the element length minus one).

4.1.2 Two‑byte Header Extension

The extension block starts with the pattern 0x100X, where the low four bits are application‑specific; the length field works as in the one‑byte form, but each element begins with an 8‑bit ID byte and a separate 8‑bit length byte, allowing larger IDs and element sizes.
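To make the element layout concrete, here is a minimal walker for the one‑byte form (the function name is illustrative; `data` points just past the 4‑byte "0xBEDE + length" preamble):

```cpp
#include <cstdint>
#include <cstddef>

// Walk the elements of an RFC 8285 one-byte header extension block,
// calling cb(id, element_data, element_len) for each element.
template <typename Cb>
void ForEachOneByteExtension(const uint8_t* data, size_t len, Cb cb) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = data[i];
        if (b == 0) { ++i; continue; }      // zero bytes are padding
        uint8_t id = b >> 4;
        size_t elem_len = (b & 0x0F) + 1;   // stored length is (len - 1)
        if (id == 15 || i + 1 + elem_len > len) break;  // ID 15 is reserved
        cb(id, data + i + 1, elem_len);
        i += 1 + elem_len;
    }
}
```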

4.2 RTP Payload Types

Payload type (PT) is negotiated via SDP. In Douyu’s SDP, the dynamic values are 125 for H.264 video and 111 for OPUS audio.

4.2.1 RTP Payload for OPUS

OPUS frames (a few hundred bytes) are placed directly in the RTP payload.

4.2.2 RTP Payload for H.264

Defined by RFC 6184. Payload can be:

Single NALU (type 1‑23)

Aggregation packets (types 24‑27; STAP‑A, type 24, is the one used here)

Fragmentation units (types 28‑29; FU‑A, type 28, is the one used here)

4.2.2.1 Single NALU Packet

The NALU header is the first byte of the payload, followed by the NALU data.

4.2.2.2 Aggregation Packet (STAP‑A)

Multiple small NALUs are packed together: STAP‑A header, then for each NALU a 2‑byte size field and the NALU data.

4.2.2.3 Fragmentation Unit (FU‑A)

Large NALUs are split into fragments. Each fragment has a FU‑A indicator, a FU‑A header (S, E, R bits and original NALU type), and a fragment of the NALU payload.
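The three packetization modes above can be told apart from the first payload byte alone; a small classifier and FU‑A header parser might look like this (names are illustrative):

```cpp
#include <cstdint>

enum class H264PacketKind { kSingleNalu, kStapA, kFuA, kOther };

// The low 5 bits of the first payload byte carry the (packet) NALU type.
H264PacketKind ClassifyH264Payload(uint8_t first_byte) {
    uint8_t nalu_type = first_byte & 0x1F;
    if (nalu_type >= 1 && nalu_type <= 23) return H264PacketKind::kSingleNalu;
    if (nalu_type == 24) return H264PacketKind::kStapA;
    if (nalu_type == 28) return H264PacketKind::kFuA;
    return H264PacketKind::kOther;
}

// The FU-A header byte: S (start), E (end), R (reserved), original NALU type.
struct FuAHeader { bool start, end; uint8_t nalu_type; };

FuAHeader ParseFuAHeader(uint8_t b) {
    return { (b & 0x80) != 0, (b & 0x40) != 0, uint8_t(b & 0x1F) };
}
```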

5. FLV Tag ↔ RTP Packet Conversion

5.1 FLV Tag Structure

Four tag types are considered: video sequence header, video frame, audio sequence header, audio frame.

5.2 RTP Packet Structure

RTP header (12 bytes) followed by payload as described above.

5.3 Video Sequence Header → RTP

Extract the SPS and PPS from the FLV sequence header, pack them into a single STAP‑A aggregation packet, and prepend the RTP header.
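A sketch of the STAP‑A packing step (illustrative code, not Douyu’s implementation; the RTP header would be prepended to the returned payload):

```cpp
#include <cstdint>
#include <vector>

// Pack SPS and PPS into one STAP-A payload: a STAP-A NAL header
// (F/NRI reused from the SPS, type 24), then per NALU a 16-bit
// big-endian size followed by the NALU bytes.
std::vector<uint8_t> BuildStapA(const std::vector<uint8_t>& sps,
                                const std::vector<uint8_t>& pps) {
    std::vector<uint8_t> out;
    out.push_back((sps[0] & 0x60) | 24);  // keep NRI, set type = 24 (STAP-A)
    for (const auto* nalu : {&sps, &pps}) {
        out.push_back(uint8_t(nalu->size() >> 8));    // 2-byte NALU size
        out.push_back(uint8_t(nalu->size() & 0xFF));
        out.insert(out.end(), nalu->begin(), nalu->end());
    }
    return out;
}
```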

5.4 Video Frame → RTP

If the NALU size ≤ 1300 B, send as a Single NALU packet. If larger, split into 1300 B fragments, add FU‑A headers, and send each fragment with an RTP header.
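The fragmentation step can be sketched as follows (illustrative code; 1300 mirrors the threshold in the text, and each fragment would get its own RTP header):

```cpp
#include <algorithm>
#include <cstdint>
#include <cstddef>
#include <vector>

// Split one NALU (header byte + payload) into FU-A fragments, each at
// most max_payload bytes including the 2-byte FU indicator/header.
std::vector<std::vector<uint8_t>> FragmentFuA(const std::vector<uint8_t>& nalu,
                                              size_t max_payload = 1300) {
    uint8_t indicator = (nalu[0] & 0xE0) | 28;  // keep F/NRI, type 28 = FU-A
    uint8_t type = nalu[0] & 0x1F;              // original NALU type
    std::vector<std::vector<uint8_t>> frags;
    size_t pos = 1;  // skip the NALU header; it is rebuilt from the FU header
    while (pos < nalu.size()) {
        size_t n = std::min(max_payload - 2, nalu.size() - pos);
        uint8_t fu_header = type;
        if (pos == 1) fu_header |= 0x80;                // S bit: first fragment
        if (pos + n == nalu.size()) fu_header |= 0x40;  // E bit: last fragment
        std::vector<uint8_t> frag = {indicator, fu_header};
        frag.insert(frag.end(), nalu.begin() + pos, nalu.begin() + pos + n);
        frags.push_back(std::move(frag));
        pos += n;
    }
    return frags;
}
```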

5.5 Audio Sequence Header → RTP

Parse the audio codec information (AAC) and store for later use.

5.6 Audio Frame → RTP

Convert AAC frames to OPUS: prepend a 7‑byte ADTS header to each raw AAC frame so a standard AAC decoder accepts it, decode to PCM, re‑encode the PCM as OPUS, then encapsulate each OPUS packet in an RTP payload.
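A sketch of the ADTS step, assuming MPEG‑4 AAC without CRC; the object type, sample‑rate index, and channel count come from the AudioSpecificConfig stored earlier (function name is illustrative):

```cpp
#include <array>
#include <cstdint>
#include <cstddef>

// Build the 7-byte ADTS header prepended to each raw AAC frame.
// aot: audio object type (2 = AAC-LC); sample_rate_index: 4 = 44100 Hz.
std::array<uint8_t, 7> BuildAdtsHeader(int aot, int sample_rate_index,
                                       int channels, size_t raw_frame_len) {
    size_t frame_len = raw_frame_len + 7;  // ADTS length includes the header
    std::array<uint8_t, 7> h{};
    h[0] = 0xFF;  // syncword 0xFFF, MPEG-4, layer 0, ...
    h[1] = 0xF1;  // ... protection_absent = 1 (no CRC)
    h[2] = uint8_t(((aot - 1) << 6) | (sample_rate_index << 2) |
                   ((channels >> 2) & 0x01));
    h[3] = uint8_t(((channels & 0x03) << 6) | ((frame_len >> 11) & 0x03));
    h[4] = uint8_t((frame_len >> 3) & 0xFF);
    h[5] = uint8_t(((frame_len & 0x07) << 5) | 0x1F);  // fullness = 0x7FF (VBR)
    h[6] = 0xFC;
    return h;
}
```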

6. RTP Packet → FLV Tag

6.1 Generate FLV Header

Set TypeFlagsAudio/Video according to the stream content and append PreviousTagSize0.

6.3 RTP (OPUS) → FLV Audio Tag

Parse RTP header, identify PT = 111 (OPUS), decode OPUS to PCM, then pack PCM into an FLV audio tag.

6.4 RTP (H.264) → FLV Video Tag

Parse RTP header, identify PT = 125 (H.264), examine the first payload byte to determine packet type (Single NALU, STAP‑A, FU‑A), reconstruct the original NALU(s), and pack them into FLV video tags.
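Once the NALUs are reconstructed, wrapping them into the body of an FLV video tag might look like this (a sketch of the AVCC framing; field layout per the FLV spec, function name illustrative):

```cpp
#include <cstdint>
#include <vector>

// Build the data portion of an FLV video tag from reassembled NALUs:
// FrameType/CodecID byte, AVCPacketType = 1 (NALU), a 24-bit composition
// time, then each NALU prefixed with a 4-byte big-endian length.
std::vector<uint8_t> BuildFlvVideoTagBody(
        const std::vector<std::vector<uint8_t>>& nalus,
        bool keyframe, uint32_t composition_time) {
    std::vector<uint8_t> body;
    body.push_back((keyframe ? 0x10 : 0x20) | 0x07);  // CodecID 7 = AVC
    body.push_back(0x01);                             // AVCPacketType: NALU
    body.push_back(uint8_t((composition_time >> 16) & 0xFF));
    body.push_back(uint8_t((composition_time >> 8) & 0xFF));
    body.push_back(uint8_t(composition_time & 0xFF));
    for (const auto& n : nalus) {
        uint32_t len = uint32_t(n.size());
        for (int s = 24; s >= 0; s -= 8)              // 4-byte AVCC length
            body.push_back(uint8_t((len >> s) & 0xFF));
        body.insert(body.end(), n.begin(), n.end());
    }
    return body;
}
```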

7. Server‑Side RTP Sending Process

After receiving FLV tags, the server parses them, converts them to RTP (or SRTP), adds encryption padding, and sends them to the client.

8. Server‑Side RTP → SRTP

RTP payloads are encrypted with SRTP; an authentication tag (and, where configured, a master key identifier) is appended to each packet before transmission.

9. Appendix – Relevant Code

9.1 RTP Packet Reception Flow

<code>-> RtpTransport::OnReadPacket
    -> RtpTransport::OnRtpPacketReceived
        -> RtpTransport::DemuxPacket
            -> parsed_packet.Parse // parses RTP header and payload
                -> RtpPacket::ParseBuffer
            -> RtpDemuxer::OnRtpPacket
                -> AudioReceiveStream::OnRtpPacket / RtpVideoStreamReceiver::OnRtpPacket</code>

9.2 Video RTP Packet Parsing Flow

<code>-> RtpVideoStreamReceiver::OnRtpPacket
    -> RtpVideoStreamReceiver::ReceivePacket
        -> RtpVideoStreamReceiver::OnReceivedPayloadData
            -> video_coding::PacketBuffer::InsertPacket
                -> RtpVideoStreamReceiver::OnAssembledFrame
                    -> RtpFrameReferenceFinder::ManageFrame
                        -> RtpFrameReferenceFinder::HandOffFrame
                            -> RtpVideoStreamReceiver::OnCompleteFrame
                                -> VideoReceiveStream::OnCompleteFrame
                                    -> video_coding::FrameBuffer::InsertFrame
                                        -> frames_.emplace(id, FrameInfo()) // insert into jitter buffer

-> VideoReceiveStream::Start
    -> VideoReceiveStream::StartNextDecode
        -> video_coding::FrameBuffer::NextFrame // retrieve NALU from jitter buffer
            -> VideoReceiveStream::HandleEncodedFrame(std::unique_ptr<VideoEncodedFrame> frame)</code>

9.3 Audio RTP Packet Parsing Flow

<code>-> AudioReceiveStream::OnRtpPacket
    -> ChannelReceiveInterface::OnRtpPacket
        -> ChannelReceive::ReceivePacket
            -> ChannelReceive::OnReceivedPayloadData
                -> acm2::AcmReceiver::InsertPacket
                    -> NetEq::InsertPacket // insert into NetEq

-> AudioTransport::NeedMorePlayData
    -> AudioTransportImpl::NeedMorePlayData
        -> AudioMixerImpl::Mix
            -> AudioMixerImpl::GetAudioFromSources
                -> AudioReceiveStream::GetAudioFrameWithInfo
                    -> ChannelReceive::GetAudioFrameWithInfo
                        -> acm2::AcmReceiver::GetAudio
                            -> NetEq::GetAudio // retrieve decoded PCM frame</code>
Written by Douyu Streaming

Official account of the Douyu Streaming Development Department, sharing audio and video technology best practices.