Mobile Development 18 min read

Overview of iOS Live Streaming Workflow

This article provides a comprehensive overview of the iOS live‑streaming workflow, detailing the six stages—capture, processing, encoding, packaging, network transmission, and playback—along with sample code for video/audio capture, encoding settings, and RTMP transmission.

Sohu Tech Products

Mar 2, 2022

iOS Live Streaming Workflow Overview

The purpose of this article is to explain the stages a live stream goes through on iOS, from capturing raw media to delivering it to viewers.

📌 Introduction

The live‑streaming process is divided into six phases: Capture, Processing, Encoding, Packaging, Network Transmission, and Playback.

Capture

Processing

Encoding

Packaging

Network Transmission

Playback

📷 Capture

Capture includes both video and audio. On iOS, AVFoundation is used for camera video, ReplayKit for screen recording, and Audio Unit for audio.

Video Capture: Camera

Core Classes AVCaptureXXX

The main classes for camera capture are shown below.

Sample code:

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px"><span style="color: #5c6370; font-style: italic; line-height: 26px">// 1. Create a session</span>
var session = AVCaptureSession.init()

<span style="color: #5c6370; font-style: italic; line-height: 26px">// 2. Get the camera device</span>
guard let device = AVCaptureDevice.default(for: .video) else {
    print("Failed to get back camera")
    return
}

<span style="color: #5c6370; font-style: italic; line-height: 26px">// 3. Create input</span>
let input = try AVCaptureDeviceInput(device: device)
if session.canAddInput(input) {
    session.addInput(input)
}

<span style="color: #5c6370; font-style: italic; line-height: 26px">// 4. Create output</span>
let videoOutput = AVCaptureVideoDataOutput.init()
let pixelBufferFormat = kCVPixelBufferPixelFormatTypeKey as String
// Set YUV video format
videoOutput.videoSettings = [pixelBufferFormat: kCVPixelFormatType_420YpCbCr8BiPlanarFullRange]
videoOutput.setSampleBufferDelegate(self, queue: outputQueue)
if session.canAddOutput(videoOutput) {
    session.addOutput(videoOutput)
}

<span style="color: #5c6370; font-style: italic; line-height: 26px">// 5. Set preview layer</span>
let previewViewLayer = videoConfig.previewView.layer
previewViewLayer.backgroundColor = UIColor.black.cgColor
let layerFrame = previewViewLayer.bounds
let videoPreviewLayer = AVCaptureVideoPreviewLayer(session: session)
videoPreviewLayer.frame = layerFrame
videoConfig.previewView.layer.insertSublayer(videoPreviewLayer, at: 0)

<span style="color: #5c6370; font-style: italic; line-height: 26px">// 6. Process video frames in delegate</span>
func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // TODO: handle video frame
}
</code>

Color Sub‑sampling: YUV

Media data is usually compressed using color sub‑sampling. The code sets the output format to kCVPixelFormatType_420YpCbCr8BiPlanarFullRange, where 420 indicates 4:2:0 chroma subsampling and YpCbCr represents the YUV format. YpCbCr (YUV) – Y is luminance, Cb and Cr are chroma components.

Human eyes are more sensitive to luminance, allowing chroma to be heavily compressed.

Video Capture: Screen Recording

Screen recording can be done inside an app (captures only the app’s UI) or outside (captures the whole device screen, useful for game streaming).

1. In‑App Capture

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px"><span style="color: #5c6370; font-style: italic; line-height: 26px">// iOS screen recording uses ReplayKit</span>
import ReplayKit

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Start recording</span>
RPScreenRecorder.shared().startCapture { sampleBuffer, bufferType, err in
    // handle sample buffer
} completionHandler: { err in
    // handle error
}

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Stop recording</span>
RPScreenRecorder.shared().stopCapture { err in
    // handle error
}
</code>

Tips for in‑app capture:

UI that should not be recorded can be placed on a custom UIWindow.

Enable the front‑camera preview via RPScreenRecorder.shared().cameraPreviewView and add it to the view hierarchy.

2. Out‑of‑App Capture

Requires a Broadcast Upload Extension that provides a SampleHandler class to receive video data.

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menhi, monospace; font-size: 12px">class SampleHandler: RPBroadcastSampleHandler {
    func sohuSportUserDefaults() -> UserDefaults? {
        return UserDefaults(suiteName: "com.xxx.xx")
    }
    override func broadcastStarted(withSetupInfo setupInfo: [String : NSObject]?) {
        // start capture
    }
    override func broadcastPaused() {
        // pause capture
    }
    override func broadcastResumed() {
        // resume capture
    }
    override func broadcastFinished() {
        // finish capture
    }
    // Process incoming sample buffers
    override func processSampleBuffer(_ sampleBuffer: CMSampleBuffer, with sampleBufferType: RPSampleBufferType) {
        switch sampleBufferType {
        case .video:
            // handle video
        case .audioApp:
            // handle app audio
        case .audioMic:
            // handle mic audio
        }
    }
}
</code>

Communication between the extension and the main app can use App Groups, sockets, or CFNotification.

Audio Capture: Audio Unit

Audio Unit provides low‑level access to audio capture with configurable parameters for high‑quality, low‑latency recording.

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px"><span style="color: #5c6370; font-style: italic; line-height: 26px">// Create audio unit</span>
self.component = AudioComponentFindNext(nil, &acd)
OSStatus status = AudioComponentInstanceNew(self.component, &_audio_unit)
if (status != noErr) {
    [self handleAudiounitCreateFail]
}

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Configure stream format</span>
AudioStreamBasicDescription desc = {0}
desc.mSampleRate = 44100
desc.mFormatID = kAudioFormatLinearPCM
desc.mFormatFlags = kAudioFormatFlagIsSignedInteger | kAudioFormatFlagsNativeEndian | kAudioFormatFlagIsPacked
desc.mChannelsPerFrame = 1
desc.mFramesPerPacket = 1
desc.mBitsPerChannel = 16
desc.mBytesPerFrame = desc.mBitsPerChannel / 8 * desc.mChannelsPerFrame
desc.mBytesPerPacket = desc.mBytesPerFrame * desc.mFramesPerPacket

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Set callback</span>
AURenderCallbackStruct callback
callback.inputProcRefCon = (__bridge void *)(self)
callback.inputProc = handleVideoInputBuffer
AudioUnitSetProperty(self.audio_unit, kAudioUnitProperty_StreamFormat, kAudioUnitScope_Output, 1, &desc, sizeof(desc))
AudioUnitSetProperty(self.audio_unit, kAudioOutputUnitProperty_SetInputCallback, kAudioUnitScope_Global, 1, &callback, sizeof(callback))

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Configure AVAudioSession</span>
AVAudioSession *session = [AVAudioSession sharedInstance]
[session setCategory:AVAudioSessionCategoryPlayAndRecord withOptions:AVAudioSessionCategoryOptionDefaultToSpeaker | AVAudioSessionCategoryOptionInterruptSpokenAudioAndMixWithOthers error:nil]
[session setActive:YES withOptions:kAudioSessionSetActiveFlag_NotifyOthersOnDeactivation error:nil]
</code>

🚧 Processing

Processing works on SampleBuffer to apply whitening, smoothing, filters, etc., typically using GPUImage (OpenGL or Metal) which offers over 100 filters.

🛠 Encoding

After processing, audio and video are encoded. Video encoding discards redundant information (spatial, temporal, visual, knowledge, structural) using lossy compression. Common codecs are H.264 and H.265.

Video Encoding Example

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px"><span style="color: #5c6370; font-style: italic; line-height: 26px">// Create encoder</span>
OSStatus status = VTCompressionSessionCreate(NULL, _configuration.videoSize.width, _configuration.videoSize.height, kCMVideoCodecType_H264, NULL, NULL, NULL, VideoCompressonOutputCallback, (__bridge void *)self, &compressionSession);

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Set encoder properties</span>
VTSessionSetProperty(compressionSession, kVTCompressionPropertyKey_MaxKeyFrameInterval, (__bridge CFTypeRef)@(_videoMaxKeyframeInterval));

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Prepare to encode</span>
VTCompressionSessionPrepareToEncodeFrames(compressionSession);

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Encode frame</span>
OSStatus status = VTCompressionSessionEncodeFrame(compressionSession, pixelBuffer, presentationTimeStamp, duration, (__bridge CFDictionaryRef)properties, (__bridge_retained void *)timeNumber, &flags);
</code>

Audio Encoding Example

<code style="padding: 16px; color: #abb2bf; display: -webkit-box; font-family: Operator Mono, Consolas, Monaco, Menlo, monospace; font-size: 12px">#import <AudioToolbox/AudioToolbox.h>

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Create encoder</span>
OSStatus result = AudioConverterNewSpecific(&inputFormat, &outputFormat, 2, requestedCodecs, &m_converter);

<span style="color: #5c6370; font-style: italic; line-height: 26px">// Encode</span>
AudioConverterFillComplexBuffer(m_converter, inputDataProc, &buffers, &outputDataPacketSize, &outBufferList, NULL);
</code>

📦 Packaging

Encoded streams are placed into container formats such as MP4, FLV, or TS. Live streaming commonly uses FLV or TS because they support streaming protocols.

🕸 Network Transmission

RTMP (based on TCP) is typically used. Media data is wrapped into RTMP messages, each consisting of a header (type, length, timestamp) and a body. Messages are split into 128‑byte chunks for transmission.

🖥 Playback

Clients pull the stream, reassemble chunks into messages, demux the container, decode audio and video, synchronize them, and render video while playing audio.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Mobile Development iOS live streaming encoding RTMP Video Capture AVFoundation

Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.