Design and Implementation of a Custom Multimedia Framework Using FFmpeg
The Haro Street Cat mobile team created a custom multimedia framework that wraps FFmpeg 4.2.2 in a C++ core library, with Android/iOS compatibility layers and Java wrappers for transcoding, live streaming, and composition. The framework delivers hardware-accelerated decoding, flexible filter pipelines, and reliable transcoding; in production it raised transcoding coverage above 99 %, cut storage costs by more than 30 %, accelerated video start-up, and improved live-streaming and watermarking performance.
Background
The Haro Street Cat mobile team encountered several multimedia bottlenecks during business development, including incompatibility of third‑party live‑stream SDKs on MTK devices, limited flexibility for video capture, composition and filtering, and unreliable video transcoding due to diverse source formats on Android.
To address these issues, the team built a self‑developed multimedia framework that provides robust hardware decoding, flexible filter pipelines, and reliable transcoding.
Custom Multimedia Architecture
The architecture consists of a core C++ library (libpet-media-core.so) that wraps FFmpeg 4.2.2, a compatibility layer (libpet-media-compat.so) for Android and iOS, and higher‑level Java components for transcoding, live streaming, and video composition.
Core Components
Street Cat Media Core Component – contains the FFmpeg source and core implementations for transcoding, live streaming, and composition (platform‑agnostic).
Street Cat Media Transcoding Component – Java wrapper that supports full‑format transcoding, hardware acceleration when available, bitrate/resolution/fps/GOP configuration, segment transcoding, and moov‑atom pre‑placement.
Street Cat Media Live Component – Java wrapper for FLV/HEVC live streams with dynamic soft/hard decoding switching and API compatibility with third‑party SDKs.
Street Cat Media Composition Component – Java wrapper for audio‑video composition, supporting YUV420P/YUV420SP, logo/watermark filters, and configurable output parameters.
Business Results
Transcoding coverage increased from < 50 % to > 99 % with 99 % playback success after validation.
Average video compression ratio improved by > 30 pp, reducing storage and bandwidth costs.
Moov‑atom pre‑placement enabled near‑instant video start‑up.
Live streaming playback rate increased by > 5 pp due to resolved hardware compatibility issues.
Composition and watermarking features boosted content click‑through and sharing rates.
FFmpeg Overview
FFmpeg provides a comprehensive multimedia framework through libraries such as libavcodec, libavformat, libavfilter, libavutil, libavdevice, libswscale, and libswresample. On Android, hardware decoding goes through OpenMAX, whose most widely used layer is the IL (Integration Layer); it abstracts vendor hardware differences and is packaged into libstagefrighthw.so for use by applications.
Core Process Flow
Open the input file with avformat_open_input and retrieve stream information via avformat_find_stream_info.
For each stream, locate the decoder with avcodec_find_decoder, allocate a context with avcodec_alloc_context3, and open it using avcodec_open2.
Create the output context with avformat_alloc_output_context2, add output streams, and allocate corresponding encoders via avcodec_find_encoder and avcodec_alloc_context3.
Copy the encoder parameters into the output stream parameters using avcodec_parameters_from_context.
Initialize filter graphs (see the filter section below).
Read packets with av_read_frame and decode them using avcodec_decode_video2 or avcodec_decode_audio4 (both deprecated since FFmpeg 3.1 in favor of avcodec_send_packet / avcodec_receive_frame).
If a filter graph exists, push decoded frames into the source filter with av_buffersrc_add_frame_flags, then pull filtered frames from the sink filter with av_buffersink_get_frame.
Encode frames using avcodec_encode_video2 or avcodec_encode_audio2 (likewise superseded by avcodec_send_frame / avcodec_receive_packet), rescale timestamps with av_packet_rescale_ts, and write them to the output via av_interleaved_write_frame.
After processing all packets, flush the filters and encoders, then write the trailer.
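Taken together, these steps form a packet loop. The sketch below shows its shape using the newer avcodec_send_packet / avcodec_receive_frame API rather than the deprecated calls named above; it assumes the ifmt_ctx and stream_ctx globals from the sample that follows, and abbreviates error handling and the filter/encode stages to comments, so it is an illustration rather than the framework's exact code.

```c
/* Sketch of the main transcode loop (send/receive API). */
static int transcode_loop(void) {
    AVPacket packet;
    AVFrame *frame = av_frame_alloc();
    int ret;
    while (av_read_frame(ifmt_ctx, &packet) >= 0) {
        unsigned int i = packet.stream_index;
        StreamContext *sc = &stream_ctx[i];
        /* Bring the packet's timestamps into the decoder's time base. */
        av_packet_rescale_ts(&packet, ifmt_ctx->streams[i]->time_base,
                             sc->dec_ctx->time_base);
        ret = avcodec_send_packet(sc->dec_ctx, &packet);
        while (ret >= 0) {
            ret = avcodec_receive_frame(sc->dec_ctx, frame);
            if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
                break;
            /* Here: push the frame through the filter graph with
             * av_buffersrc_add_frame_flags() / av_buffersink_get_frame(),
             * encode with avcodec_send_frame() / avcodec_receive_packet(),
             * rescale to the output stream's time base, and write with
             * av_interleaved_write_frame(). */
        }
        av_packet_unref(&packet);
    }
    av_frame_free(&frame);
    return 0;
}
```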
Code Example (Transcoding Sample)
#include <libavcodec/avcodec.h>
#include <libavformat/avformat.h>
#include <libavfilter/buffersink.h>
#include <libavfilter/buffersrc.h>
#include <libavutil/opt.h>
#include <libavutil/pixdesc.h>
static AVFormatContext *ifmt_ctx;
static AVFormatContext *ofmt_ctx;

typedef struct FilteringContext {
    AVFilterContext *buffersink_ctx;
    AVFilterContext *buffersrc_ctx;
    AVFilterGraph *filter_graph;
} FilteringContext;
static FilteringContext *filter_ctx;

typedef struct StreamContext {
    AVCodecContext *dec_ctx;
    AVCodecContext *enc_ctx;
} StreamContext;
static StreamContext *stream_ctx;

static int open_input_file(const char *filename) {
    int ret;
    unsigned int i;

    ifmt_ctx = NULL;
    if ((ret = avformat_open_input(&ifmt_ctx, filename, NULL, NULL)) < 0) {
        av_log(NULL, AV_LOG_ERROR, "Cannot open input file\n");
        return ret;
    }
    if ((ret = avformat_find_stream_info(ifmt_ctx, NULL)) < 0) {
        av_log(NULL, AV_LOG_ERROR, "Cannot find stream information\n");
        return ret;
    }

    stream_ctx = av_mallocz_array(ifmt_ctx->nb_streams, sizeof(*stream_ctx));
    if (!stream_ctx)
        return AVERROR(ENOMEM);

    for (i = 0; i < ifmt_ctx->nb_streams; i++) {
        AVStream *stream = ifmt_ctx->streams[i];
        AVCodec *dec = avcodec_find_decoder(stream->codecpar->codec_id);
        AVCodecContext *codec_ctx;
        if (!dec) {
            av_log(NULL, AV_LOG_ERROR, "Failed to find decoder for stream #%u\n", i);
            return AVERROR_DECODER_NOT_FOUND;
        }
        codec_ctx = avcodec_alloc_context3(dec);
        if (!codec_ctx) {
            av_log(NULL, AV_LOG_ERROR, "Failed to allocate the decoder context for stream #%u\n", i);
            return AVERROR(ENOMEM);
        }
        ret = avcodec_parameters_to_context(codec_ctx, stream->codecpar);
        if (ret < 0) {
            av_log(NULL, AV_LOG_ERROR, "Failed to copy decoder parameters to input decoder context for stream #%u\n", i);
            return ret;
        }
        if (codec_ctx->codec_type == AVMEDIA_TYPE_VIDEO || codec_ctx->codec_type == AVMEDIA_TYPE_AUDIO) {
            if (codec_ctx->codec_type == AVMEDIA_TYPE_VIDEO)
                codec_ctx->framerate = av_guess_frame_rate(ifmt_ctx, stream, NULL);
            ret = avcodec_open2(codec_ctx, dec, NULL);
            if (ret < 0) {
                av_log(NULL, AV_LOG_ERROR, "Failed to open decoder for stream #%u\n", i);
                return ret;
            }
        }
        stream_ctx[i].dec_ctx = codec_ctx;
    }

    av_dump_format(ifmt_ctx, 0, filename, 0);
    return 0;
}
/* ... (the rest of the source code follows the same pattern, including open_output_file, init_filters, filter_encode_write_frame, flush_encoder, and main) ... */

Frame Data Storage
FFmpeg uses AVPacket to store encoded frame data and AVFrame for decoded audio/video frames.
Audio Frame Storage
Packed layout: interleaved samples (e.g., L R L R).
Planar layout: separate planes per channel (e.g., L L L L R R R R).
For packed data, frame.data[0] (or frame.extended_data[0]) contains all PCM samples. For planar data, each channel is stored in frame.data[i] (or frame.extended_data[i]). The AVFrame.data array has a fixed size of 8 pointers; channels beyond 8 are accessed via extended_data.
Key audio fields: format (sample format), sample_rate, channel_layout, and nb_samples. The frame duration is nb_samples / sample_rate, and the buffer length is nb_samples × number of channels × bytes per sample (the quantity av_samples_get_buffer_size computes).
Video Frame Storage
Video frames are stored in YUV formats. Common subsampling includes YUV444, YUV422, and YUV420. YUV420 is the most widely used due to its balance of quality and storage efficiency. Variants such as YUV420P (planar) and YUV420SP/NV12 (semi‑planar) differ in how the UV plane is stored.
In AVFrame, video data is accessed via uint8_t *data[AV_NUM_DATA_POINTERS] and int linesize[AV_NUM_DATA_POINTERS]. For YUV420P, three planes are used (Y, U, V); for YUV420SP, two planes are used (Y and interleaved UV).
TimeBase
FFmpeg uses three main time bases:
tbr – the frame rate FFmpeg guesses from the stream (e.g., 25 for 25 fps, i.e. a time base of 1/25).
tbn – the stream (container) time base, used for demuxing and seeking (e.g., 1/90000 for MPEG-TS).
tbc – the codec time base, set in the codec context during encoding.
Timestamp conversion: timestamp (seconds) = pts * av_q2d(time_base). Conversion between different time bases is performed with av_rescale_q, and packet timestamps are rescaled with av_packet_rescale_ts.
Filter Graph
The filter graph processes frames through a chain of filters. After the graph is created, the buffer source filter receives incoming frames, the graph runs them through the configured chain, and the buffer sink filter yields the filtered frames.
Initialization steps:
Allocate a filter graph with avfilter_graph_alloc.
Obtain filter definitions via avfilter_get_by_name.
Create filter contexts with avfilter_graph_create_filter.
Parse the filter specification string and link inputs/outputs using avfilter_graph_parse_ptr.
Validate the completed graph with avfilter_graph_config.
During processing:
Push frames into the graph with av_buffersrc_add_frame_flags.
Retrieve processed frames with av_buffersink_get_frame.
An example filter spec for adding a watermark:
static const char* filters[] = { "movie=/sdcard/0/pet_logo.png[watermark];[in][watermark]overlay=main_w-overlay_w-10:main_h-overlay_h-10[out]" };

For more filter details, refer to the official FFmpeg filter documentation: https://ffmpeg.org/ffmpeg-filters.html.
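The initialization steps above can be sketched for a video stream as follows; this mirrors the pattern of FFmpeg's own transcoding example rather than the framework's exact code, and error checks are omitted for brevity.

```c
#include <stdio.h>
#include <libavcodec/avcodec.h>
#include <libavfilter/avfilter.h>
#include <libavfilter/buffersink.h>
#include <libavfilter/buffersrc.h>
#include <libavutil/mem.h>

/* Build a filter graph: [in] -> spec -> [out] for one video stream. */
static int init_filter_graph(AVCodecContext *dec_ctx, const char *spec,
                             AVFilterContext **src, AVFilterContext **sink,
                             AVFilterGraph **graph) {
    char args[512];
    AVFilterInOut *outputs = avfilter_inout_alloc();
    AVFilterInOut *inputs  = avfilter_inout_alloc();
    const AVFilter *buffersrc  = avfilter_get_by_name("buffer");
    const AVFilter *buffersink = avfilter_get_by_name("buffersink");

    *graph = avfilter_graph_alloc();

    /* The buffer source must know the incoming frames' geometry/timing. */
    snprintf(args, sizeof(args),
             "video_size=%dx%d:pix_fmt=%d:time_base=%d/%d:pixel_aspect=%d/%d",
             dec_ctx->width, dec_ctx->height, dec_ctx->pix_fmt,
             dec_ctx->time_base.num, dec_ctx->time_base.den,
             dec_ctx->sample_aspect_ratio.num,
             dec_ctx->sample_aspect_ratio.den);
    avfilter_graph_create_filter(src, buffersrc, "in", args, NULL, *graph);
    avfilter_graph_create_filter(sink, buffersink, "out", NULL, NULL, *graph);

    /* Link the graph's [in]/[out] labels to our source and sink,
     * parse the spec string, then validate the whole graph. */
    outputs->name = av_strdup("in");
    outputs->filter_ctx = *src;
    outputs->pad_idx = 0;
    outputs->next = NULL;
    inputs->name = av_strdup("out");
    inputs->filter_ctx = *sink;
    inputs->pad_idx = 0;
    inputs->next = NULL;
    avfilter_graph_parse_ptr(*graph, spec, &inputs, &outputs, NULL);
    avfilter_graph_config(*graph, NULL);

    avfilter_inout_free(&inputs);
    avfilter_inout_free(&outputs);
    return 0;
}
```

After this returns, frames are pushed into *src with av_buffersrc_add_frame_flags and pulled from *sink with av_buffersink_get_frame, exactly as in the processing steps above.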
HelloTech
Official Hello technology account, sharing tech insights and developments.