
Fundamentals of Audio and Video Capture for Real‑Time Applications

This article introduces the basic concepts of audio and video capture—including sampling, quantization, PCM storage, YUV formats, camera operation, and pixel resolution—explaining how these technologies enable non‑contact, fully digital government procurement services during the COVID‑19 pandemic.


Preface

In 2020, the COVID‑19 pandemic created a complex situation for epidemic prevention. To reduce gatherings and protect health, the Ministry of Finance issued a notice (Document No. 29 [2020]) requiring government procurement to be carried out electronically as much as possible.

During this period, the GovProcure Cloud platform launched a “non‑contact online service” that avoids face‑to‑face interaction by leveraging real‑time audio‑video and AI technologies.

GovProcure Cloud integrates real‑time audio‑video, AI, and other techniques to move the entire procurement workflow—tender document retrieval, bid preparation, bidding, opening, and expert evaluation—online, achieving a fully electronic, paper‑less process with remote interaction and complete traceability.

Building a successful audio‑video product requires solid knowledge of audio‑video fundamentals. This article, written for beginners, discusses essential concepts of audio‑video technology.

Audio‑Video Technology for Government Procurement, Part 1: Audio‑Video Capture

Real‑time audio‑video applications involve many stages: capture, encoding, preprocessing, transmission, decoding, buffering, rendering, etc. Each stage contains further sub‑modules. We start with audio‑video capture.

1. Audio Capture

Raw audio‑video data is captured from client devices such as cameras and microphones, producing what is called a “raw stream”.

1.1 Audio Sampling

When sound waves reach a microphone, its diaphragm vibrates and modulates an electrical signal (in a classic carbon microphone, for example, a carbon film pressed against an electrode varies its resistance), converting acoustic energy into a voltage signal. After amplification, the analog audio signal is digitized via A/D conversion, typically using Pulse Code Modulation (PCM).

1.2 Audio Quantization

To store the signal as PCM, the audio is quantized. Key dimensions include:

Sampling rate – number of samples per second (e.g., 44.1 kHz for CD quality; by the Nyquist theorem, a rate above 40 kHz is needed to capture the full human hearing range of 20 Hz–20 kHz).

Bit depth – precision of each sample (e.g., 16 bit, 24 bit), affecting dynamic range and fidelity.

Channel count – mono, stereo, or multi‑channel (e.g., 1, 2, 4, 6, 8 channels).

Duration – length of the recording.

CD audio uses a 44,100 Hz sampling rate for historical reasons: early digital recorders stored audio on PAL video tape running at 50 fields per second, using 294 active lines per field with three audio samples per line, which gives 50 × 294 × 3 = 44,100 samples per second.

1.3 Audio Encoding and Storage

PCM storage format example: a mono, 8‑bit, 11 kHz audio stream.

| Sample    | T1   | T2   | T3   | T4   | T5   | T6   |
|-----------|------|------|------|------|------|------|
| Amplitude | 0x05 | 0x06 | 0x04 | 0x07 | 0x05 | 0x07 |

Each sample is stored in one byte (8 bit). For multi‑channel audio, samples for each channel are interleaved (e.g., LRLRLR for stereo). 16‑bit audio can be stored in little‑endian or big‑endian order.
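As a minimal sketch of this interleaving (the sample values below are illustrative, reusing the amplitudes from the table above), 16‑bit little‑endian stereo samples can be packed channel by channel:

```python
import struct

# Illustrative 16-bit sample values for the left and right channels.
left = [5, 6, 4]
right = [7, 5, 7]

# Interleave as L, R, L, R, ... and pack each sample as a
# little-endian signed 16-bit integer ("<h" in struct notation).
pcm = b"".join(struct.pack("<hh", l, r) for l, r in zip(left, right))

print(len(pcm))  # 3 frames x 2 channels x 2 bytes = 12 bytes
```

Swapping `"<hh"` for `">hh"` would produce the big‑endian byte order mentioned above.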

Because PCM is uncompressed, its size follows directly from the quantization parameters: capacity (bytes) = duration (s) × sample rate (Hz) × (bit depth ÷ 8) × channel count.
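The formula translates directly into code; the one‑minute CD‑quality example below is just an illustration:

```python
def pcm_size_bytes(seconds, sample_rate, bit_depth, channels):
    """Uncompressed PCM size: duration x rate x bytes-per-sample x channels."""
    return int(seconds * sample_rate * (bit_depth // 8) * channels)

# One minute of CD-quality stereo audio (44.1 kHz, 16 bit, 2 channels):
size = pcm_size_bytes(60, 44_100, 16, 2)
print(size)  # 10,584,000 bytes, roughly 10 MB per minute
```

The roughly 10 MB per minute result is why raw PCM is almost always compressed before transmission.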

2. Video Capture

Video is a sequence of images (frames) displayed over time. The capture process converts analog light into digital YUV data, which is then encoded (e.g., H.264) for transmission.

2.1 Camera Working Principle

Light passes through the lens and forms an optical image on the sensor. The sensor converts light into electrical signals, which are digitized by A/D conversion, processed by a DSP, and sent to the host device for display.

2.2 Data Formats

The sensor outputs raw RGB data, which the DSP converts to either RGB or YUV formats. YUV is widely used for video because it separates luminance (Y) from chrominance (U and V), matching human visual sensitivity.

Y = 0.299R + 0.587G + 0.114B
U = -0.1678R - 0.3313G + 0.5B
V = 0.5R - 0.4187G - 0.0813B
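Applying the coefficients above to a single pixel can be sketched as follows (a pure‑white pixel is used as a sanity check: full luminance, near‑zero chrominance):

```python
def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel to YUV using the coefficients above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.1678 * r - 0.3313 * g + 0.5 * b
    v = 0.5 * r - 0.4187 * g - 0.0813 * b
    return y, u, v

# Pure white: Y carries all the information, U and V are (nearly) zero.
y, u, v = rgb_to_yuv(255, 255, 255)
print(round(y), round(u), round(v))  # 255 0 0
```

This separation is what makes chroma subsampling possible: U and V can be stored at lower resolution without much visible loss.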

YUV sampling schemes include YUV4:4:4, YUV4:2:2, and YUV4:2:0, which differ in how often chroma components are sampled relative to luma.

YUV 4:4:4 – full sampling, 24 bits per pixel.

YUV 4:2:2 – horizontal 2:1 subsampling, 16 bits per pixel.

YUV 4:2:0 – horizontal 2:1 and vertical 2:1 subsampling, 12 bits per pixel.

Common YUV420 formats (planar YUV420P and semi‑planar YUV420SP) store width × height bytes of Y, followed by width × height ÷ 4 bytes each for U and V.
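The planar layout described above can be sketched as byte offsets into a YUV420P buffer (all Y, then all U, then all V):

```python
def yuv420p_layout(width, height):
    """Byte offsets and sizes of the Y, U, V planes in a YUV420P frame."""
    y_size = width * height
    c_size = y_size // 4          # each chroma plane is quarter-size
    return {
        "Y": (0, y_size),
        "U": (y_size, c_size),
        "V": (y_size + c_size, c_size),
        "total": y_size + 2 * c_size,   # 12 bits (1.5 bytes) per pixel
    }

layout = yuv420p_layout(1920, 1080)
print(layout["total"])  # 3,110,400 bytes per 1080p frame
```

YUV420SP differs only in the chroma region: U and V bytes are interleaved into one plane rather than stored separately.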

2.3 Pixel and Resolution

A pixel is the smallest picture element. Resolution is expressed as width × height (e.g., 1920 × 1080), indicating the total number of pixels. Higher pixel density yields clearer images, while screen size determines the physical size of each pixel.
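Combining resolution with the 12‑bits‑per‑pixel figure from YUV420 gives the raw, pre‑encoding bandwidth, which is what motivates video compression (the 1080p/30 fps numbers below are just an example):

```python
def raw_bitrate_mbps(width, height, fps, bits_per_pixel=12):
    """Raw video bitrate in megabits per second (YUV420 = 12 bits/pixel)."""
    return width * height * bits_per_pixel * fps / 1_000_000

print(raw_bitrate_mbps(1920, 1080, 30))  # about 746 Mbit/s uncompressed
```

At roughly 746 Mbit/s for uncompressed 1080p video, encoders such as H.264 are indispensable for real‑time transmission.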

3. Conclusion

This article covered the fundamentals of audio‑video data capture, including raw data acquisition, sampling, quantization, and common storage formats. Subsequent chapters will discuss typical audio‑video encoding, protocols, and optimization strategies.

Tags: real-time, video, audio, technology, YUV, capture, PCM
Written by

政采云技术

ZCY Technology Team (Zero), based in Hangzhou, is a growth-oriented team passionate about technology and craftsmanship. With around 500 members, we are building comprehensive engineering, project management, and talent development systems. We are committed to innovation and creating a cloud service ecosystem for government and enterprise procurement. We look forward to your joining us.
