
Understanding M4A File Structure and Using the Sample Table Box for Random Access

The article outlines the MP4‑based M4A container’s hierarchical box format, details the Sample Table (stbl) and its essential sub‑boxes (stts, stsc, stco/co64, stsz) that map playback time to file offsets, explains time‑scale conversion, optional boxes, and provides pseudo‑code and parsing tips for precise random‑access seeking.

Tencent Music Tech Team

This article introduces the overall structure of M4A files, focusing on the Sample Table Box (stbl) and how it enables random access within an M4A file.

MP4 File Structure Overview

MP4 files consist of a series of nested boxes, each containing its size, type, and payload. The box hierarchy is similar to object‑oriented inheritance, where all boxes inherit from a base Box class.

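class Box {
    uint32 size;              // total box size in bytes, header included
    uint32 type;              // four-character code, e.g. 'moov'
    if (size == 1) {
        uint64 largesize;     // actual size when it does not fit in 32 bits
    } else if (size == 0) {
        // box extends to the end of the file
    }
    if (type == 'uuid') {
        uint8[16] usertype;
    }
}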

FullBox extends Box with version and flags fields. Many boxes, including the stbl sub-boxes discussed in this article, derive from FullBox:

class FullBox extends Box {
    uint8 version;
    bit[24] flags;
}

The most common boxes are shown in the diagram (omitted here). M4A is essentially an MP4 container that contains only audio tracks.
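As a rough illustration of the nesting described above, the following Python sketch walks the boxes in a byte buffer. The function name iter_boxes and the synthetic test bytes are made up for this example; the header layout (32-bit size, 4-byte type, optional 64-bit largesize) follows the Box definition above.

```python
import struct

def iter_boxes(buf, start=0, end=None):
    """Yield (type, payload_start, box_end) for each box in buf[start:end]."""
    end = len(buf) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", buf, pos)
        header = 8
        if size == 1:                      # 64-bit largesize follows the type
            if pos + 16 > end:
                break
            size = struct.unpack_from(">Q", buf, pos + 8)[0]
            header = 16
        elif size == 0:                    # box runs to the end of the range
            size = end - pos
        if size < header or pos + size > end:
            break                          # malformed or truncated box: stop
        yield box_type.decode("latin-1"), pos + header, pos + size
        pos += size

# Synthetic data: an empty 'free' box followed by 'stbl' containing 'stts'.
inner = struct.pack(">I4s", 12, b"stts") + b"\x00\x00\x00\x00"
outer = struct.pack(">I4s", 8 + len(inner), b"stbl") + inner
sample = struct.pack(">I4s", 8, b"free") + outer

top_level = list(iter_boxes(sample))            # [('free', 8, 8), ('stbl', 16, 28)]
children = list(iter_boxes(sample, 16, 28))     # [('stts', 24, 28)]
```

Container boxes such as moov, trak, mdia, minf, and stbl can be descended by calling the same function again on the payload range of the parent.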

Difference Between M4A and MP4

M4A files normally have no stss box, because every audio sample is a sync sample. An M4A file may or may not contain an elst box; elst is not covered in detail in this article.

What Is the Sample Table Box (stbl)?

The stbl box holds a time‑to‑sample index for all media samples in a track. Its main purpose is to convert a playback time into the file offset of the corresponding sample, which is crucial for seeking in streaming playback.

A simple average‑bitrate calculation can estimate an offset, but precise random access requires parsing the stbl hierarchy.
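For comparison, the average-bitrate estimate can be sketched as below. This is only an approximation under the assumption that audio data is spread evenly over the duration; the function and parameter names (estimate_offset, data_start, data_size) are hypothetical and do not come from any library.

```python
def estimate_offset(time_s, duration_s, data_start, data_size):
    """Guess a byte offset by assuming a constant average bitrate."""
    if duration_s <= 0:
        return data_start
    # Clamp to [0, 1] so out-of-range times stay inside the media data.
    fraction = min(max(time_s / duration_s, 0.0), 1.0)
    return data_start + int(fraction * data_size)

# Halfway through a 209 s track whose media data starts at byte 20000.
guess = estimate_offset(104.5, 209.0, 20000, 1300000)   # 670000
```

The result usually lands in the middle of a sample rather than on a sample boundary, which is why the stbl lookup is needed for precise seeking.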

Pseudo‑code for Seeking

long seek(long timeMs, long timeScale) {
    long time = timeMs * timeScale / 1000;    // [1] multiply first to keep sub-second precision
    if (trak contains elst) {
        time += elst.get(time);               // [2]
    }
    long sample = stts.get(time);             // [3]
    if (stbl contains stss) {
        sample = stss.get(sample);            // [4]
    }
    long chunk, firstSampleInChunk;
    [chunk, firstSampleInChunk] = stsc.get(sample); // [5]
    long chunkOffset;
    if (stbl contains stco) {
        chunkOffset = stco.get(chunk);        // [6]
    } else if (stbl contains co64) {
        chunkOffset = co64.get(chunk);        // [6]
    } else {
        throw "Invalid file header!";
    }
    long sampleOffsetInChunk = stsz.get(sample, firstSampleInChunk); // [7]
    return chunkOffset + sampleOffsetInChunk; // [8]
}

The required boxes are:

stts (Decoding Time to Sample Box)

stsc (Sample to Chunk Box)

stco or co64 (Chunk Offset Box)

stsz (Sample Size Box)

Optional boxes include elst, stss, and ctts.

1. Time‑Unit Conversion

MP4 uses its own time units. The conversion is:

time = realTimeInSeconds * timeScale

timeScale is the number of time units per second. For a track's sample tables it is read from the track's mdhd box; for audio tracks it is commonly the sample rate.
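A small Python sketch of the conversion, assuming the seek position arrives in milliseconds. The helper names ms_to_units and units_to_ms are invented for this example. Multiplying before dividing avoids losing the sub-second part to integer division.

```python
def ms_to_units(time_ms, timescale):
    """Convert milliseconds to media time units (rounds down)."""
    return time_ms * timescale // 1000

def units_to_ms(units, timescale):
    """Convert media time units back to milliseconds (rounds down)."""
    return units * 1000 // timescale

units = ms_to_units(1500, 44100)          # 66150
truncated = (1500 // 1000) * 44100        # 44100: dividing first drops 0.5 s
back = units_to_ms(units, 44100)          # 1500
```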

2. Handling Time Offsets (elst)

If a track contains an elst box, timestamps are shifted (e.g., audio lagging video). Details are omitted but can be found in the ISO/IEC 14496‑12 spec and FFmpeg's mov.c implementation.

3. Decoding Time to Sample (stts)

The stts box stores pairs of sample_count and sample_delta to describe how many samples share the same duration.

class stts extends FullBox {
    uint32 entry_count;
    // stored in the file as interleaved (sample_count, sample_delta) pairs
    uint32[entry_count] sample_count;
    uint32[entry_count] sample_delta;
}

Example entry: 14 samples, each with a delta of 10 time units.
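The lookup that step [3] of the pseudo-code relies on can be sketched as follows. This is a minimal version that assumes the stts entries have already been parsed into a list of (sample_count, sample_delta) tuples; the function name sample_for_time and the 0-based sample index are choices made for this example.

```python
def sample_for_time(stts_entries, time_units):
    """Return (sample_index, sample_start_time) for a time in media units.

    sample_index is 0-based. Returns None if the time is past the last sample.
    """
    sample = 0      # index of the first sample of the current entry
    elapsed = 0     # start time of the current entry, in media units
    for count, delta in stts_entries:
        span = count * delta
        if delta > 0 and time_units < elapsed + span:
            n = (time_units - elapsed) // delta
            return sample + n, elapsed + n * delta
        sample += count
        elapsed += span
    return None

# The example entry above: 14 samples, each lasting 10 time units.
hit = sample_for_time([(14, 10)], 35)              # (3, 30)
# A second run of 2 samples lasting 5 units each.
later = sample_for_time([(14, 10), (2, 5)], 147)   # (15, 145)
```

The second element of the result is the actual start time of the sample found, which a player can report back as the position it really seeked to.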

4. Sync Sample Box (stss)

For video, stss lists key frames. M4A audio does not use this box because every audio sample is a sync sample.

5. Sample to Chunk (stsc)

Media data is stored in chunks inside the mdat box. The stsc box maps a sample to its containing chunk.

class stsc extends FullBox {
    uint32 entry_count;
    // stored in the file as interleaved triples, one per run of chunks
    uint32[entry_count] first_chunk;
    uint32[entry_count] samples_per_chunk;
    uint32[entry_count] sample_description_index;
}

Example: chunks 1‑49 contain 10 samples each; from chunk 50 onward, 20 samples per chunk.
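One way to resolve a sample to its chunk is sketched below in Python. It assumes parsed stsc entries as (first_chunk, samples_per_chunk, sample_description_index) tuples, a 0-based sample index, and a chunk_count taken from the number of entries in stco or co64. The function name chunk_for_sample is hypothetical.

```python
def chunk_for_sample(stsc_entries, sample, chunk_count):
    """Return (chunk_number, first_sample_in_chunk) for a 0-based sample.

    chunk_number is 1-based, as in the file. Returns None if out of range.
    """
    first_sample = 0    # first sample covered by the current run of chunks
    for i, (first_chunk, per_chunk, _desc) in enumerate(stsc_entries):
        # A run lasts until the next entry's first_chunk, or the last chunk.
        if i + 1 < len(stsc_entries):
            next_first = stsc_entries[i + 1][0]
        else:
            next_first = chunk_count + 1
        run_samples = (next_first - first_chunk) * per_chunk
        if sample < first_sample + run_samples:
            k = (sample - first_sample) // per_chunk
            return first_chunk + k, first_sample + k * per_chunk
        first_sample += run_samples
    return None

# The example above, assuming the file has 60 chunks in total.
entries = [(1, 10, 1), (50, 20, 1)]
a = chunk_for_sample(entries, 489, 60)   # (49, 480): last sample of chunk 49
b = chunk_for_sample(entries, 515, 60)   # (51, 510): chunks 50+ hold 20 each
```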

6. Chunk Offset (stco / co64)

The stco box gives the file offset of each chunk. co64 is the 64‑bit version.

class stco extends FullBox {
    uint32 entry_count;
    uint32[entry_count] chunk_offset;   // uint64 in co64
}

Chunk numbers start at 1, so subtract 1 when indexing the array.

7. Sample Size (stsz)

The stsz box provides the size of each sample. If sample_size is non-zero, all samples share that size and no per-sample table follows; otherwise, individual sizes are stored in entry_size.

class stsz extends FullBox {
    uint32 sample_size;
    uint32 sample_count;
    if (sample_size == 0) {
        uint32[sample_count] entry_size;
    }
}

The final byte offset of a sample is the offset of its chunk plus the sizes (from stsz) of all samples that precede it within that chunk.
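Steps [6] to [8] of the pseudo-code can be sketched together in Python. The sketch assumes the tables are already parsed into plain lists, that sample indices are 0-based, and that chunk numbers are 1-based as in the file. The names size_of and sample_offset are illustrative only.

```python
def size_of(sample, sample_size, entry_sizes):
    """Size in bytes of one sample, following the stsz rule."""
    return sample_size if sample_size != 0 else entry_sizes[sample]

def sample_offset(sample, chunk, first_sample_in_chunk,
                  chunk_offsets, sample_size, entry_sizes):
    """File offset of a sample: chunk offset plus preceding sample sizes."""
    offset = chunk_offsets[chunk - 1]          # chunk numbers start at 1
    for i in range(first_sample_in_chunk, sample):
        offset += size_of(i, sample_size, entry_sizes)
    return offset

# Variable sizes: sample 3 is the second sample of chunk 2.
variable = sample_offset(3, 2, 2, [1000, 5000], 0, [100, 200, 300, 400])  # 5300
# Fixed size of 256 bytes: sample 3 is the fourth sample of chunk 1.
fixed = sample_offset(3, 1, 0, [1000, 5000], 256, [])                     # 1768
```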

Parsing Tips

Read the first 8 bytes of a box: 4‑byte size followed by 4‑byte type.

When converting byte[] to integers, use big‑endian order.

When parsing arrays that are interleaved (e.g., stts), read all fields of one entry on each iteration:

for (uint32 i = 0; i < entry_count; i++) {
    sample_count[i] = readInt();
    sample_delta[i] = readInt();
}
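Putting these tips together, a minimal Python sketch for reading an stts payload might look like this. It assumes the 8-byte box header (size and type) has already been consumed, so the payload starts at the FullBox version and flags. The function name parse_stts and the test bytes are made up for the example.

```python
import struct

def parse_stts(payload):
    """Parse an stts payload into (version, [(sample_count, sample_delta)])."""
    # ">" selects big-endian; "I" is an unsigned 32-bit integer.
    version_flags, entry_count = struct.unpack_from(">II", payload, 0)
    version = version_flags >> 24      # top byte; the low 24 bits are flags
    entries = []
    pos = 8
    for _ in range(entry_count):
        # Entries are interleaved: count and delta are read as a pair.
        count, delta = struct.unpack_from(">II", payload, pos)
        entries.append((count, delta))
        pos += 8
    return version, entries

# Synthetic payload: version 0, flags 0, two entries.
payload = struct.pack(">II", 0, 2) + struct.pack(">IIII", 14, 10, 2, 5)
parsed = parse_stts(payload)           # (0, [(14, 10), (2, 5)])
```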

Conclusion

Compared with formats like FLAC that aim to minimize every bit, MP4 provides generous metadata, making random access relatively straightforward. Typical key‑sample intervals in MP4 are 0.02‑0.04 s, whereas FLAC’s seek table uses ~10 s intervals.

Observed stbl sizes are small relative to total file size (e.g., 19 KB for a 1.3 MB file of 209 s duration).

Further exploration of optional boxes (elst, stss, ctts, etc.) can be done by consulting the ISO/IEC 14496‑12 specification and related online resources.

References

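ISO/IEC 14496‑12 (see Appendix A, pages 11‑12).

"MP4文件格式的解析,以及MP4文件的分割算法" (Parsing the MP4 file format, and an algorithm for splitting MP4 files).

"MP4文件elst研究" (A study of elst in MP4 files).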
