
What Is Lance BlobV2 and How Does It Improve Large Binary Data Storage?

Lance BlobV2 introduces a PyArrow ExtensionType for large binary objects, enabling lazy streaming, external URI references, range slicing, and flexible ingest or reference modes, while providing clear APIs and schema helpers that address the limitations of previous blob implementations.

Big Data Technology Tribe

What is BlobV2?

BlobV2 is the PyArrow ExtensionType in Lance that represents a “large binary object column” with the logical type name lance.blob.v2.

It is defined in python/python/lance/blob.py as a subclass of pa.ExtensionType with a struct storage containing four fields: data, uri, position, and size.

class BlobType(pa.ExtensionType):
    """A PyArrow extension type for Lance blob columns.
    This is the "logical" type users write. Lance will store it in a compact
    descriptor format, and reads will return descriptors by default.
    """
    def __init__(self) -> None:
        storage_type = pa.struct([
            pa.field("data", pa.large_binary(), nullable=True),
            pa.field("uri", pa.utf8(), nullable=True),
            pa.field("position", pa.uint64(), nullable=True),
            pa.field("size", pa.uint64(), nullable=True),
        ])
        pa.ExtensionType.__init__(self, storage_type, "lance.blob.v2")

The logical type name is lance.blob.v2 and the underlying storage is:

struct<
  data: large_binary,
  uri: utf8,
  position: uint64,
  size: uint64
>

The four fields mean:

data: inline bytes stored directly.

uri: external URI of the blob.

position: offset within the external file.

size: length of the slice in the external file.

A BlobV2 cell can represent three kinds of data:

1. inline bytes
2. full external file referenced by URI
3. a range (position + size) inside an external file

Why BlobV2 is needed

Typical Lance workloads involve large binary objects such as images, videos, audio, multimodal training data, model artifacts, PDFs, and slices of tar/zip/parquet files. These objects share characteristics:

They can be very large.

They may not fit directly into an Arrow value.

They often already exist in external object storage.

Reading them all at once is undesirable.

Precise row‑level or address‑level reads are often required.

Lance’s blob columns are lazy: reads return a BlobFile handle that streams bytes on demand. BlobV2 elevates binary objects from ordinary bytes to a column type with explicit semantics, enabling:

Lazy reads

Streaming reads

External URI references

Range references inside external files

Choice between reference and ingest at write time

Precise reads via take_blobs
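To make the lazy, range-limited read concrete, here is a minimal standard-library sketch of a file-like handle that exposes only the bytes in [position, position + size) of a local file. RangeReader is a hypothetical name used for illustration and is not Lance's BlobFile; the sketch assumes only the position and size semantics described earlier.

```python
import io
import os
import tempfile


class RangeReader(io.RawIOBase):
    """Hypothetical read-only view over bytes [position, position + size) of a file."""

    def __init__(self, path: str, position: int, size: int) -> None:
        super().__init__()
        self._file = open(path, "rb")  # opening the file reads no payload bytes yet
        self._start = position
        self._size = size
        self._offset = 0  # cursor relative to the start of the range

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        remaining = self._size - self._offset
        if remaining <= 0:
            return 0  # end of the range, even if the file continues
        self._file.seek(self._start + self._offset)
        chunk = self._file.read(min(len(buffer), remaining))
        buffer[: len(chunk)] = chunk
        self._offset += len(chunk)
        return len(chunk)

    def close(self) -> None:
        if not self.closed:
            self._file.close()
        super().close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "container.bin")
    with open(path, "wb") as f:
        f.write(b"HEADER--payload-bytes--TRAILER")
    with RangeReader(path, position=8, size=13) as blob:
        first = blob.read(7)  # pulls only the requested bytes
        rest = blob.read()    # stops at the end of the range, not the file
```

Bytes are fetched only when read is called, and the reader never returns data outside the recorded range.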

Problems solved by BlobV2

Storing large binaries directly in a table

Storing raw bytes in a column leads to rows carrying huge payloads, eager loading of all data, poor streaming, and difficulty referencing existing external objects.

BlobV2 stores a descriptor instead of raw bytes, keeping the table columnar while allowing on‑demand blob retrieval.

{ data: ... }
{ uri: ... }
{ uri: ..., position: ..., size: ... }

Avoiding data duplication when external files already exist

When source files reside in S3, HDFS, or local disks, BlobV2 can store only the URI (or a slice) without copying the data.

Row 1 → s3://bucket/images/0001.jpg
Row 2 → s3://bucket/images/0002.jpg
Row 3 → s3://bucket/videos/a.mp4 (range)

Use Blob.from_uri(...) for explicit external blobs; it also accepts position and size for range references.

Packing multiple blobs inside a single container file

By recording position and size for each member of a tar archive, many payloads can share one physical file.

Blob.from_uri(container_uri, position=m.offset_data, size=m.size)

This suits scenarios such as images inside a tar, audio segments in a large file, video intervals, or archived sub‑objects.
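The position and size bookkeeping can be checked without Lance. The sketch below uses only Python's tarfile module: it builds an uncompressed archive in memory, records (offset_data, size) for each member, and confirms that slicing the container by those two numbers recovers each payload. The payload names are illustrative.

```python
import io
import tarfile

payloads = {"a.bin": b"alpha", "b.bin": b"bravo", "c.bin": b"charlie"}

# Build an uncompressed tar archive in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
raw = buf.getvalue()

# Record (position, size) for each member, as a range reference would.
ranges = {}
with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tf:
    for m in tf.getmembers():
        ranges[m.name] = (m.offset_data, m.size)

# Slicing the container by (position, size) recovers each payload.
recovered = {name: raw[pos : pos + size] for name, (pos, size) in ranges.items()}
```

This only holds for uncompressed containers: in a compressed archive such as .tar.gz, member offsets do not correspond to byte positions in the file on disk.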

Preventing full‑dataset loads during reads

The recommended read path is take_blobs, which returns file‑like handles for the rows selected by exactly one of ids, indices, or addresses.

ds = lance.dataset("./blobs_v22.lance")
blobs = ds.take_blobs("blob", indices=[0, 1])
with blobs[0] as f:
    data = f.read()

Because the returned handle is file‑like, it can be passed straight to av.open(blob) for lazy video decoding:

import av, lance

ds = lance.dataset("./videos_v22.lance")
blob = ds.take_blobs("video", indices=[0])[0]
start_ms, end_ms = 500, 1000
with av.open(blob) as container:
    stream = container.streams.video[0]
    # Decode key frames only, so most of the file is never touched.
    stream.codec_context.skip_frame = "NONKEY"
    # Convert the start time from seconds into stream time_base units.
    start = (start_ms / 1000) / stream.time_base
    container.seek(int(start), stream=stream)
    for frame in container.decode(stream):
        # frame.time is expressed in seconds.
        if frame.time is not None and frame.time > end_ms / 1000:
            break
        # process frame
        pass

Implementation

Blob – the user‑side logical value

@dataclass(frozen=True)
class Blob:
    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None

Validation performed in __post_init__ enforces:

data and uri cannot both be set.

uri must be non‑null when used.

position and size must be set together, and both require a uri.

Inline data cannot carry external slice metadata (position or size).
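The data/uri and position/size rules can be sketched as a small frozen dataclass. BlobSketch and is_valid are hypothetical names, and the error messages are invented for illustration; this is not the library's __post_init__, only one way to express the same constraints.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlobSketch:
    """Hypothetical stand-in for Blob; not the library's implementation."""

    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None

    def __post_init__(self) -> None:
        if self.data is not None and self.uri is not None:
            raise ValueError("a blob cannot have both data and uri")
        if (self.position is None) != (self.size is None):
            raise ValueError("position and size must be set together")
        if self.position is not None and self.uri is None:
            # Also rejects inline data that carries slice metadata.
            raise ValueError("position and size require a uri")


def is_valid(**fields) -> bool:
    """Return True when the given fields pass the checks above."""
    try:
        BlobSketch(**fields)
        return True
    except ValueError:
        return False
```

For example, is_valid(uri="s3://bucket/archive.tar", position=4096, size=8192) passes, while is_valid(data=b"x", position=0, size=1) fails because inline bytes cannot carry a range.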

Helper constructors:

Blob.from_bytes(...)
Blob.from_uri(...)
Blob.empty()

BlobType – PyArrow ExtensionType

BlobV2 registers an extension type named lance.blob.v2 with the struct storage shown earlier.

try:
    pa.register_extension_type(BlobType())
except pa.ArrowKeyError:
    pass

blob_field – schema helper

def blob_field(name: str, *, nullable: bool = True) -> pa.Field:
    """Construct an Arrow field for a Lance blob column."""
    return pa.field(name, BlobType(), nullable=nullable)

Thus blob_field("image") is equivalent to pa.field("image", BlobType(), nullable=True).

blob_array – converting Python values to a BlobArray

def blob_array(values: list[Any]) -> BlobArray:
    """Construct a blob array from Python values.
    Each value must be one of:
    - bytes‑like: inline bytes
    - str: an external URI
    - Blob: explicit inline/uri/empty
    - None: null
    """
    return BlobArray.from_pylist(values)

The function splits inputs into four parallel arrays (data_values, uri_values, position_values, size_values) and builds a pa.ExtensionArray via pa.ExtensionArray.from_storage(BlobType(), storage).
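The splitting step can be illustrated in plain Python, without PyArrow. BlobRef and split_blob_values are hypothetical names, and mapping a null row to four None values is an assumption of this sketch, which therefore does not distinguish a null row from an empty blob.

```python
from typing import Any, List, NamedTuple, Optional, Tuple


class BlobRef(NamedTuple):
    """Hypothetical stand-in for an explicit Blob value."""

    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None


def split_blob_values(values: List[Any]) -> Tuple[list, list, list, list]:
    """Split mixed Python values into four parallel lists, one per struct field."""
    data_values, uri_values, position_values, size_values = [], [], [], []
    for value in values:
        if value is None:
            ref = BlobRef()  # null row (assumed: every field is None)
        elif isinstance(value, (bytes, bytearray, memoryview)):
            ref = BlobRef(data=bytes(value))  # bytes-like: inline bytes
        elif isinstance(value, str):
            ref = BlobRef(uri=value)  # str: external URI
        elif isinstance(value, BlobRef):
            ref = value  # explicit descriptor, used as given
        else:
            raise TypeError(f"unsupported blob value: {type(value).__name__}")
        data_values.append(ref.data)
        uri_values.append(ref.uri)
        position_values.append(ref.position)
        size_values.append(ref.size)
    return data_values, uri_values, position_values, size_values


data_values, uri_values, position_values, size_values = split_blob_values([
    b"inline-bytes",
    "s3://bucket/path/video.mp4",
    BlobRef(uri="s3://bucket/archive.tar", position=4096, size=8192),
    None,
])
```

Each list has one entry per input row, so the four lists line up as the columns of the struct storage.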

Using BlobV2

Write inline bytes (quick‑start)

import lance, pyarrow as pa
from lance import blob_array, blob_field

schema = pa.schema([
    pa.field("id", pa.int64()),
    blob_field("blob"),
])

table = pa.table({
    "id": [1],
    "blob": blob_array([b"hello blob v2"]),
}, schema=schema)

ds = lance.write_dataset(
    table,
    "./blobs_v22.lance",
    data_storage_version="2.2",
)
blob = ds.take_blobs("blob", indices=[0])[0]
with blob as f:
    assert f.read() == b"hello blob v2"

Mix inline bytes, external URIs, ranges, and nulls in one column

import lance, pyarrow as pa
from lance import Blob, blob_array, blob_field

schema = pa.schema([
    pa.field("id", pa.int64()),
    blob_field("blob", nullable=True),
])

rows = pa.table({
    "id": [1, 2, 3, 4],
    "blob": blob_array([
        b"inline-bytes",
        "s3://bucket/path/video.mp4",
        Blob.from_uri("s3://bucket/archive.tar", position=4096, size=8192),
        None,
    ]),
}, schema=schema)

ds = lance.write_dataset(
    rows,
    "./blobs_v22.lance",
    data_storage_version="2.2",
)

Packed external blobs example

import io, tarfile
from pathlib import Path
import lance, pyarrow as pa
from lance import Blob, blob_array, blob_field

payloads = {"a.bin": b"alpha", "b.bin": b"bravo", "c.bin": b"charlie"}

with tarfile.open("container.tar", "w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

blob_values = []
with tarfile.open("container.tar", "r") as tf:
    container_uri = Path("container.tar").resolve().as_uri()
    for name in payloads:
        m = tf.getmember(name)
        blob_values.append(
            Blob.from_uri(container_uri, position=m.offset_data, size=m.size)
        )

schema = pa.schema([
    pa.field("name", pa.utf8()),
    blob_field("blob"),
])

rows = pa.table({
    "name": list(payloads.keys()),
    "blob": blob_array(blob_values),
}, schema=schema)

ds = lance.write_dataset(
    rows,
    "./packed_blobs_v22.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True,
)

Reading with take_blobs

Three selectors are supported:

indices: positional reads within the current snapshot.

ids: logical row‑id reads.

addresses: physical address reads (mainly for debugging).

# By indices
import lance

ds = lance.dataset("./blobs_v22.lance")
blobs = ds.take_blobs("blob", indices=[0, 1])
with blobs[0] as f:
    data = f.read()

# By row ids
row_ids = ds.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist()
blobs = ds.take_blobs("blob", ids=row_ids[:2])

# By physical addresses
row_addrs = ds.to_table(columns=[], with_row_address=True).column("_rowaddr").to_pylist()
blobs = ds.take_blobs("blob", addresses=row_addrs[:2])

external_blob_mode: reference vs ingest

reference

The dataset stores only the external URI (e.g., Blob.from_uri("s3://bucket/images/0001.jpg")). Advantages: fast writes, no data duplication, smaller dataset size, reuse of existing storage. Drawbacks: the original file must remain accessible, permissions must stay valid, and dataset portability is reduced.

ingest

The write process reads the external bytes and stores them inside Lance‑managed storage, making the dataset self‑contained even if the original file is later deleted.
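The practical difference between the two modes can be sketched with local files. resolve_external is a hypothetical helper that uses only the standard library; it does not show Lance's write path or how external_blob_mode is passed, only what each mode ends up keeping.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def resolve_external(uri: str, mode: str) -> dict:
    """Hypothetical helper: turn a file:// URI into a stored descriptor."""
    if mode == "reference":
        return {"data": None, "uri": uri}  # keep only the pointer
    if mode == "ingest":
        path = url2pathname(urlparse(uri).path)
        return {"data": Path(path).read_bytes(), "uri": None}  # copy the bytes
    raise ValueError(f"unknown mode: {mode!r}")


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "0001.jpg"
    src.write_bytes(b"fake-jpeg-bytes")
    uri = src.resolve().as_uri()

    referenced = resolve_external(uri, "reference")
    ingested = resolve_external(uri, "ingest")

    src.unlink()  # the original file is deleted after the write
    source_still_exists = src.exists()
```

After the source file is removed, the ingested descriptor still holds the bytes, while the referenced one points at a location that no longer exists.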

When to choose which mode

Use ingest when:

Dataset should be self‑contained.

External file lifecycle is uncertain.

Preparing to migrate the dataset.

External files are only temporary inputs.

Use reference when:

External objects have a stable lifecycle.

Copying large files is undesirable.

Multiple datasets share the same original blobs.

Object‑storage paths and permissions are under control.

allow_external_blob_outside_bases

By default, external URIs must map to a registered non‑dataset‑root base path. Setting allow_external_blob_outside_bases=True relaxes this restriction, allowing URIs outside the registered bases, but it complicates portability, permission management, and lifecycle handling.

Differences between BlobV2 and legacy Blob

Key differences:

Lance file format: legacy 0.1, 2.0, 2.1; BlobV2 2.2+.

Arrow type: legacy pa.large_binary() + metadata; BlobV2 pa.ExtensionType.

Type identifier: legacy lance-encoding:blob; BlobV2 lance.blob.v2.

Construction: legacy manually add metadata to a field; BlobV2 use blob_field + blob_array.

Recommended for new datasets: legacy No; BlobV2 Yes.

External URI expression: legacy less structured; BlobV2 native support.

External URI range support: legacy less structured; BlobV2 supports position + size.

Write support (>=2.2): legacy No; BlobV2 Yes.

Read interface: legacy blob‑related APIs; BlobV2 take_blobs.

Version compatibility:

data_storage_version 0.1, 2.0, 2.1: legacy blob metadata supported; BlobV2 not supported.

data_storage_version 2.2+: legacy blob write not supported; BlobV2 write/read supported and recommended.
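The compatibility rules above amount to a version check that a pipeline could run before building its schema. blob_flavor is a hypothetical helper, and it assumes the version is given as a numeric "major.minor" string.

```python
def blob_flavor(data_storage_version: str) -> str:
    """Hypothetical helper: which blob style to write for a storage version."""
    major, minor = (int(part) for part in data_storage_version.split(".")[:2])
    # Versions 2.2 and later use BlobV2; 0.1, 2.0, and 2.1 use legacy blobs.
    return "blobv2" if (major, minor) >= (2, 2) else "legacy"


flavors = {v: blob_flavor(v) for v in ["0.1", "2.0", "2.1", "2.2"]}
```

Comparing (major, minor) as a tuple avoids the string-comparison trap where "2.10" would sort before "2.2".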

Writing legacy blobs

import lance, pyarrow as pa

schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field(
        "video",
        pa.large_binary(),
        metadata={"lance-encoding:blob": "true"},
    ),
])

table = pa.table({"id": [1, 2], "video": [b"foo", b"bar"]}, schema=schema)

ds = lance.write_dataset(table, "./legacy_blob_dataset", data_storage_version="2.1")

Legacy blobs require file formats 0.1, 2.0, or 2.1 and the specific metadata flag.

Writing BlobV2

from lance import blob_array, blob_field
import pyarrow as pa, lance

schema = pa.schema([
    pa.field("id", pa.int64()),
    blob_field("blob"),
])

table = pa.table({"id": [1], "blob": blob_array([b"hello blob v2"])}, schema=schema)

ds = lance.write_dataset(table, "./blobs_v22.lance", data_storage_version="2.2")

BlobV2 replaces the legacy pa.large_binary() field tagged with the metadata {"lance-encoding:blob": "true"} with an explicit extension type, lance.blob.v2 (a pa.ExtensionType subclass), offering a clearer, strongly‑typed design.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
