What Is Lance BlobV2 and How Does It Improve Large Binary Data Storage?
Lance BlobV2 introduces a PyArrow ExtensionType for large binary objects, enabling lazy streaming, external URI references, range slicing, and flexible ingest or reference modes, while providing clear APIs and schema helpers that address the limitations of previous blob implementations.
What is BlobV2?
BlobV2 is the PyArrow ExtensionType in Lance that represents a “large binary object column” with the logical type name lance.blob.v2.
It is defined in python/python/lance/blob.py as a subclass of pa.ExtensionType with a struct storage containing four fields: data, uri, position, and size.
class BlobType(pa.ExtensionType):
    """A PyArrow extension type for Lance blob columns.
    This is the "logical" type users write. Lance will store it in a compact
    descriptor format, and reads will return descriptors by default.
    """
    def __init__(self) -> None:
        storage_type = pa.struct([
            pa.field("data", pa.large_binary(), nullable=True),
            pa.field("uri", pa.utf8(), nullable=True),
            pa.field("position", pa.uint64(), nullable=True),
            pa.field("size", pa.uint64(), nullable=True),
        ])
        pa.ExtensionType.__init__(self, storage_type, "lance.blob.v2")
The logical type name is lance.blob.v2 and the underlying storage is:
struct<
    data: large_binary,
    uri: utf8,
    position: uint64,
    size: uint64
>
The four fields mean:
data: inline bytes stored directly.
uri: external URI of the blob.
position: offset within the external file.
size: length of the slice in the external file.
A BlobV2 cell can represent three kinds of data:
1. inline bytes
2. full external file referenced by URI
3. a range (position + size) inside an external file
Why BlobV2 is needed
Typical Lance workloads involve large binary objects such as images, videos, audio, multimodal training data, model artifacts, PDFs, and slices of tar/zip/parquet files. These objects share characteristics:
They can be very large.
They may not fit directly into an Arrow value.
They often already exist in external object storage.
Reading them all at once is undesirable.
Precise row‑level or address‑level reads are often required.
Lance’s blob columns are lazy: reads return a BlobFile handle that streams bytes on demand. BlobV2 elevates binary objects from ordinary bytes to a column type with explicit semantics, enabling:
Lazy reads
Streaming reads
External URI references
Range references inside external files
Choice between reference and ingest at write time
Precise reads via take_blobs
Problems solved by BlobV2
Storing large binaries directly in a table
Storing raw bytes in a column leads to rows carrying huge payloads, eager loading of all data, poor streaming, and difficulty referencing existing external objects.
BlobV2 stores a descriptor instead of raw bytes, keeping the table columnar while allowing on‑demand blob retrieval.
{ data: ... }
{ uri: ... }
{ uri: ..., position: ..., size: ... }
Avoiding data duplication when external files already exist
When source files reside in S3, HDFS, or local disks, BlobV2 can store only the URI (or a slice) without copying the data.
Row 1 → s3://bucket/images/0001.jpg
Row 2 → s3://bucket/images/0002.jpg
Row 3 → s3://bucket/videos/a.mp4 (range)
Use Blob.from_uri(...) for explicit external blobs; it also accepts position and size for range references.
Packing multiple blobs inside a single container file
By recording position and size for each member of a tar archive, many payloads can share one physical file.
Blob.from_uri(container_uri, position=m.offset_data, size=m.size)
This suits scenarios such as images inside a tar, audio segments in a large file, video intervals, or archived sub‑objects.
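The position and size idea can be checked with the standard library alone. The sketch below is not Lance code: read_range is a hypothetical helper, and a temporary local tar file stands in for the container. It shows that the (offset_data, size) pair reported by tarfile is enough to recover each member's bytes by seeking into the raw container file.

```python
import io
import os
import tarfile
import tempfile


def read_range(path: str, position: int, size: int) -> bytes:
    """Hypothetical helper: read `size` bytes at byte offset `position`."""
    with open(path, "rb") as f:
        f.seek(position)
        return f.read(size)


payloads = {"a.bin": b"alpha", "b.bin": b"bravo"}
tar_path = os.path.join(tempfile.mkdtemp(), "container.tar")

# Pack both payloads into one uncompressed tar container.
with tarfile.open(tar_path, "w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Record where each member's payload lives inside the container.
with tarfile.open(tar_path, "r") as tf:
    ranges = {m.name: (m.offset_data, m.size) for m in tf.getmembers()}

# Read each payload back from the raw file without using tarfile at all.
recovered = {
    name: read_range(tar_path, position, size)
    for name, (position, size) in ranges.items()
}
```

This only works because the tar is uncompressed; in a compressed archive the byte offsets would not point at the original payload bytes.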
Preventing full‑dataset loads during reads
The recommended way to read blobs is take_blobs, which returns file‑like handles for rows selected by exactly one of ids, indices, or addresses.
ds = lance.dataset("./blobs_v22.lance")
blobs = ds.take_blobs("blob", indices=[0, 1])
with blobs[0] as f:
    data = f.read()
A blob handle can also be passed to av.open(blob) for lazy video decoding:
import av, lance
ds = lance.dataset("./videos_v22.lance")
blob = ds.take_blobs("video", indices=[0])[0]
start_ms, end_ms = 500, 1000
with av.open(blob) as container:
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = "NONKEY"
    start = (start_ms / 1000) / stream.time_base
    end = (end_ms / 1000) / stream.time_base
    container.seek(int(start), stream=stream)
    for frame in container.decode(stream):
        if frame.time is not None and frame.time > end_ms / 1000:
            break
        # process frame
        pass
Implementation
Blob – the user‑side logical value
@dataclass(frozen=True)
class Blob:
    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None
Validation performed in __post_init__ enforces:
data and uri cannot both be set.
uri must be non‑null when used.
If position or size is set, uri is required.
position and size must be set together.
Inline data cannot carry external slice metadata.
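These rules can be sketched as a small frozen dataclass. BlobSketch and is_valid are hypothetical names used only for illustration, not Lance's implementation; the error messages, the order of the checks, and the reading of "non‑null" as "non‑empty" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlobSketch:
    """Hypothetical stand-in for lance.Blob, illustrating the rules only."""

    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None

    def __post_init__(self) -> None:
        has_range = self.position is not None or self.size is not None
        if self.data is not None and self.uri is not None:
            raise ValueError("data and uri cannot both be set")
        if self.uri == "":
            # Assumption: an empty string is treated like a missing uri.
            raise ValueError("uri must be non-empty when used")
        if has_range and (self.position is None or self.size is None):
            raise ValueError("position and size must be set together")
        if has_range and self.data is not None:
            raise ValueError("inline data cannot carry position/size")
        if has_range and self.uri is None:
            raise ValueError("position/size require a uri")


def is_valid(**fields) -> bool:
    """Return True if the field combination passes every check."""
    try:
        BlobSketch(**fields)
        return True
    except ValueError:
        return False
```

For instance, is_valid(uri="s3://bucket/archive.tar", position=4096, size=8192) passes, while is_valid(data=b"x", uri="s3://bucket/a.jpg") does not.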
Helper constructors:
Blob.from_bytes(...)
Blob.from_uri(...)
Blob.empty()
BlobType – PyArrow ExtensionType
BlobV2 registers an extension type named lance.blob.v2 with the struct storage shown earlier.
try:
    pa.register_extension_type(BlobType())
except pa.ArrowKeyError:
    pass
blob_field – schema helper
def blob_field(name: str, *, nullable: bool = True) -> pa.Field:
    """Construct an Arrow field for a Lance blob column."""
    return pa.field(name, BlobType(), nullable=nullable)
Thus blob_field("image") is equivalent to pa.field("image", BlobType(), nullable=True).
blob_array – converting Python values to a BlobArray
def blob_array(values: list[Any]) -> BlobArray:
    """Construct a blob array from Python values.
    Each value must be one of:
    - bytes‑like: inline bytes
    - str: an external URI
    - Blob: explicit inline/uri/empty
    - None: null
    """
    return BlobArray.from_pylist(values)
The function splits inputs into four parallel arrays (data_values, uri_values, position_values, size_values) and builds a pa.ExtensionArray via pa.ExtensionArray.from_storage(BlobType(), storage).
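The split into four parallel arrays can be illustrated without PyArrow. In the sketch below, BlobValue and split_blob_values are hypothetical names, and plain Python lists stand in for the Arrow arrays. A real implementation would presumably also track a validity mask so that a null row can be told apart from an empty blob; this sketch does not.

```python
from typing import Any, List, NamedTuple, Optional, Tuple


class BlobValue(NamedTuple):
    """Hypothetical stand-in for an explicit Blob value."""

    data: Optional[bytes] = None
    uri: Optional[str] = None
    position: Optional[int] = None
    size: Optional[int] = None


def split_blob_values(values: List[Any]) -> Tuple[list, list, list, list]:
    """Split mixed Python inputs into four parallel per-field lists."""
    data_values, uri_values, position_values, size_values = [], [], [], []
    for value in values:
        if value is None:
            row = BlobValue()  # null row: every field stays None
        elif isinstance(value, (bytes, bytearray, memoryview)):
            row = BlobValue(data=bytes(value))  # bytes-like means inline
        elif isinstance(value, str):
            row = BlobValue(uri=value)  # a bare string means external URI
        elif isinstance(value, BlobValue):
            row = value  # explicit value: use its fields as given
        else:
            raise TypeError(f"unsupported blob value: {type(value).__name__}")
        data_values.append(row.data)
        uri_values.append(row.uri)
        position_values.append(row.position)
        size_values.append(row.size)
    return data_values, uri_values, position_values, size_values
```

Each returned list has one entry per input row, which is the shape a struct storage array with four child fields needs.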
Using BlobV2
Write inline bytes (quick‑start)
import lance, pyarrow as pa
from lance import blob_array, blob_field
schema = pa.schema([
    pa.field("id", pa.int64()),
    blob_field("blob"),
])
table = pa.table({
    "id": [1],
    "blob": blob_array([b"hello blob v2"]),
}, schema=schema)
ds = lance.write_dataset(
    table,
    "./blobs_v22.lance",
    data_storage_version="2.2",
)
blob = ds.take_blobs("blob", indices=[0])[0]
with blob as f:
    assert f.read() == b"hello blob v2"
Mix inline bytes, external URIs, ranges, and nulls in one column
import lance, pyarrow as pa
from lance import Blob, blob_array, blob_field
schema = pa.schema([
    pa.field("id", pa.int64()),
    blob_field("blob", nullable=True),
])
rows = pa.table({
    "id": [1, 2, 3, 4],
    "blob": blob_array([
        b"inline-bytes",
        "s3://bucket/path/video.mp4",
        Blob.from_uri("s3://bucket/archive.tar", position=4096, size=8192),
        None,
    ]),
}, schema=schema)
ds = lance.write_dataset(
    rows,
    "./blobs_v22.lance",
    data_storage_version="2.2",
)
Packed external blobs example
import io, tarfile
from pathlib import Path
import lance, pyarrow as pa
from lance import Blob, blob_array, blob_field
payloads = {"a.bin": b"alpha", "b.bin": b"bravo", "c.bin": b"charlie"}
with tarfile.open("container.tar", "w") as tf:
    for name, data in payloads.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))
blob_values = []
with tarfile.open("container.tar", "r") as tf:
    container_uri = Path("container.tar").resolve().as_uri()
    for name in payloads:
        m = tf.getmember(name)
        blob_values.append(
            Blob.from_uri(container_uri, position=m.offset_data, size=m.size)
        )
schema = pa.schema([
    pa.field("name", pa.utf8()),
    blob_field("blob"),
])
rows = pa.table({
    "name": list(payloads.keys()),
    "blob": blob_array(blob_values),
}, schema=schema)
ds = lance.write_dataset(
    rows,
    "./packed_blobs_v22.lance",
    data_storage_version="2.2",
    allow_external_blob_outside_bases=True,
)
Reading with take_blobs
Three selectors are supported:
indices: positional reads within the current snapshot.
ids: logical row‑id reads.
addresses: physical address reads (mainly for debugging).
# By indices
import lance
ds = lance.dataset("./blobs_v22.lance")
blobs = ds.take_blobs("blob", indices=[0, 1])
with blobs[0] as f:
    data = f.read()
# By row ids
row_ids = ds.to_table(columns=[], with_row_id=True).column("_rowid").to_pylist()
blobs = ds.take_blobs("blob", ids=row_ids[:2])
# By physical addresses
row_addrs = ds.to_table(columns=[], with_row_address=True).column("_rowaddr").to_pylist()
blobs = ds.take_blobs("blob", addresses=row_addrs[:2])
external_blob_mode: reference vs ingest
reference
The dataset stores only the external URI (e.g., Blob.from_uri("s3://bucket/images/0001.jpg")). Advantages: fast writes, no data duplication, smaller dataset size, reuse of existing storage. Drawbacks: the original file must remain accessible, permissions must stay valid, and dataset portability is reduced.
ingest
The write process reads the external bytes and stores them inside Lance‑managed storage, making the dataset self‑contained even if the original file is later deleted.
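The practical difference between the two modes shows up once the source file disappears. The sketch below is a conceptual simulation only: plain dicts stand in for blob descriptors, a temporary local file stands in for object storage, and write_reference, write_ingest, and read_descriptor are hypothetical helpers rather than Lance's write path.

```python
import os
import tempfile


def write_reference(path: str) -> dict:
    """Reference mode: remember only where the bytes live."""
    return {"uri": path}


def write_ingest(path: str) -> dict:
    """Ingest mode: copy the bytes into the stored descriptor."""
    with open(path, "rb") as f:
        return {"data": f.read()}


def read_descriptor(descriptor: dict) -> bytes:
    """Return the blob bytes, following the uri when nothing is inline."""
    if descriptor.get("data") is not None:
        return descriptor["data"]
    with open(descriptor["uri"], "rb") as f:
        return f.read()


source_path = os.path.join(tempfile.mkdtemp(), "0001.jpg")
with open(source_path, "wb") as f:
    f.write(b"fake image bytes")

referenced = write_reference(source_path)
ingested = write_ingest(source_path)

# Simulate the original object being deleted after the write.
os.remove(source_path)

ingest_ok = read_descriptor(ingested) == b"fake image bytes"
try:
    read_descriptor(referenced)
    reference_ok = True
except FileNotFoundError:
    reference_ok = False
```

After the deletion, only the ingested descriptor can still be read, which is the self‑containment property described above.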
When to choose which mode
Use ingest when:
Dataset should be self‑contained.
External file lifecycle is uncertain.
Preparing to migrate the dataset.
External files are only temporary inputs.
Use reference when:
External objects have a stable lifecycle.
Copying large files is undesirable.
Multiple datasets share the same original blobs.
Object‑storage paths and permissions are under control.
allow_external_blob_outside_bases
By default, external URIs must map to a registered non‑dataset‑root base path. Setting allow_external_blob_outside_bases=True relaxes this restriction, allowing URIs outside the registered bases, but it complicates portability, permission management, and lifecycle handling.
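The source does not show how Lance resolves a URI against its registered bases, so the following is only a sketch of the general idea, assuming a simple path‑prefix comparison. is_under_any_base is a hypothetical helper, and the bases list is made up for illustration.

```python
from typing import List


def is_under_any_base(uri: str, bases: List[str]) -> bool:
    """Return True if `uri` sits below one of the registered base paths.

    Assumption: bases and URIs are compared as plain string prefixes,
    with a trailing slash added so "images" does not match "images2".
    """
    for base in bases:
        prefix = base.rstrip("/") + "/"
        if uri.startswith(prefix):
            return True
    return False


registered_bases = ["s3://bucket/images"]
inside = is_under_any_base("s3://bucket/images/0001.jpg", registered_bases)
sibling = is_under_any_base("s3://bucket/images2/0001.jpg", registered_bases)
```

Under this sketch, a write with the default setting would accept the first URI and reject the second, while allow_external_blob_outside_bases=True would let both through.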
Differences between BlobV2 and legacy Blob
Key differences:
Lance file format: legacy 0.1, 2.0, 2.1; BlobV2 2.2+.
Arrow type: legacy pa.large_binary() + metadata; BlobV2 pa.ExtensionType.
Type identifier: legacy lance-encoding:blob; BlobV2 lance.blob.v2.
Construction: legacy manually add metadata to a field; BlobV2 use blob_field + blob_array.
Recommended for new datasets: legacy No; BlobV2 Yes.
External URI expression: legacy less structured; BlobV2 native support.
External URI range support: legacy less structured; BlobV2 supports position + size.
Write support (>=2.2): legacy No; BlobV2 Yes.
Read interface: legacy blob‑related APIs; BlobV2 take_blobs.
Version compatibility:
data_storage_version 0.1, 2.0, 2.1: legacy blob metadata supported; BlobV2 not supported.
data_storage_version 2.2+: legacy blob write not supported; BlobV2 write/read supported and recommended.
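The compatibility rules above reduce to a single version comparison. The helper below is a hypothetical sketch, not part of Lance; it assumes the version is given as a numeric "major.minor" string and does not handle any named version aliases.

```python
def blob_style_for(data_storage_version: str) -> str:
    """Pick the blob representation that matches a storage version.

    Returns "blob_v2" for 2.2 and later, "legacy" for anything older.
    """
    major, minor = (int(part) for part in data_storage_version.split(".")[:2])
    return "blob_v2" if (major, minor) >= (2, 2) else "legacy"


styles = {version: blob_style_for(version) for version in ["0.1", "2.0", "2.1", "2.2"]}
```

Comparing (major, minor) as a tuple of integers avoids the string‑comparison pitfall where "2.10" would sort before "2.2".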
Writing legacy blobs
import lance, pyarrow as pa
schema = pa.schema([
pa.field("id", pa.int64()),
pa.field(
"video",
pa.large_binary(),
metadata={"lance-encoding:blob": "true"},
),
])
table = pa.table({"id": [1, 2], "video": [b"foo", b"bar"]}, schema=schema)
ds = lance.write_dataset(table, "./legacy_blob_dataset", data_storage_version="2.1")
Legacy blobs require file formats 0.1, 2.0, or 2.1 and the specific metadata flag.
Writing BlobV2
from lance import blob_array, blob_field
import pyarrow as pa, lance
schema = pa.schema([
pa.field("id", pa.int64()),
blob_field("blob"),
])
table = pa.table({"id": [1], "blob": blob_array([b"hello blob v2"])}, schema=schema)
ds = lance.write_dataset(table, "./blobs_v22.lance", data_storage_version="2.2")
BlobV2 replaces pa.large_binary() plus the field metadata {"lance-encoding:blob": "true"} with the explicit extension type lance.blob.v2 (BlobType), offering a clearer, strongly‑typed design.
Big Data Technology Tribe