Debugging Lance‑Spark & Lance‑Ray on HDFS: Build Wheels and Fix Common Errors

This guide walks through building custom pylance and lance‑namespace wheels to enable HDFS support, resolves common ModuleNotFoundError, Hive dependency, and native library issues, clarifies correct table_id usage, and provides a complete Python script that reads and modifies a Lance dataset with Ray.

Big Data Technology Tribe

Testing Objectives

Write a dataset to HDFS using lance‑spark, then read, write and add columns with lance‑ray to verify interoperability. After modifications, read back with lance‑spark to confirm changes.

Test Preparation

Compile a pylance wheel with HDFS support, skipping auditwheel.

Compile the lance_namespace_impls wheel to enable the hive2 namespace.

Run the provided Python test script.

Building pylance with HDFS support

Community code adds HDFS support; use the internal source and build with maturin:

cd lance/python
maturin build --release --auditwheel skip

The --auditwheel skip flag prevents non‑system dynamic libraries from being packaged, avoiding “could not load xxx.so” errors.

Compiling lance_namespace_impls for hive2

The default lance-namespace only provides DIR and REST. To use hive2 (or hive3, glue, unity) compile the implementation project:

cd lance-namespace-impls/python
make clean-python
make build-python

The wheel appears in lance-namespace-impls/python/dist, e.g. lance_namespace_impls-0.1.0-py3-none-any.whl. Install it:

pip install lance_namespace_impls-0.1.0-py3-none-any.whl

Installing Hive client dependencies

If the Hive metastore client is missing, install it:

pip install hive_metastore_client

Correct table identifier for read_lance

The table_id argument must not include the catalog name. Use:

["lance_test_database", "test_overwrite_table1"]

instead of ["lance", "lance_test_database", "test_overwrite_table1"].
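A quick sketch of the distinction: if you start from a fully qualified name (the three-part name below is illustrative), drop the catalog prefix before passing the rest as table_id:

```python
# read_lance expects table_id = [database, table], without the catalog.
full_name = "lance.lance_test_database.test_overwrite_table1"  # catalog.db.table (illustrative)
catalog, *table_id = full_name.split(".")
print(table_id)  # ['lance_test_database', 'test_overwrite_table1']
```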

Resolving native library loading error (rsmi_shut_down)

Run the script with the glibc loader's LD_DEBUG variable (e.g. `LD_DEBUG=libs` to trace library search, or `LD_DEBUG=symbols` for symbol resolution) to locate the library with the undefined symbol rsmi_shut_down, then rebuild the wheel with --auditwheel skip so the offending bundled library is no longer packaged.
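The LD_DEBUG trace can also be captured from a wrapper script. The sketch below (Linux/glibc only; the child command is a stand-in for the real failing script) collects the loader's diagnostics, which glibc writes to stderr:

```python
import os
import subprocess
import sys

# glibc's dynamic loader emits diagnostics on stderr when LD_DEBUG is set.
# "libs" traces the library search; "symbols"/"bindings" trace symbol lookup.
env = dict(os.environ, LD_DEBUG="libs")
proc = subprocess.run(
    [sys.executable, "-c", "import json"],  # stand-in for the failing import
    env=env, capture_output=True, text=True,
)
print(proc.stderr.splitlines()[:10])  # first few loader trace lines
```

On non-glibc platforms LD_DEBUG is ignored and the trace is simply empty.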

Using add_columns without conflicting arguments

add_columns accepts either a uri **or** the combination of namespace and table_id. Provide uri=None when using the namespace form:

add_columns(
    uri=None,
    namespace=namespace,
    table_id=["lance_test_database", "test_overwrite_table1"],
    transform=add_computed_column,
    concurrency=4
)

Complete test script

The script below demonstrates:

Initializing a local Ray cluster.

Registering the hive2 namespace implementation.

Connecting to the Hive metastore.

Reading a Lance dataset stored on HDFS.

Adding a computed column using a user‑defined function.

import lance_namespace as ln
import pyarrow as pa
import ray
from lance_ray import read_lance, add_columns

# Start a local Ray cluster
ray.init(num_cpus=4, ignore_reinit_error=True, include_dashboard=False)
print("Ray local cluster initialized")

# HDFS warehouse and Hive metastore address
warehouse_dir = "hdfs://ns1/user/platform_rd_two/zhb"
hms_address = "thrift://yjv-hivemetastoretest-001:9083"

# Register hive2 implementation and connect
ln.register_namespace_impl("hive2", "lance_namespace_impls.hive2.Hive2Namespace")
namespace = ln.connect("hive2", {"uri": hms_address, "root": warehouse_dir})

# Read the Lance table
ds = read_lance(
    namespace=namespace,
    table_id=["lance_test_database", "test_overwrite_table1"]
)
print(ds.schema())
ds.show()  # show() prints rows itself and returns None

# Define a column‑adding UDF
def add_computed_column(batch: pa.RecordBatch) -> pa.RecordBatch:
    df = batch.to_pandas()
    df["computed"] = 2 + df["id"]
    return pa.RecordBatch.from_pandas(df[["computed"]])

# Add the new column
add_columns(
    uri=None,
    namespace=namespace,
    table_id=["lance_test_database", "test_overwrite_table1"],
    transform=add_computed_column,
    concurrency=4
)