Tagged articles

Apache Arrow

20 articles · Page 1 of 1

May 25, 2026 · Big Data

Polars vs Pandas: Is Switching Worth It for Ten‑Million‑Row Datasets?

The article shows that Polars, a query‑compiling DataFrame library, can run GroupBy workloads on ten‑million‑row datasets 6‑10× faster than Pandas. It explains the underlying query optimizer, Arrow columnar engine, and Rust parallelism, and provides a 20‑item syntax map, three real migration scenarios, streaming for larger‑than‑memory data, AI‑pipeline use cases, and a step‑by‑step migration guide.

Apache Arrow · DataFrames · Lazy Evaluation

0 likes · 21 min read

Big Data Technology Tribe

Mar 22, 2026 · Big Data

How Dremel Encodes Nested Data: Definition & Repetition Levels Explained

This article breaks down Dremel's columnar encoding for nested data, detailing the definition‑level and repetition‑level concepts, showing step‑by‑step examples of encoding and reconstructing JSON‑like schemas, and explaining the limits of single‑column reconstruction.

Apache Arrow · Columnar Storage · Dremel

0 likes · 9 min read
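
To make the definition-level and repetition-level idea from the summary above concrete, here is a small sketch for a single column. The schema is an assumption chosen for illustration, not taken from the article: an optional group `links` containing a repeated string field `urls`, so the column `links.urls` has a maximum definition level of 2 and a maximum repetition level of 1.

```python
from typing import Optional

# One column entry: (value, repetition level, definition level).
Entry = tuple[Optional[str], int, int]


def encode(records: list[dict]) -> list[Entry]:
    """Flatten the hypothetical links.urls column into Dremel-style entries."""
    column: list[Entry] = []
    for record in records:
        links = record.get("links")
        if links is None:
            # Nothing on the path is defined: definition level 0.
            column.append((None, 0, 0))
        elif not links["urls"]:
            # links is present but urls is empty: definition level 1.
            column.append((None, 0, 1))
        else:
            for i, url in enumerate(links["urls"]):
                # r=0 starts a new record; r=1 repeats within urls.
                column.append((url, 0 if i == 0 else 1, 2))
    return column


def decode(column: list[Entry]) -> list[dict]:
    """Rebuild the records from the column entries alone."""
    records: list[dict] = []
    for value, rep, definition in column:
        if rep == 0:
            # A repetition level of 0 always marks a record boundary.
            records.append({} if definition == 0 else {"links": {"urls": []}})
        if definition == 2:
            records[-1]["links"]["urls"].append(value)
    return records


records = [{"links": {"urls": ["a", "b"]}}, {"links": {"urls": []}}, {}]
column = encode(records)
print(column)
print(decode(column) == records)
```

The definition level records how much of the path exists when the value is missing, and the repetition level records where a new list element or record starts; together they are enough to rebuild this one column without any other data.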

Big Data Technology Tribe

Feb 27, 2026 · Fundamentals

What Is pyarrow.Schema and How to Use It?

pyarrow.Schema is the Python representation of an Arrow table schema, describing column names, types, nullability, and other metadata, and it is essential for defining, inspecting, serializing, and interfacing data structures across libraries like Pandas, Polars, and Arrow‑based query engines.

Apache Arrow · Data Structures · PyArrow

0 likes · 4 min read

Baidu Geek Talk

Sep 24, 2025 · Big Data

How Feed Real‑Time Data Warehouse Was Re‑Engineered for Speed and Cost Savings

This article explains how Baidu’s Feed real‑time data warehouse was rebuilt using a pure streaming architecture, detailing the limitations of the previous stream‑batch design, the technical solutions—including core/non‑core data separation, metric calculation in streaming, and Parquet storage with Apache Arrow—and the resulting cost reductions, latency improvements, and future roadmap.

Apache Arrow · Batch Processing · Parquet

0 likes · 17 min read

360 Tech Engineering

Oct 17, 2024 · Databases

Introducing DataFusion: A High‑Performance Rust‑Based Query Engine Powered by Apache Arrow

This article explains DataFusion, a Rust‑written, Arrow‑based query engine that offers high performance, extensibility, and seamless integration with various data sources, detailing its architecture, execution model, Rust advantages, and practical usage examples for building modern data‑warehouse solutions.

Apache Arrow · Data Warehouse · DataFusion

0 likes · 15 min read

360 Zhihui Cloud Developer

Sep 9, 2024 · Big Data

Why DataFusion is Revolutionizing Big Data Queries with Rust and Arrow

This article introduces DataFusion, a high‑performance, Rust‑based query engine that leverages Apache Arrow’s columnar memory format to enable fast, extensible data processing across multiple storage formats and cloud sources, explains its architecture, execution model, and provides practical Rust code examples for custom extensions.

Apache Arrow · Big Data · DataFusion

0 likes · 16 min read

Python Programming Learning Circle

Aug 13, 2024 · Big Data

What’s New in pandas 2.0: Arrow Backend, Copy‑On‑Write, and Performance Improvements

The article reviews pandas 2.0’s major upgrades, including an Apache Arrow backend that speeds up CSV reads by over 30×, new Arrow dtypes, nullable NumPy dtypes for missing values, a copy‑on‑write memory model, and optional dependencies. It also presents benchmark comparisons with ydata‑profiling, highlighting the library’s improved performance, flexibility, and interoperability for data‑intensive Python workflows.

Apache Arrow · Copy-on-Write · Pandas

0 likes · 15 min read

DataFunSummit

Jun 21, 2024 · Big Data

Building a Complete Data System with Apache Arrow: Architecture, Dynamic Schema Modeling, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes dynamic read‑time modeling, outlines the system’s execution flow, storage and indexing strategies, and shares practical tips and extensions for building scalable big‑data solutions.

Acero · Apache Arrow · Big Data

0 likes · 20 min read

DataFunSummit

Apr 23, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Implementation, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow’s columnar in‑memory format and its zero‑copy advantages, describes how to model data at read time, outlines the execution flow with Acero and SQL planning, and shares practical tips and extensions for building robust, dynamic‑schema data platforms.

Acero · Apache Arrow · Big Data

0 likes · 20 min read

Sohu Tech Products

Mar 6, 2024 · Big Data

Building Data Systems with Apache Arrow: Architecture, Memory Format, and Execution

The article explains how Apache Arrow’s columnar, cross‑language in‑memory format enables high‑performance, interoperable data systems—replacing traditional row‑oriented databases—by supporting dynamic schemas, zero‑copy data exchange, efficient indexing, Acero‑based query execution, and Flight/ADBC connectivity, while offering practical guidance and highlighting challenges.

Apache Arrow · Big Data · Columnar Storage

0 likes · 20 min read

DataFunTalk

Feb 28, 2024 · Big Data

Building a Data System with Apache Arrow: Design, Modeling, and Execution

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes read‑time modeling and dynamic schema handling, and shows how Arrow can be used to build a complete data processing pipeline with indexing, SQL planning, and zero‑copy data exchange.

Apache Arrow · Big Data · Columnar Storage

0 likes · 20 min read

Sohu Tech Products

Jan 24, 2024 · Databases

Optimizing Database Expression Evaluation with JIT Technology Using Gandiva

The article explains how database expression evaluation—especially in WHERE and SELECT clauses—can be dramatically accelerated by replacing interpreted AST traversal with Just‑In‑Time compilation using Apache Gandiva, which leverages LLVM to generate SIMD‑optimized machine code for Arrow columnar data, and discusses extensions such as timestamp, array, higher‑order functions, and UDF support.

Apache Arrow · Apache Gandiva · Expression Evaluation

0 likes · 17 min read

DataFunTalk

Jan 15, 2024 · Databases

Optimizing Database Expression Evaluation with JIT Compilation Using Gandiva

This article explains how Just‑In‑Time (JIT) compilation, particularly via the Gandiva expression compiler built on LLVM and Apache Arrow, can dramatically accelerate database expression evaluation by transforming abstract syntax trees into native vectorized code, addressing traditional interpretation bottlenecks and improving CPU‑bound query performance.

Apache Arrow · Expression Evaluation · Gandiva

0 likes · 17 min read

DataFunSummit

Dec 9, 2023 · Databases

Interview with Wu Li on Columnar Storage, JIT Compilation, and Push Mode in Database Development

The article presents an interview with Wu Li, a senior R&D engineer at Shanghai Yanhuang Data, discussing how columnar storage, JIT compilation, and push‑mode execution are reshaping database performance in the era of big‑data analytics and evolving hardware constraints.

Apache Arrow · Columnar Storage · Databases

0 likes · 10 min read

DataFunTalk

Dec 8, 2023 · Databases

Interview with Wu Li on Database Evolution: Columnar Storage, JIT Compilation, and Push Mode

The article presents an interview with Wu Li, a research engineer at Shanghai Yanhuang Data, discussing how hardware limits have driven database evolution toward columnar storage, the adoption of Apache Arrow and Gandiva for SIMD‑enabled JIT compilation, and the shift from pull to push processing modes to improve OLAP performance.

Apache Arrow · Gandiva · OLAP

0 likes · 10 min read

DataFunSummit

Oct 24, 2023 · Big Data

Using Apache Arrow to Quickly Build Modern Data Systems

This announcement introduces Li Chenxi, a big‑data R&D engineer, and outlines his talk on leveraging Apache Arrow’s columnar in‑memory format to efficiently construct modern, read‑time modeling data systems, highlighting key features, ecosystem, and practical implementation benefits for the audience.

Apache Arrow · Big Data · Columnar Memory

0 likes · 2 min read

phodal

Oct 31, 2022 · Frontend Development

How Perspective Delivers Real‑Time Financial Visualizations with WASM and Apache Arrow

The article examines the open‑source Perspective library, which originated at J.P. Morgan and is hosted by FINOS, and its multi‑language architecture that combines C++/WASM, Rust, JavaScript, and Python to deliver high‑performance, framework‑free data visualizations for real‑time financial analytics, highlighting its use of Apache Arrow, Web Components, and Jupyter integration.

Apache Arrow · C# · Data Visualization

0 likes · 8 min read

DataFunSummit

Nov 20, 2021 · Artificial Intelligence

Design Dimensions of Next‑Generation AI Platforms: Programming Languages, Runtime Environments, and Model Deployment

The article examines three key design dimensions of modern AI platforms—choice of programming language, runtime environment isolation, and model deployment—highlighting how Python’s dominance, container‑based resource management, and efficient data sharing shape platform architecture and performance.

AI Platforms · Apache Arrow · Model Deployment

0 likes · 13 min read

Big Data Technology Architecture

Aug 8, 2020 · Big Data

Performance Comparison of SparkR with Vectorized Execution Using Apache Arrow

This article compares SparkR’s performance with that of the native Spark APIs, shows the slowdown caused by JVM‑R serialization, and demonstrates how enabling Apache Arrow’s vectorized execution in Spark 3.0 can make SparkR operations up to several dozen times faster.

Apache Arrow · SparkR · Vectorized Execution

0 likes · 7 min read

Laravel Tech Community

Aug 1, 2020 · Big Data

Apache Arrow 1.0.0 Released with New Columnar Format Features

Apache Arrow 1.0.0, the 18th major release, introduces binary‑stable columnar format changes, new metadata version V5, unsigned dictionary indices, a Feature enum, optional LZ4/ZStandard buffer compression, expanded decimal bitWidth support, removal of validity bitmaps from union types, and broader language bindings, enhancing big‑data analytics performance.

Apache Arrow · Data Interoperability · columnar format

0 likes · 3 min read