Tagged articles

Parquet

27 articles

May 11, 2026 · Databases

When Search Meets Rust: A Deep Dive into INFINI Pizza, the Next‑Gen Real‑Time Search Engine

This article examines INFINI Pizza, a distributed search database implemented in Rust, detailing its design philosophy, hierarchical data model, rolling‑partition‑shard architecture, shared‑nothing, io_uring‑based I/O stack, true real‑time indexing, in‑place partial updates, AI‑native hybrid search capabilities, ecosystem components, and a point‑by‑point comparison with Elasticsearch.

AI-native · Distributed · Parquet

20 min read

Past Memory Big Data

Mar 27, 2026 · Big Data

Why AI Workloads Require Rebuilding Parquet: A Deep Dive into Lance

The article explains how traditional Parquet‑based lakehouse architectures, optimized for large‑scale scans, struggle with AI workloads that need ultra‑low‑latency random access, and how Lance redesigns the storage format, indexing and write path to provide O(1) addressing, native vector support, and seamless integration with native execution engines.

AI workloads · Data Lake · Lance

12 min read

Big Data Technology Tribe

Mar 22, 2026 · Big Data

How Dremel Encodes Nested Data: Definition & Repetition Levels Explained

This article breaks down Dremel's columnar encoding for nested data, detailing the definition‑level and repetition‑level concepts, showing step‑by‑step examples of encoding and reconstructing JSON‑like schemas, and explaining the limits of single‑column reconstruction.

Apache Arrow · Columnar Storage · Dremel

9 min read

Data STUDIO

Dec 5, 2025 · Big Data

Why Parquet Is the Default Choice for Big Data Storage

The article explains how Apache Parquet’s columnar layout, multi‑level row‑group structure, projection and predicate push‑down, and advanced compression and encoding make it the high‑performance, space‑efficient storage format that powers modern big‑data ecosystems and tools like Spark, Python pandas, and ClickHouse.

Big Data · ClickHouse · Columnar Storage

11 min read

Data STUDIO

Nov 25, 2025 · Big Data

Why Parquet Is the Faster, Lighter, Safer Alternative to CSV in Python

The article explains why CSV becomes a bottleneck for large‑scale data, demonstrates how Parquet’s columnar, typed, and compressed format dramatically reduces storage, speeds up reads, and improves data safety, and provides step‑by‑step Python code for migrating and benchmarking the switch.

CSV · Data Engineering · DuckDB

18 min read

Data STUDIO

Nov 12, 2025 · Databases

7 Reusable DuckDB SQL Patterns for Fast Local Data Analysis

This article presents seven practical DuckDB SQL patterns—querying files directly, treating partition folders as tables, deduplicating with QUALIFY, computing rolling metrics with window functions, pivot/unpivot, handling JSON arrays, and exporting results to Parquet—plus tips and a mini case study that show how to turn a notebook into a lightweight OLAP engine without leaving the Python environment.

DuckDB · Parquet · Pivot

12 min read

Big Data Technology Tribe

Oct 18, 2025 · Databases

How Adaptive Structural Encoding Boosts Random Access in Columnar Storage

This article examines how adaptive structural encoding in columnar formats like Lance dramatically improves random‑access performance on NVMe storage, compares it with Apache Parquet and Arrow, and discusses the trade‑offs between scan speed, memory usage, and compression.

Columnar Storage · Lance · NVMe

17 min read

Baidu Geek Talk

Sep 24, 2025 · Big Data

How Feed Real‑Time Data Warehouse Was Re‑Engineered for Speed and Cost Savings

This article explains how Baidu’s Feed real‑time data warehouse was rebuilt using a pure streaming architecture, detailing the limitations of the previous stream‑batch design, the technical solutions—including core/non‑core data separation, metric calculation in streaming, and Parquet storage with Apache Arrow—and the resulting cost reductions, latency improvements, and future roadmap.

Apache Arrow · Batch Processing · Parquet

17 min read

Architect

Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data Warehouse · ETL · Parquet

21 min read

DataFunTalk

May 29, 2025 · Databases

Introducing DuckLake: An Integrated Data Lake and Catalog Format Powered by SQL

DuckDB's DuckLake is an open‑standard, SQL‑driven data lake and catalog format that simplifies lakehouse architecture by managing metadata in a database while storing data in scalable Parquet files, offering multi‑user collaboration, time‑travel queries, and MIT licensing.

Databases · DuckDB · Parquet

4 min read

360 Smart Cloud

May 23, 2024 · Big Data

Archer Engine: Integrating Inverted Index with Iceberg for Scalable Big Data Log Analytics

The article introduces Archer, a new big‑data warehouse engine built on Iceberg that adds an inverted‑index mechanism using Tantivy to provide full‑text and JSON search, storage‑compute separation, and significant performance gains over traditional Elasticsearch and Iceberg connectors.

Archer Engine · Big Data · Parquet

9 min read

Volcano Engine Developer Services

Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data · Data Lake · Iceberg

13 min read

Baidu Geek Talk

Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single column‑store wide table per business theme, achieving roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

Big Data · Data Warehouse · ETL

11 min read

Python Crawling & Data Mining

Mar 10, 2022 · Databases

Export MongoDB Data to CSV, Excel, JSON, Parquet with mongo2file – A Complete Guide

This article introduces the mongo2file library for exporting MongoDB collections to tabular file formats such as CSV, Excel, JSON, pickle, Feather, and Parquet, explains its PyArrow dependency, shows installation and quick‑start code, discusses performance bottlenecks, and provides a full API reference.

Excel · MongoDB · Parquet

11 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

An Overview of Apache Parquet: Architecture, Storage Model, and Comparison with ORC

This article provides a comprehensive introduction to Apache Parquet, covering its origins, columnar storage advantages, nested schema support, internal architecture, storage model components, comparison with ORC, and practical tools for inspecting Parquet files.

Columnar Storage · Hadoop · ORC Comparison

10 min read

Big Data Technology Architecture

Apr 5, 2021 · Big Data

Understanding Apache Iceberg: Table Format Architecture, Comparison with Hive Metastore, and Business Benefits

This article introduces Apache Iceberg as an open table format for massive analytic datasets, explains its underlying concepts such as schema, partitioning, statistics, and read/write APIs, compares it with Hive Metastore, outlines its ACID commit process, highlights the performance and operational advantages for big‑data workloads, and previews upcoming community features.

ACID · Apache Iceberg · Metadata

19 min read

Laravel Tech Community

Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct via HLL, and enhanced I/O connectors, including Numeric field support in SpannerIO, schema support in ParquetIO, Thrift support in KafkaTableProvider, and an option to skip key/value cloning in HadoopFormatIO, along with various other improvements.

Apache Beam · Batch · Big Data

3 min read

Big Data Technology & Architecture

Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster and explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Parquet · Performance Optimization · Scheduler

13 min read

Big Data Technology & Architecture

Nov 26, 2020 · Big Data

Understanding Apache Parquet: Architecture, Data Model, and Performance

This article provides a comprehensive overview of Apache Parquet, covering its modular architecture, nested data model, striping/assembly and definition‑level algorithms, file format details, push‑down optimizations, performance benchmarks, and the project's evolution within the big‑data ecosystem.

Columnar Storage · Parquet · Pushdown Optimization

18 min read

DataFunTalk

Oct 29, 2020 · Big Data

Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

AWS · Data Lake · ETL

12 min read

Architect

May 21, 2020 · Big Data

Parallel Execution of Multiple Spark Jobs to Optimize Resource Utilization and Reduce Parquet File Count

This article examines how to run several Spark jobs concurrently on a shared SparkContext, balancing full CPU‑vcore utilization with the need to generate fewer Parquet files, and presents practical experiments, scheduling strategies, and performance results.

Big Data · Job Scheduling · Parquet

12 min read

Big Data Technology Architecture

May 19, 2020 · Big Data

An Overview of Apache Parquet: Architecture, Features, and Comparison with ORC

Apache Parquet is a language‑agnostic, columnar storage format for the Hadoop ecosystem that offers high compression, efficient I/O through column and predicate push‑down, nested‑structure support, and a three‑layer architecture. The article also compares it with ORC and covers tooling for schema inspection.

Apache Hadoop · Columnar Storage · Data Formats

9 min read

Big Data Technology Architecture

Apr 24, 2020 · Big Data

Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture and the reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines, auto‑tuning, global dictionary, and fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Apache Kylin · Data Warehouse · Parquet

11 min read

21CTO

Nov 27, 2019 · Big Data

Choosing the Right File Format for Big Data: CSV, JSON, Parquet & Avro Explained

This article compares CSV, JSON, Parquet, and Avro file formats, outlining their structures, advantages, and drawbacks, and explains how Apache Spark supports each format for efficient big‑data storage and processing.

Apache Spark · Avro · CSV

8 min read

Big Data Technology & Architecture

Jun 10, 2019 · Big Data

Understanding Spark SQL: Origin, Features, and Columnar Storage

This article explains the evolution of Spark SQL from Shark, describes its key features such as SchemaRDD and in‑memory columnar storage, compares row‑based and column‑based storage, and provides practical Scala code examples for creating DataFrames and loading data from various sources.

Big Data · JDBC · Parquet

16 min read

Big Data Technology Architecture

Jun 9, 2019 · Big Data

An Introduction to Apache Parquet: Architecture, Data Model, File Format, and Basic Operations

This article provides a comprehensive overview of Apache Parquet, covering its purpose, architectural components, nested data model, file structure, practical Hive commands for creating and inspecting Parquet tables, and a brief introduction to the TPC‑DS benchmark for performance testing.

Columnar Storage · Hive · Parquet

8 min read

Qunar Tech Salon

Feb 26, 2017 · Big Data

Comparative Analysis of Big Data Storage and Query Solutions

This article reviews major big‑data storage and query architectures—including HBase, Dremel/Parquet, pre‑aggregation systems, Lucene, and the custom Tindex solution—evaluating their strengths, weaknesses, and suitability for real‑time, high‑volume analytical workloads.

Big Data · HBase · Lucene

20 min read