Tagged articles
26 articles
Mingyi World Elasticsearch
May 11, 2026 · Databases

When Search Meets Rust: A Deep Dive into INFINI Pizza, the Next‑Gen Real‑Time Search Engine

This article examines INFINI Pizza, a distributed search database implemented in Rust. It covers the design philosophy, hierarchical data model, rolling‑partition‑shard architecture, shared‑nothing I/O stack built on io_uring, true real‑time indexing, in‑place partial updates, AI‑native hybrid search capabilities, and ecosystem components, and closes with a point‑by‑point comparison with Elasticsearch.

AI-native · Distributed · Parquet
0 likes · 20 min read
Data STUDIO
Dec 5, 2025 · Big Data

Why Parquet Is the Default Choice for Big Data Storage

The article explains how Apache Parquet’s columnar layout, multi‑level row‑group structure, projection and predicate push‑down, and advanced compression and encoding make it the high‑performance, space‑efficient storage format that powers modern big‑data ecosystems and tools like Spark, Python pandas, and ClickHouse.

Big Data · ClickHouse · Columnar Storage
0 likes · 11 min read
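
The row-group layout and the two push-down ideas named in the summary above can be illustrated with a small sketch. This is a toy in-memory model in plain Python, not the Parquet file format or any Parquet library's API; `RowGroup`, `write_row_groups`, and `scan` are hypothetical names chosen for illustration, and the sample data is made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A horizontal slice of the table, stored column by column."""
    columns: Dict[str, list]

    def stats(self, name: str) -> Tuple[object, object]:
        # Min/max of one column chunk. Real Parquet files keep comparable
        # statistics in metadata; here they are simply computed on demand.
        values = self.columns[name]
        return min(values), max(values)


def write_row_groups(rows: List[dict], group_size: int) -> List[RowGroup]:
    """Split rows into groups of `group_size` and pivot each group into columns."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        groups.append(RowGroup(columns))
    return groups


def scan(groups: List[RowGroup], select: List[str], column: str, lower: int):
    """Return the selected columns for rows where `column >= lower`,
    plus the number of row groups skipped without reading their values."""
    out = []
    skipped = 0
    for group in groups:
        _, high = group.stats(column)
        if high < lower:
            # Predicate push-down: no row in this group can match.
            skipped += 1
            continue
        for i, value in enumerate(group.columns[column]):
            if value >= lower:
                # Projection push-down: only the requested columns are touched.
                out.append({name: group.columns[name][i] for name in select})
    return out, skipped


# Made-up sample data: 12 rows split into 3 row groups of 4 rows each.
rows = [{"id": i, "city": "c%d" % (i % 3), "amount": i * 10} for i in range(12)]
groups = write_row_groups(rows, group_size=4)
result, skipped = scan(groups, select=["id"], column="amount", lower=90)
print(result, skipped)
```

In this toy run the first two row groups are skipped purely from their max statistic, and the `city` column is never read, which is the intuition behind the storage and I/O savings the article attributes to Parquet.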
Data STUDIO
Nov 25, 2025 · Big Data

Why Parquet Is the Faster, Lighter, Safer Alternative to CSV in Python

The article explains why CSV becomes a bottleneck for large‑scale data, demonstrates how Parquet’s columnar, typed, and compressed format dramatically reduces storage, speeds up reads, and improves data safety, and provides step‑by‑step Python code for migrating and benchmarking the switch.

CSV · DuckDB · Parquet
0 likes · 18 min read
Data STUDIO
Nov 12, 2025 · Databases

7 Reusable DuckDB SQL Patterns for Fast Local Data Analysis

This article presents seven practical DuckDB SQL patterns—querying files directly, treating partition folders as tables, deduplicating with QUALIFY, computing rolling metrics with window functions, pivot/unpivot, handling JSON arrays, and exporting results to Parquet—plus tips and a mini case study that show how to turn a notebook into a lightweight OLAP engine without leaving the Python environment.

DuckDB · JSON · Parquet
0 likes · 12 min read
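
Two of the patterns listed in the summary above, deduplication and rolling metrics with window functions, can be sketched without installing anything. The example below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` module rather than DuckDB, and because SQLite has no `QUALIFY` clause, the deduplication filters on `ROW_NUMBER()` through a subquery instead. The `events` table and its rows are invented for illustration, and window functions require SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    # Invented sample rows; (1, 2) appears twice to give the dedup something to do.
    [(1, 1, 10), (1, 2, 20), (1, 2, 25), (1, 3, 30), (2, 1, 5)],
)

# Deduplicate: keep the row with the largest amount per (user_id, day).
# In DuckDB the outer SELECT could be replaced by a QUALIFY clause.
dedup_sql = """
SELECT user_id, day, amount
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id, day ORDER BY amount DESC
           ) AS rn
    FROM events
)
WHERE rn = 1
ORDER BY user_id, day
"""
deduped = conn.execute(dedup_sql).fetchall()

# Rolling metric: sum of the current and previous day's amount per user,
# computed over one row per (user_id, day).
rolling_sql = """
WITH daily AS (
    SELECT user_id, day, MAX(amount) AS amount
    FROM events
    GROUP BY user_id, day
)
SELECT user_id, day,
       SUM(amount) OVER (
           PARTITION BY user_id ORDER BY day
           ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
       ) AS rolling_sum
FROM daily
ORDER BY user_id, day
"""
rolling = conn.execute(rolling_sql).fetchall()

print(deduped)
print(rolling)
```

The window definitions (`PARTITION BY`, `ORDER BY`, `ROWS BETWEEN`) are standard SQL, so the same shape should carry over to DuckDB with little change.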
Baidu Geek Talk
Sep 24, 2025 · Big Data

How Feed Real‑Time Data Warehouse Was Re‑Engineered for Speed and Cost Savings

This article explains how Baidu’s Feed real‑time data warehouse was rebuilt using a pure streaming architecture, detailing the limitations of the previous stream‑batch design, the technical solutions—including core/non‑core data separation, metric calculation in streaming, and Parquet storage with Apache Arrow—and the resulting cost reductions, latency improvements, and future roadmap.

Apache Arrow · Batch Processing · Parquet
0 likes · 17 min read
Architect
Jul 7, 2025 · Big Data

How Baidu’s New Search Data Warehouse Architecture Boosts Performance by 5×

This article explains how Baidu’s search data team redesigned its data warehouse with wide‑table modeling, Parquet columnar storage, and a Spark‑ClickHouse fusion engine, eliminating redundancy, cutting query latency from minutes to seconds, and enabling self‑service analytics for thousands of users.

Data Warehouse · ETL · Parquet
0 likes · 21 min read
Volcano Engine Developer Services
Jun 20, 2022 · Big Data

How ByteDance Scaled Feature Storage with Iceberg and Parquet: A Big Data Case Study

ByteDance tackled massive feature‑storage challenges by replacing row‑based HDFS files with columnar Parquet and the Iceberg table format, enabling schema evolution, selective reads, efficient backfill, and training optimizations that cut storage costs by over 40% and reduced CPU and network I/O dramatically.

Big Data · Data Lake · Iceberg
0 likes · 13 min read
Baidu Geek Talk
Jun 15, 2022 · Big Data

Replacing Classic Data Warehouse with a One‑Layer Wide Table Model: Architecture, Benefits, and Challenges

The article proposes replacing the traditional multi‑layered data‑warehouse architecture (ODS‑DWD‑DWS‑ADS) with a single, column‑store wide table per business theme, achieving roughly 30% storage savings and faster queries, while acknowledging higher ETL complexity, back‑tracking costs, and production timing challenges.

Big Data · Data Warehouse · ETL
0 likes · 11 min read
Big Data Technology Architecture
Apr 5, 2021 · Big Data

Understanding Apache Iceberg: Table Format Architecture, Comparison with Hive Metastore, and Business Benefits

This article introduces Apache Iceberg as an open table format for massive analytic datasets, explains its underlying concepts such as schema, partitioning, statistics, and read/write APIs, compares it with Hive Metastore, outlines its ACID commit process, highlights the performance and operational advantages for big‑data workloads, and previews upcoming community features.

ACID · Apache Iceberg · Parquet
0 likes · 19 min read
Laravel Tech Community
Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct via HLL, enhanced I/O connectors (SpannerIO support for Numeric fields, ParquetIO schema support, Thrift support in KafkaTableProvider, and an option to skip key/value cloning in HadoopFormatIO), and various other improvements.

Apache Beam · Batch · Big Data
0 likes · 3 min read
Big Data Technology & Architecture
Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster and explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Parquet · Performance Optimization · Scheduler
0 likes · 13 min read
DataFunTalk
Oct 29, 2020 · Big Data

Building a Large-Scale Near Real-Time Data Analytics Platform at Lyft Using Apache Flink

Lyft transformed its legacy data pipeline by designing a cloud‑native, Flink‑based near real‑time analytics platform that ingests billions of events, writes Parquet files to S3, leverages Presto for interactive queries, and implements multi‑stage non‑blocking ETL, fault‑tolerant back‑fill, and extensive performance optimizations.

AWS · Data Lake · ETL
0 likes · 12 min read
Big Data Technology Architecture
May 19, 2020 · Big Data

An Overview of Apache Parquet: Architecture, Features, and Comparison with ORC

The article describes Apache Parquet, a language‑agnostic columnar storage format for the Hadoop ecosystem that offers high compression, efficient I/O through column and predicate push‑down, nested‑structure support, and a three‑layer architecture. It also compares Parquet with ORC and covers tooling for schema inspection.

Apache Hadoop · Columnar Storage · Data Formats
0 likes · 9 min read
Big Data Technology Architecture
Apr 24, 2020 · Big Data

Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture, reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines, auto‑tuning, global dictionary, fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Apache Kylin · Data Warehouse · Parquet
0 likes · 11 min read
Qunar Tech Salon
Feb 26, 2017 · Big Data

Comparative Analysis of Big Data Storage and Query Solutions

This article reviews major big‑data storage and query architectures—including HBase, Dremel/Parquet, pre‑aggregation systems, Lucene, and the custom Tindex solution—evaluating their strengths, weaknesses, and suitability for real‑time, high‑volume analytical workloads.

Big Data · HBase · Parquet
0 likes · 20 min read