Tagged articles
232 articles
dbaplus Community
May 20, 2026 · Databases

Stunning SQL Queries: From Tetris Game to Real‑Time Funnels

This article showcases a collection of impressive SQL queries—including a PostgreSQL Tetris implemented with a recursive CTE, window‑function session analysis, a ClickHouse real‑time funnel, dynamic WHERE clause generation, and a recursive employee hierarchy—while discussing performance tips and engine choices.

clickhouse · data-warehouse · hive
0 likes · 25 min read
Architect-Kip
Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Flink · Spark · data archiving
0 likes · 13 min read
Big Data Tech Team
Jan 5, 2026 · Big Data

Top 10 Data Warehouse Interview Questions Every 2026 Engineer Must Master

This article compiles the most frequently asked interview questions for 2026 data‑warehouse development engineers, covering core concepts, layer architecture, SQL optimization, window functions, Hive vs Spark, data skew solutions, modeling metrics, slowly changing dimensions, scheduling tools, data quality monitoring, and real project experience.

SQL Optimization · Spark · data modeling
0 likes · 8 min read
Big Data Tech Team
Dec 28, 2025 · Big Data

When to Use Hive Partitioning vs Bucketing: A Practical Guide

This article explains Hive's partitioning and bucketing techniques, compares their purposes, advantages, and pitfalls, and shows how to combine them with concrete SQL examples to improve query performance, reduce I/O, and optimize joins and sampling in large data warehouses.

Bucketing · Partitioning · data-warehouse
0 likes · 7 min read
JD Tech
Dec 24, 2025 · Databases

How to Eliminate 30‑Minute Master‑Slave Lag in High‑Volume Inventory Systems

This article analyzes why a warehouse management system’s master‑slave database replication lagged up to 30 minutes during nightly inventory snapshot generation, evaluates several mitigation strategies, and details the chosen big‑data‑driven solution that moved snapshots to Elasticsearch, reducing lag and disk usage.

Database Replication · Elasticsearch · hive
0 likes · 8 min read
Big Data Tech Team
Oct 10, 2025 · Big Data

12 Essential Hive SQL Optimization Tricks to Boost Query Performance

This article presents twelve practical Hive SQL tuning techniques—ranging from avoiding COUNT(DISTINCT) to configuring parallel execution, reducer settings, and strict mode—to help data engineers reduce data skew, eliminate small files, improve resource utilization, and significantly accelerate query execution in large‑scale data warehouse environments.

SQL Optimization · data-warehouse · hive
0 likes · 11 min read
Huolala Tech
Sep 26, 2025 · Big Data

How We Migrated 40 PB of Hive Data Across Clouds with Zero Downtime

This article details the end‑to‑end design, challenges, and implementation of a cross‑cloud migration of over 200 k Hive tables and nearly 40 PB of data using the self‑developed Kirk service, covering architecture, verification steps, and lessons learned to achieve 100 % data consistency without impacting production services.

Big Data · Data Consistency · Data Migration
0 likes · 20 min read
Big Data Tech Team
Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink
0 likes · 4 min read
Big Data Tech Team
Jul 17, 2025 · Big Data

Master Hive SQL: 10 Advanced Use Cases & Performance Optimizations for Hive 3.x

This article presents ten practical Hive SQL advanced scenarios—including session segmentation, funnel conversion, median calculation, array explosion, hierarchical recursion, deduplication, small‑file merging, conditional aggregation, approximate statistics, and data‑quality checks—each with full SQL code, key technical points, and optimization tips for Hive 3.x.

data-warehouse · hive · optimization
0 likes · 9 min read
Big Data Tech Team
Apr 27, 2025 · Big Data

10 Advanced Hive SQL Use Cases: Windows, Skew, JSON, and More

This article presents ten practical Hive SQL scenarios—including window functions for ranking, LAG for time‑interval analysis, random‑salt techniques to mitigate data skew, dynamic partition writes, JSON parsing with UDFs, retention calculations, consecutive‑login detection, regex‑based path analysis, CUBE multi‑dimensional aggregation, and ORC storage optimizations—each accompanied by optimization tips and complete code examples.

data-warehouse · hive · performance optimization
0 likes · 9 min read
macrozheng
Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big Data · Canal · Data Transfer Service
0 likes · 20 min read
Ma Wei Says
Mar 9, 2025 · Big Data

Mastering DWD Layer Design: Principles, Fact Tables, and Performance Tips

This article provides a comprehensive guide to designing the Data Warehouse Detail (DWD) layer, covering Kimball‑based design principles, step‑by‑step modeling, table and field naming conventions, concrete Hive DDL/DML examples, and optimization techniques such as partitioning, bucketing, and compression.

Big Data · DWD · Fact Table
0 likes · 21 min read
Airbnb Technology Team
Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

Chronon · Spark · feature engineering
0 likes · 10 min read
dbaplus Community
Jan 19, 2025 · Big Data

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

Spark · hive · optimization
0 likes · 14 min read
JD Tech
Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Spark · data engineering · hive
0 likes · 14 min read
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · MapReduce · Small Files
0 likes · 23 min read
Su San Talks Tech
Dec 8, 2024 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive datasets, introduces Elasticsearch's inverted‑index architecture, and details a practical pipeline using Hive, wide tables, binlog, Canal, and Otter to achieve near real‑time indexing for petabyte‑level data.

Canal · Otter · data pipeline
0 likes · 19 min read
Baidu Tech Salon
Nov 20, 2024 · Big Data

Optimizing Multi‑Dimensional User Count Computation in Feed Using Data Tagging

By deduplicating logs and assigning compact numeric tags to each user‑dimension combination, the data‑tagging method replaces costly lateral‑view expansions with a user‑level aggregation, cutting shuffle volume from terabytes to gigabytes and reducing runtime from 49 minutes to 14 minutes, enabling scalable multi‑dimensional user‑count analysis for Baidu Feed.

SQL Optimization · data tagging · dimensional aggregation
0 likes · 14 min read
Shopee Tech Team
Oct 25, 2024 · Big Data

StarRocks at Shopee: Practical Use Cases and Performance Analysis

Shopee’s deployment of StarRocks across DataService, DataGo, and DataStudio demonstrates that its vectorized engine, cost‑based optimizer, and materialized‑view caching can query Hive, Iceberg, Delta Lake and Hudi up to 20,000× faster than Presto, cutting CPU usage and delivering consistently lower latency for complex analytics.

Data Lake · MPP · Presto
0 likes · 11 min read
Architect
Jul 18, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments

This article details the architecture, design principles, data preparation methods, verification processes, and error‑handling strategies of ZuanZuan's payment reconciliation system, highlighting how large‑scale data, binlog ingestion, Hive archiving, and MQ‑based workflows ensure accurate and secure financial settlements.

Backend Architecture · MQ · Reconciliation
0 likes · 11 min read
Zhuanzhuan Tech
May 23, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments

This article details the background, architecture, data preparation methods, massive‑data handling strategies, verification processes, and error‑handling mechanisms of ZuanZuan's channel reconciliation system, highlighting design choices such as binlog ingestion, task‑driven bill downloads, sharding with Hive archiving, and MQ‑based reconciliation to ensure financial data consistency and safety.

MQ · Reconciliation · data pipeline
0 likes · 11 min read
Alibaba Cloud Developer
Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · MaxCompute
0 likes · 23 min read
Sohu Tech Products
Apr 24, 2024 · Big Data

How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior

This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.

Big Data · Data visualization · Retention Analysis
0 likes · 19 min read
vivo Internet Technology
Apr 17, 2024 · Big Data

Retention Analysis Model Practice Based on ClickHouse

The article explains retention analysis models, their importance for user loyalty, outlines offline Hive architecture, then shows how ClickHouse’s retention() function and columnar storage dramatically speed up multi‑day retention calculations, providing SQL examples and practical guidance for product analytics.

Retention Analysis · SQL Optimization · clickhouse
0 likes · 17 min read
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration
0 likes · 14 min read
iQIYI Technical Product Team
Mar 8, 2024 · Big Data

Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation

iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.

Data Lake · Iceberg · hive
0 likes · 20 min read
Weimob Technology Center
Jan 2, 2024 · Big Data

How to Efficiently Test BI Reports in a Hive‑StarRocks Data Warehouse

This article details practical methods for testing BI reports built on Hive and StarRocks, covering the report creation workflow, testing characteristics, SQL writing techniques, impact analysis, data warehouse simplification, and the application of data quality tools to ensure accurate and efficient reporting.

BI testing · Data Quality · StarRocks
0 likes · 9 min read
DataFunTalk
Dec 27, 2023 · Big Data

Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive, detailing the architectural challenges, the Mixed Hive design, implementation steps, performance optimizations, community contributions, and future roadmap to achieve a unified lakehouse with minute‑level freshness and reduced development and operational costs.

Amoro · Big Data · Flink
0 likes · 12 min read
Selected Java Interview Questions
Nov 5, 2023 · Backend Development

Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders

This article presents a comprehensive design of a distributed reconciliation system that handles tens of millions of daily payment orders by using a six‑module architecture, Kafka for decoupled state transitions, Hive for large‑scale data processing, and Java‑based plug‑in patterns to achieve six‑nines (99.9999%) accuracy and significant operational cost savings.

Big Data · Distributed Systems · Kafka
0 likes · 15 min read
Zhengcaiyun Technology
Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture · Cluster Deployment · Distributed Systems
0 likes · 19 min read
JD Retail Technology
Aug 21, 2023 · Artificial Intelligence

ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios

This article examines how ChatGPT-4, as an advanced natural‑language‑processing model, can streamline data analysis tasks—from generating Hive table definitions and sample data to crafting complex HiveSQL queries, visualizing results, and implementing ClickHouse and Flink solutions—thereby improving efficiency, insight, and problem‑solving in big‑data environments.

Artificial Intelligence · Big Data · ChatGPT-4
0 likes · 7 min read
JD Tech
Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Shuffle · Spark
0 likes · 17 min read
Data Thinking Notes
May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Small Files · Spark · hive
0 likes · 10 min read
Big Data Technology & Architecture
May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Small Files · Spark
0 likes · 9 min read
Zhengcaiyun Technology
Apr 18, 2023 · Big Data

Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance by collecting storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimating monthly expenses using replication factors and resource‑usage rates, while providing practical SQL and shell examples.

Data Cost Governance · HDFS · Spark
0 likes · 18 min read
JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
0 likes · 15 min read
ITPUB
Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Flink
0 likes · 11 min read
DataFunSummit
Mar 20, 2023 · Backend Development

Unified UDF Implementation on Cloud Platform: Architecture, Features, and Open‑Source Contributions

This article introduces a unified User‑Defined Function (UDF) solution on a cloud data platform, detailing its remote execution architecture, compatibility with Hive UDFs, resource isolation, hot‑update capabilities, internal platform implementation, open‑source contributions to PrestoDB, and future work plans.

Presto · Serverless · UDF
0 likes · 11 min read
Bilibili Tech
Mar 10, 2023 · Information Security

Data Security Construction in Berserker Platform

The article outlines Berserker’s comprehensive data‑security framework—built on the CIA triad and 5A methodology—that unifies authentication, authorization, access control, asset protection, and auditing across Hive, Kafka, ClickHouse and ETL tasks, describes the migration from version 1.0 to 2.0 with a redesigned permission system, workspaces, Casbin performance tweaks, and previews future fine‑grained, lifecycle‑wide security enhancements.

Authentication · Authorization · Berserker platform
0 likes · 15 min read
Su San Talks Tech
Feb 27, 2023 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains how to construct near real-time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, introducing ES fundamentals, leveraging Hive and wide tables, and employing binlog‑based tools like Canal and Otter for low‑latency data synchronization.

Canal · Elasticsearch · Otter
0 likes · 22 min read

Data Task Optimization Techniques and Practices

The article surveys unconventional offline data‑task optimizations—such as distribution‑by, seeded random shuffling, explode‑based skew mitigation, hash bucketing, task‑parallelism tuning, and multi‑insert materialization—organized by point, line, and surface perspectives, and stresses that effective performance gains require both technical tricks and business‑driven pipeline adjustments.

SQL Tuning · data optimization · distributed computing
0 likes · 16 min read
Java High-Performance Architecture
Jan 5, 2023 · Databases

Scaling Billions of Orders: MySQL Sharding, ES & Hive Strategies

This article explains how to handle massive order volumes by classifying data into hot and cold tiers, storing them in MySQL, Elasticsearch, and Hive, and implementing sharding and partitioning strategies—including shard keys, modulo routing, and combined database‑table distribution—to achieve high throughput and low cost.

Elasticsearch · database scaling · hive
0 likes · 8 min read
Big Data Technology & Architecture
Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch Processing · Big Data · Flink
0 likes · 19 min read
Architecture Digest
Jan 2, 2023 · Databases

Database Sharding and Partitioning Strategy for High‑Volume Order Systems

This article explains how to classify massive order data into hot and cold segments, store them in MySQL, Elasticsearch and Hive respectively, and implement sharding and partitioning at both table and database levels using modulo and hash calculations to achieve scalable performance for billions of orders.

Partitioning · architecture · hive
0 likes · 8 min read
Architect
Dec 30, 2022 · Databases

Database Sharding and Partitioning Strategy for High‑Volume Order Systems

The article explains how to handle billions of daily orders by classifying data into hot and cold segments, storing them in MySQL, Elasticsearch, and Hive, and applying sharding and partitioning techniques at both table and database levels to achieve scalable performance.

Data Partitioning · Elasticsearch · database sharding
0 likes · 9 min read
Ziru Technology
Dec 16, 2022 · Big Data

How to Effectively Test Offline Data Metrics and Data Warehouse Pipelines

This article explains what data metrics are, compares offline metric testing with traditional testing, and provides a comprehensive step‑by‑step guide for testing data collection, ETL, warehouse models, metric calculations, scheduling, security, and API outputs in a Hive‑based data warehouse.

ETL · data validation · data-warehouse
0 likes · 9 min read
Big Data Technology & Architecture
Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Batch Processing · Big Data · Data Lake
0 likes · 13 min read
DeWu Technology
Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms—from source tables through SQL processing to final storage—enabling management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification, with implementations ranging from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

AST · Big Data · Data Lineage
0 likes · 7 min read
Data Thinking Notes
Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · RDS
0 likes · 5 min read
vivo Internet Technology
Nov 16, 2022 · Big Data

Vivo Hawking A/B Experiment Platform: Architecture, Practices, and Solutions

The Vivo Hawking platform provides a company‑wide, one‑stop A/B testing solution with a layered architecture, covariate‑balanced split algorithms, real‑time monitoring, and unified SDKs for Android, Java and H5, enabling thousands of daily experiments, automated analysis, and rapid product iteration across multiple departments.

Covariate balancing · Experiment Platform · Java SDK
0 likes · 22 min read
dbaplus Community
Oct 30, 2022 · Big Data

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

ETL · Modeling · Spark
0 likes · 38 min read
Bilibili Tech
Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers, enabling fast exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili's user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · Data Structures
0 likes · 16 min read
DataFunSummit
Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP
0 likes · 10 min read
DataFunTalk
Sep 15, 2022 · Big Data

Bilibili Offline Platform: Migration from Hive to Spark and Large‑Scale Optimizations

This article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing the migration process, automated SQL conversion, result verification, stability and performance enhancements, meta‑store optimizations, and future work on remote shuffle and vectorized execution.

Data Skipping · MetaStore · Shuffle
0 likes · 28 min read
DaTaobao Tech
Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · ODPS
0 likes · 23 min read
DataFunTalk
Aug 14, 2022 · Big Data

NetEase Yanxuan DMP Tag System Construction Practice

This article details NetEase Yanxuan’s DMP tag system, covering its platform overview, tag production workflow, storage architecture, high‑performance query techniques, and future plans, illustrating how data from multiple sources is processed through ODS, DWD, DM layers and leveraged via Spark, Hive, and Apache Doris for real‑time and offline analytics.

Apache Doris · DMP · Real-time Query
0 likes · 11 min read
ITPUB
Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · MetaStore
0 likes · 31 min read
ITPUB
Jul 23, 2022 · Information Security

How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

HDFS · Presto · Spark
0 likes · 16 min read
Bilibili Tech
Jul 22, 2022 · Information Security

Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data‑masking, extending support to Spark and Presto, and planning incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

HDFS · Presto · Spark
0 likes · 15 min read
Big Data Technology Architecture
Jun 8, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Comprehensive Performance Optimizations

The article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing migration tools, SQL conversion, result and resource comparison, shuffle stability, small‑file handling, runtime filters, data skipping, ZSTD support, Hive Metastore federation, traffic control, and future optimization directions.

Data Migration · Resource Management · Spark
0 likes · 29 min read
Bilibili Tech
May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using an automated SQL rewrite and dual‑run verification, cutting execution time over 40% and resource use 30%, while introducing small‑file merging, shuffle stability, runtime filters, data‑skipping, lineage tracking, auto‑parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Spark · data engineering
0 likes · 30 min read
Big Data Technology & Architecture
May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write
0 likes · 43 min read
ByteDance Data Platform
May 11, 2022 · Big Data

How to Build a High‑Performance SparkSQL Server with Hive JDBC Compatibility

This article explains how to design and implement a SparkSQL server that lowers usage barriers and boosts efficiency by supporting standard JDBC interfaces, integrating Hive Server2 protocols, handling multi‑tenant authentication, managing Spark job lifecycles, and providing high‑availability through Zookeeper coordination.

JDBC · Server Architecture · SparkSQL
0 likes · 15 min read
Big Data Technology & Architecture
Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data · Catalog · Flink
0 likes · 18 min read
Zuoyebang Tech Team
Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how Zuoyebang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations like DPP and Z‑ordering.

Big Data · Delta Lake · Presto
0 likes · 15 min read
ByteDance Data Platform
Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · ETL · SparkSQL
0 likes · 19 min read
DataFunTalk
Feb 15, 2022 · Big Data

SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration

The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.

Big Data · Data Integration · Data Platform
0 likes · 15 min read
IT Architects Alliance
Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Backend Architecture · Big Data · Spark
0 likes · 32 min read
IT Xianyu
Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Installation
0 likes · 6 min read
Big Data Technology & Architecture
Dec 28, 2021 · Big Data

Comprehensive Guide to Spark SQL: Concepts, DataSet/DataFrame, Functions, Optimization and Common Pitfalls

This article provides an in‑depth overview of Spark SQL, covering its architecture, DataSet/DataFrame creation, DSL and SQL usage, integration with Hive, custom UDF/UDAF/Aggregator implementations, handling of small files, Cartesian product detection, and a catalog of useful built‑in functions and window operations.

Big Data · Dataset · Spark SQL
0 likes · 29 min read
Big Data Technology & Architecture
Dec 18, 2021 · Big Data

Slowly Changing Dimensions (SCD) – Design Principles, Challenges, and Hive Implementation

This article explains the concept of Slowly Changing Dimensions (SCD), discusses practical design questions, compares three change‑tracking requirements, presents three implementation patterns, and provides detailed Hive/SQL examples for historical data initialization and incremental updates in large‑scale data warehouses.

Big Data · SCD · data-warehouse
0 likes · 20 min read