Tag

Hive

0 views collected around this technical thread.

macrozheng
macrozheng
Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big DataCanalData Transfer Service
0 likes · 20 min read
How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data
Airbnb Technology Team
Airbnb Technology Team
Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

ChrononFeature EngineeringHive
0 likes · 10 min read
Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning
JD Tech
JD Tech
Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Big DataData EngineeringHive
0 likes · 14 min read
Techniques for Writing Elegant and Efficient SQL in Big Data Environments
Qunar Tech Salon
Qunar Tech Salon
Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big DataHiveMapReduce
0 likes · 23 min read
Understanding and Solving Small File Problems in Hive and Spark
Baidu Tech Salon
Baidu Tech Salon
Nov 20, 2024 · Big Data

Optimizing Multi‑Dimensional User Count Computation in Feed Using Data Tagging

By deduplicating logs and assigning compact numeric tags to each user‑dimension combination, the data‑tagging method replaces costly lateral‑view expansions with a user‑level aggregation, cutting shuffle volume from terabytes to gigabytes and reducing runtime from 49 minutes to 14 minutes, enabling scalable multi‑dimensional user‑count analysis for Baidu Feed.

Big DataHivePerformance Tuning
0 likes · 14 min read
Optimizing Multi‑Dimensional User Count Computation in Feed Using Data Tagging
Shopee Tech Team
Shopee Tech Team
Oct 25, 2024 · Big Data

StarRocks at Shopee: Practical Use Cases and Performance Analysis

Shopee’s deployment of StarRocks across DataService, DataGo, and DataStudio demonstrates that its vectorized engine, cost‑based optimizer, and materialized‑view caching can query Hive, Iceberg, Delta Lake and Hudi up to 20,000× faster than Presto, cutting CPU usage and delivering consistently lower latency for complex analytics.

HiveMPPPerformance Benchmark
0 likes · 11 min read
StarRocks at Shopee: Practical Use Cases and Performance Analysis
IT Xianyu
IT Xianyu
Aug 26, 2024 · Big Data

Hive Data Warehouse: Modeling, Partitioning, and ID‑Mapping for User Profiles

This article explains how Hive serves as a data‑warehouse layer for user‑profile tagging, covering data‑warehouse fundamentals, fact‑and‑dimension modeling, partitioned storage, label aggregation, and ID‑mapping techniques with practical Hive DDL/DML examples.

Big DataData WarehouseETL
0 likes · 11 min read
Hive Data Warehouse: Modeling, Partitioning, and ID‑Mapping for User Profiles
Architect
Architect
Jul 18, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments

This article details the architecture, design principles, data preparation methods, verification processes, and error‑handling strategies of ZuanZuan's payment reconciliation system, highlighting how large‑scale data, binlog ingestion, Hive archiving, and MQ‑based workflows ensure accurate and secure financial settlements.

Data ProcessingHiveMQ
0 likes · 11 min read
Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments
Zhuanzhuan Tech
Zhuanzhuan Tech
May 23, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments

This article details the background, architecture, data preparation methods, massive‑data handling strategies, verification processes, and error‑handling mechanisms of ZuanZuan's channel reconciliation system, highlighting design choices such as binlog ingestion, task‑driven bill downloads, sharding with Hive archiving, and MQ‑based reconciliation to ensure financial data consistency and safety.

BackendHiveMQ
0 likes · 11 min read
Design and Implementation of a Channel Reconciliation System for ZuanZuan Payments
vivo Internet Technology
vivo Internet Technology
Apr 17, 2024 · Big Data

Retention Analysis Model Practice Based on ClickHouse

The article explains retention analysis models, their importance for user loyalty, outlines offline Hive architecture, then shows how ClickHouse’s retention() function and columnar storage dramatically speed up multi‑day retention calculations, providing SQL examples and practical guidance for product analytics.

ClickHouseData ModelingHive
0 likes · 17 min read
Retention Analysis Model Practice Based on ClickHouse
DataFunTalk
DataFunTalk
Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache SparkBig DataHive
0 likes · 14 min read
Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform
iQIYI Technical Product Team
iQIYI Technical Product Team
Mar 8, 2024 · Big Data

Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation

iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.

Big DataHiveIceberg
0 likes · 20 min read
Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation
政采云技术
政采云技术
Jan 11, 2024 · Big Data

Overview of the Government Procurement Cloud Self-Service Data Extraction Platform

This article introduces the self‑service data extraction platform developed by the Government Procurement Cloud, detailing its architecture, core modules such as self‑service extraction, data push, resource management, operation audit, permission controls, performance optimizations, and future development plans.

Big DataData SecurityHive
0 likes · 9 min read
Overview of the Government Procurement Cloud Self-Service Data Extraction Platform
Weimob Technology Center
Weimob Technology Center
Jan 2, 2024 · Big Data

How to Efficiently Test BI Reports in a Hive‑StarRocks Data Warehouse

This article details practical methods for testing BI reports built on Hive and StarRocks, covering the report creation workflow, testing characteristics, SQL writing techniques, impact analysis, data warehouse simplification, and the application of data quality tools to ensure accurate and efficient reporting.

BI testingData WarehouseHive
0 likes · 9 min read
How to Efficiently Test BI Reports in a Hive‑StarRocks Data Warehouse
DataFunTalk
DataFunTalk
Dec 27, 2023 · Big Data

Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive, detailing the architectural challenges, the Mixed Hive design, implementation steps, performance optimizations, community contributions, and future roadmap to achieve a unified lakehouse with minute‑level freshness and reduced development and operational costs.

AmoroBig DataFlink
0 likes · 12 min read
Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing
Selected Java Interview Questions
Selected Java Interview Questions
Nov 5, 2023 · Backend Development

Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders

This article presents a comprehensive design of a distributed reconciliation system that handles tens of millions of daily payment orders by using a six‑module architecture, Kafka for decoupled state transitions, Hive for large‑scale data processing, and Java‑based plug‑in patterns to achieve six‑nine accuracy and significant operational cost savings.

Big DataHiveJava
0 likes · 15 min read
Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders
政采云技术
政采云技术
Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architectureBig DataCluster Deployment
0 likes · 19 min read
Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture
JD Retail Technology
JD Retail Technology
Aug 21, 2023 · Artificial Intelligence

ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios

This article examines how ChatGPT-4, as an advanced natural‑language‑processing model, can streamline data analysis tasks—from generating Hive table definitions and sample data to crafting complex HiveSQL queries, visualizing results, and implementing ClickHouse and Flink solutions—thereby improving efficiency, insight, and problem‑solving in big‑data environments.

Artificial IntelligenceBig DataChatGPT-4
0 likes · 7 min read
ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios
Qunar Tech Salon
Qunar Tech Salon
Aug 3, 2023 · Big Data

Optimizing Hotel List Page Data Warehouse Flow at Qunar.com: A Technical Case Study

This article presents a comprehensive case study of how Qunar.com’s hotel data warehouse team identified performance bottlenecks in the L‑page traffic table, applied a series of "split, lift, parallel, delay" strategies, and achieved significant reductions in processing time, storage usage, and query latency.

Big DataData WarehouseETL
0 likes · 21 min read
Optimizing Hotel List Page Data Warehouse Flow at Qunar.com: A Technical Case Study