Tagged articles

Hive

236 articles · Page 1 of 3

Jun 21, 2026 · Databases

Why Master‑Slave Replication Lags 5‑7 AM and How a Big‑Data Snapshot Fixes It

The article analyzes why master‑slave database replication experiences 30‑minute delays each morning between 5 AM and 7 AM, traces the cause to massive inventory‑snapshot jobs, evaluates several mitigation options, and details a big‑data extraction workflow that eliminates the lag while reducing disk usage.

Elasticsearch · Hive · big data extraction

0 likes · 8 min read

Big Data Tech Team

May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big Data · ETL · Hive

0 likes · 3 min read

dbaplus Community

May 20, 2026 · Databases

Stunning SQL Queries: From Tetris Game to Real‑Time Funnels

This article showcases a collection of impressive SQL queries—including a PostgreSQL Tetris implemented with a recursive CTE, window‑function session analysis, a ClickHouse real‑time funnel, dynamic WHERE clause generation, and a recursive employee hierarchy—while discussing performance tips and engine choices.

ClickHouse · Data Warehouse · Hive

0 likes · 25 min read

Architect-Kip

Mar 2, 2026 · Big Data

How to Build a Scalable Tiered Archive & Query System for MySQL Data

This article presents a comprehensive design for a layered storage and unified scheduling platform that archives MySQL historical data, reduces storage costs, ensures high‑performance queries, and enables efficient data analysis through tiered hot, warm, and cold storage using big‑data technologies.

Data Archiving · Doris · Flink

0 likes · 13 min read

Big Data Tech Team

Jan 5, 2026 · Big Data

When to Use Hive Partitioning vs Bucketing: A Practical Guide

This article explains Hive's partitioning and bucketing techniques, compares their purposes, advantages, and pitfalls, and shows how to combine them with concrete SQL examples to improve query performance, reduce I/O, and optimize joins and sampling in large data warehouses.

Bucketing · Data Warehouse · Hive

0 likes · 7 min read

Big Data Tech Team

Dec 25, 2025 · Big Data

How to Build an End‑to‑End E‑Commerce Data Warehouse for Interview Success

This guide walks you through designing and implementing a complete e‑commerce data‑warehouse project—from raw data ingestion and ODS/DWD/DWS/ADS layers to optional real‑time analytics—while highlighting interview‑ready resume tips, common pitfalls, and performance‑tuning tricks.

Big Data · ETL · Flink

0 likes · 10 min read

JD Tech

Dec 24, 2025 · Databases

How to Eliminate 30‑Minute Master‑Slave Lag in High‑Volume Inventory Systems

This article analyzes why a warehouse management system’s master‑slave database replication lagged by up to 30 minutes during nightly inventory snapshot generation, evaluates several mitigation strategies, and details the chosen big‑data‑driven solution that moved snapshots to Elasticsearch, reducing lag and disk usage.

Elasticsearch · Hive · database replication

0 likes · 8 min read

Big Data Tech Team

Oct 10, 2025 · Big Data

12 Essential Hive SQL Optimization Tricks to Boost Query Performance

This article presents twelve practical Hive SQL tuning techniques—ranging from avoiding COUNT(DISTINCT) to configuring parallel execution, reducer settings, and strict mode—to help data engineers reduce data skew, eliminate small files, improve resource utilization, and significantly accelerate query execution in large‑scale data warehouse environments.

Data Warehouse · Hive · query performance

0 likes · 11 min read

Huolala Tech

Sep 26, 2025 · Big Data

How We Migrated 40 PB of Hive Data Across Clouds with Zero Downtime

This article details the end‑to‑end design, challenges, and implementation of a cross‑cloud migration of over 200k Hive tables and nearly 40 PB of data using the self‑developed Kirk service, covering architecture, verification steps, and lessons learned to achieve 100% data consistency without impacting production services.

Big Data · Data Consistency · Data Migration

0 likes · 20 min read

Big Data Tech Team

Aug 25, 2025 · Interview Experience

Essential Big Data Interview Questions for Data Warehouse Engineer Roles

A comprehensive list of interview topics covering self‑introduction, career moves, data‑warehouse design, team building, architecture comparisons, fact‑table classification, common dimensions, performance tuning, and data‑governance for aspiring big‑data engineers.

Big Data · Data Governance · Flink

0 likes · 4 min read

Big Data Tech Team

Jul 17, 2025 · Big Data

Master Hive SQL: 10 Advanced Use Cases & Performance Optimizations for Hive 3.x

This article presents ten practical Hive SQL advanced scenarios—including session segmentation, funnel conversion, median calculation, array explosion, hierarchical recursion, deduplication, small‑file merging, conditional aggregation, approximate statistics, and data‑quality checks—each with full SQL code, key technical points, and optimization tips for Hive 3.x.

Data Warehouse · Hive · Optimization

0 likes · 9 min read

Big Data Tech Team

Apr 27, 2025 · Big Data

10 Advanced Hive SQL Use Cases: Windows, Skew, JSON, and More

This article presents ten practical Hive SQL scenarios—including window functions for ranking, LAG for time‑interval analysis, random‑salt techniques to mitigate data skew, dynamic partition writes, JSON parsing with UDFs, retention calculations, consecutive‑login detection, regex‑based path analysis, CUBE multi‑dimensional aggregation, and ORC storage optimizations—each accompanied by optimization tips and complete code examples.

Data Warehouse · Hive · Performance Optimization

0 likes · 9 min read

macrozheng

Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big Data · Canal · Data Transfer Service

0 likes · 20 min read

Java Architect Essentials

Apr 15, 2025 · Databases

How to Remove Duplicate Rows in SQL: DISTINCT, GROUP BY, and ROW_NUMBER Explained

This article demonstrates three SQL techniques—DISTINCT, GROUP BY, and the ROW_NUMBER window function—for deduplicating records and counting unique tasks, comparing their syntax, performance, and behavior across MySQL, Hive, and Oracle environments.

DISTINCT · Deduplication · GROUP BY

0 likes · 5 min read

Ma Wei Says

Mar 9, 2025 · Big Data

Mastering DWD Layer Design: Principles, Fact Tables, and Performance Tips

This article provides a comprehensive guide to designing the Data Warehouse Detail (DWD) layer, covering Kimball‑based design principles, step‑by‑step modeling, table and field naming conventions, concrete Hive DDL/DML examples, and optimization techniques such as partitioning, bucketing, and compression.

Big Data · DWD · Data Warehouse

0 likes · 21 min read

Airbnb Technology Team

Jan 24, 2025 · Artificial Intelligence

Chronon — An Open-Source Framework for Production-Level Feature Engineering in Machine Learning

Chronon is an open‑source framework that centralizes feature definitions to guarantee training‑inference consistency, eliminates complex ETL pipelines, and supports real‑time and batch processing across diverse data sources, cutting feature‑development cycles from months to under a week, as demonstrated by Airbnb’s 40,000‑feature deployment.

Chronon · Hive · Machine Learning

0 likes · 10 min read

dbaplus Community

Jan 19, 2025 · Big Data

How to Write Elegant, High‑Performance SQL for Big Data Pipelines

This article shares practical techniques for writing clean, efficient SQL in large‑scale data environments, covering predicate pushdown, sub‑queries, deduplication strategies, bucket optimization, and automation with Python‑Spark integration to improve readability and execution speed.

Hive · Optimization · Spark

0 likes · 14 min read

JD Tech

Dec 30, 2024 · Big Data

Techniques for Writing Elegant and Efficient SQL in Big Data Environments

The article shares practical methods and code examples for making SQL both readable and high‑performing in large‑scale data platforms, covering predicate push‑down with subqueries, deduplication strategies, bucket utilization, and Python‑driven job parameter handling.

Data Engineering · Hive · Performance

0 likes · 14 min read

Past Memory Big Data

Dec 27, 2024 · Big Data

How Uber Cuts Storage Costs with ZSTD Compression in Apache Parquet

Uber’s data lake on Hadoop stores hundreds of petabytes in Parquet files and, by adopting ZSTD compression, column pruning, and column reordering, achieves up to 79% storage reduction and significant vCore savings, with detailed benchmarks guiding optimal compression levels and open‑source contributions.

Apache Parquet · Big Data · Hadoop

0 likes · 14 min read

Qunar Tech Salon

Dec 10, 2024 · Big Data

Understanding and Solving Small File Problems in Hive and Spark

This article explains what constitutes a small file in HDFS, why they harm memory, compute and cluster load, outlines common sources such as data sources, streaming and dynamic partitioning, and provides detailed Hive and Spark solutions—including CombineHiveInputFormat, merge parameters, distribute by, and custom Spark extensions—to efficiently merge small files and improve job performance.

Big Data · Hive · MapReduce

0 likes · 23 min read

Su San Talks Tech

Dec 8, 2024 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive datasets, introduces Elasticsearch’s inverted‑index architecture, and details a practical pipeline using Hive, wide tables, binlog, Canal, and Otter to achieve near real‑time indexing for petabyte‑level data.

Canal · Hive · Otter

0 likes · 19 min read

Baidu Tech Salon

Nov 20, 2024 · Big Data

Optimizing Multi‑Dimensional User Count Computation in Feed Using Data Tagging

By deduplicating logs and assigning compact numeric tags to each user‑dimension combination, the data‑tagging method replaces costly lateral‑view expansions with a user‑level aggregation, cutting shuffle volume from terabytes to gigabytes and reducing runtime from 49 minutes to 14 minutes, enabling scalable multi‑dimensional user‑count analysis for Baidu Feed.

Hive · Performance Tuning · data tagging

0 likes · 14 min read

Shopee Tech Team

Oct 25, 2024 · Big Data

StarRocks at Shopee: Practical Use Cases and Performance Analysis

Shopee’s deployment of StarRocks across DataService, DataGo, and DataStudio demonstrates that its vectorized engine, cost‑based optimizer, and materialized‑view caching can query Hive, Iceberg, Delta Lake and Hudi up to 20,000× faster than Presto, cutting CPU usage and delivering consistently lower latency for complex analytics.

Data Lake · Hive · MPP

0 likes · 11 min read

Big Data Technology & Architecture

Sep 25, 2024 · Big Data

Learning Strategies and Interview Preparation Insights from a Big Data Student

The article shares practical study habits, detailed note‑taking, proactive questioning, effective communication, and a comprehensive set of interview questions covering Hive, Spark, Kafka, Flink, and other big‑data technologies, illustrated with real examples from a diligent student’s experience.

Hive · Kafka · Learning Strategies

0 likes · 7 min read

IT Xianyu

Aug 26, 2024 · Big Data

Hive Data Warehouse: Modeling, Partitioning, and ID‑Mapping for User Profiles

This article explains how Hive serves as a data‑warehouse layer for user‑profile tagging, covering data‑warehouse fundamentals, fact‑and‑dimension modeling, partitioned storage, label aggregation, and ID‑mapping techniques with practical Hive DDL/DML examples.

Big Data · Data Warehouse · ETL

0 likes · 11 min read

Big Data Technology & Architecture

Aug 3, 2024 · Big Data

Comprehensive Big Data Interview Questions and Topics

This article compiles a wide range of interview questions covering JVM garbage collection, Hadoop, Hive, Flink, HBase, data warehousing, real‑time processing, and HR topics, providing a thorough preparation guide for candidates targeting senior big‑data positions.

Flink · Hadoop · Hive

0 likes · 9 min read

Architect

Jul 18, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for Zhuanzhuan Payments

This article details the architecture, design principles, data preparation methods, verification processes, and error‑handling strategies of Zhuanzhuan's payment reconciliation system, highlighting how large‑scale data, binlog ingestion, Hive archiving, and MQ‑based workflows ensure accurate and secure financial settlements.

Hive · MQ · Reconciliation

0 likes · 11 min read

Zhuanzhuan Tech

May 23, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for Zhuanzhuan Payments

This article details the background, architecture, data preparation methods, massive‑data handling strategies, verification processes, and error‑handling mechanisms of Zhuanzhuan's channel reconciliation system, highlighting design choices such as binlog ingestion, task‑driven bill downloads, sharding with Hive archiving, and MQ‑based reconciliation to ensure financial data consistency and safety.

Hive · MQ · Reconciliation

0 likes · 11 min read

Alibaba Cloud Developer

Apr 30, 2024 · Big Data

Mastering ODPS SQL: Proven Tips to Slash Query Time and Tackle Data Skew

This article explores practical SQL optimization techniques for Alibaba's ODPS platform, covering fundamentals, common pitfalls like null handling and select *, advanced strategies such as multi‑insert, partition limiting, UDF placement, data‑skew mitigation, parameter tuning, and real‑world case studies that dramatically reduce query runtimes.

Big Data · Data Skew · Hive

0 likes · 23 min read

Sohu Tech Products

Apr 24, 2024 · Big Data

How to Build a ClickHouse‑Powered Retention Analysis Model for User Behavior

This article explains the concepts, formulas, and step‑by‑step implementation of a user‑retention analysis model, covering both Hive‑based offline processing and ClickHouse‑accelerated real‑time queries, complete with SQL examples, architecture diagrams, and practical optimization tips.

Big Data · ClickHouse · Data Visualization

0 likes · 19 min read

vivo Internet Technology

Apr 17, 2024 · Big Data

Retention Analysis Model Practice Based on ClickHouse

The article explains retention analysis models, their importance for user loyalty, outlines offline Hive architecture, then shows how ClickHouse’s retention() function and columnar storage dramatically speed up multi‑day retention calculations, providing SQL examples and practical guidance for product analytics.

ClickHouse · Hive · Retention Analysis

0 likes · 17 min read

DataFunTalk

Apr 9, 2024 · Big Data

Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform

This article shares Xiaomi's real‑world challenges and solutions when building a new Spark 3.1‑based data platform, covering Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and future roadmap for vectorized execution.

Apache Spark · Big Data · Data Migration

0 likes · 14 min read

Su San Talks Tech

Mar 9, 2024 · Big Data

How to Build Near‑Real‑Time Elasticsearch Indexes for PB‑Scale Data

Learn how to construct near‑real‑time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, leveraging inverted indexes, using Hive and wide tables, and employing binlog‑based pipelines with Canal and Otter to achieve second‑level index updates.

Canal · Elasticsearch · Hive

0 likes · 18 min read

iQIYI Technical Product Team

Mar 8, 2024 · Big Data

Smooth Migration from Hive to Iceberg Data Lake at iQIYI: Architecture, Techniques, and Performance Evaluation

iQIYI migrated hundreds of petabytes of Hive tables to Apache Iceberg using dual‑write, in‑place, and CTAS strategies, combined with partition pruning, Bloom filters, and Trino/Alluxio optimizations, achieving up to 40% lower query latency, simplified pipelines, and faster, cost‑effective data lake operations.

Data Lake · Hive · Iceberg

0 likes · 20 min read

Zhengcaiyun Technology (政采云技术)

Jan 11, 2024 · Big Data

Overview of the Government Procurement Cloud Self-Service Data Extraction Platform

This article introduces the self‑service data extraction platform developed by the Government Procurement Cloud, detailing its architecture, core modules such as self‑service extraction, data push, resource management, operation audit, permission controls, performance optimizations, and future development plans.

Big Data · Data Security · Hive

0 likes · 9 min read

Weimob Technology Center

Jan 2, 2024 · Big Data

How to Efficiently Test BI Reports in a Hive‑StarRocks Data Warehouse

This article details practical methods for testing BI reports built on Hive and StarRocks, covering the report creation workflow, testing characteristics, SQL writing techniques, impact analysis, data warehouse simplification, and the application of data quality tools to ensure accurate and efficient reporting.

BI testing · Data Quality · Data Warehouse

0 likes · 9 min read

DataFunTalk

Dec 27, 2023 · Big Data

Amoro Mixed Hive: A Unified Lakehouse Solution for Real‑Time and Batch Data Processing

This article describes how NetEase Youdao replaced its Doris‑based real‑time data warehouse with Amoro Mixed Hive, detailing the architectural challenges, the Mixed Hive design, implementation steps, performance optimizations, community contributions, and future roadmap to achieve a unified lakehouse with minute‑level freshness and reduced development and operational costs.

Amoro · Big Data · Flink

0 likes · 12 min read

Selected Java Interview Questions

Nov 5, 2023 · Backend Development

Design and Implementation of a High‑Performance Distributed Reconciliation System for Large‑Scale Payment Orders

This article presents a comprehensive design of a distributed reconciliation system that handles tens of millions of daily payment orders by using a six‑module architecture, Kafka for decoupled state transitions, Hive for large‑scale data processing, and Java‑based plug‑in patterns to achieve six‑nines (99.9999%) accuracy and significant operational cost savings.

Big Data · Hive · Java

0 likes · 15 min read

Past Memory Big Data

Oct 10, 2023 · Big Data

2023 Big Data Interview Guide: Hadoop, Hive, Doris, Data Warehouse Essentials

This comprehensive 2023 guide covers essential big‑data interview topics, providing detailed explanations and step‑by‑step processes for Hadoop HDFS read/write, YARN, Hive table types and optimizations, Doris architecture and data models, data‑warehouse layers, modeling techniques, quality monitoring, and classic algorithm design questions such as TOP‑K and duplicate detection.

Big Data · Data Warehouse · Doris

0 likes · 54 min read

Big Data Technology & Architecture

Sep 14, 2023 · Big Data

Big Data Interview Guide: Common Questions from Leading Companies

This article compiles real interview experiences from a top tech firm and other leading companies, presenting a detailed list of common big‑data interview questions covering Hadoop, Hive, Spark, Flink, Kafka, data skew, HDFS architecture, and related concepts to help candidates prepare effectively.

Big Data · Flink · Hadoop

0 likes · 8 min read

Zhengcaiyun Technology (政采云技术)

Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture · Cluster Deployment · Hadoop

0 likes · 19 min read

JD Retail Technology

Aug 21, 2023 · Artificial Intelligence

ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios

This article examines how ChatGPT-4, as an advanced natural‑language‑processing model, can streamline data analysis tasks—from generating Hive table definitions and sample data to crafting complex HiveSQL queries, visualizing results, and implementing ClickHouse and Flink solutions—thereby improving efficiency, insight, and problem‑solving in big‑data environments.

Artificial Intelligence · Big Data · ChatGPT-4

0 likes · 7 min read

JD Retail Technology

Aug 16, 2023 · Big Data

Automating Real‑Time and Offline Data Verification for Ranking Lists during Large‑Scale Promotions

The article describes the evolution from manual to semi‑automatic and finally fully automatic solutions for verifying real‑time and offline ranking data during major sales events, detailing rule extraction, Hive‑based SQL generation, execution, and the resulting reduction in human effort.

Automation · Data verification · Hive

0 likes · 6 min read

Big Data Technology & Architecture

Jul 17, 2023 · Big Data

Incremental Query of Hudi Tables Using Hive, Spark SQL, and Flink SQL

This guide explains how to perform incremental queries on Hudi tables by configuring Hive synchronization, using Spark SQL both programmatically and via pure SQL, and leveraging Flink SQL in batch and streaming modes, with detailed parameter settings and code examples.

Big Data · Flink SQL · Hive

0 likes · 20 min read

Big Data Technology & Architecture

Jun 27, 2023 · Big Data

Comprehensive Big Data Interview Experience and Questions Overview

The article presents a detailed three‑month interview journey that led to a position at a top new‑energy automotive firm, outlining the questions and topics covered in five interview rounds—including Hive, Spark, Flink, Kafka, data modeling, and data governance—to help candidates prepare for big‑data roles.

Big Data · Flink · Hive

0 likes · 7 min read

JD Tech

Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Hive · Optimization

0 likes · 17 min read

Alibaba Cloud Developer

Jun 14, 2023 · Big Data

How to Diagnose and Optimize Data Skew and Data Expansion in Big Data SQL

This article shares practical methods, based on real‑world team experience, to identify and resolve data skew and data expansion issues in big data SQL queries, offering systematic investigation steps and optimization techniques for Map, Reduce, and Join stages.

Big Data · Data Skew · Hive

0 likes · 9 min read

iQIYI Technical Product Team

Jun 9, 2023 · Big Data

Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL

iQIYI accelerated its big‑data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67% speedup, 50% CPU reduction, and 44% memory savings, while automating the conversion of tens of thousands of tasks and delivering faster analytics for advertising, BI, membership and user‑growth services.

Data Migration · Hive · Performance Optimization

0 likes · 18 min read

vivo Internet Technology

May 24, 2023 · Big Data

Kafka Real-time Data Archiving to Hive: Flink SQL and DataStream Implementation Solutions

The article explains how to archive Kafka real‑time data to Hive using either Flink SQL, which quickly creates partitioned ORC tables but requires timezone handling, or Flink DataStream for more complex pipelines, and offers best‑practice guidance on data quality, system complexity, security, and performance.

Big Data · Data Archiving · DataStream

0 likes · 15 min read

Big Data Technology & Architecture

May 19, 2023 · Big Data

Comprehensive Big Data Interview Q&A and Personal Project Summary

This article shares a recent graduate's successful job offer story, emphasizes preparing a detailed personal project summary, and provides extensive big‑data interview questions covering Hadoop, Spark, Flink, Kafka, Hive, ClickHouse, and related technologies to help candidates excel in interviews.

Big Data · Flink · Hadoop

0 likes · 15 min read

Data Thinking Notes

May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Hive · Optimization · Small Files

0 likes · 10 min read

Big Data Technology & Architecture

May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files

0 likes · 9 min read

Zhengcaiyun Technology (政采云技术)

Apr 18, 2023 · Big Data

Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance by collecting storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimating monthly expenses using replication factors and resource‑usage rates, while providing practical SQL and shell examples.

Data Cost Governance · HDFS · Hive

0 likes · 18 min read

JD Retail Technology

Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew

0 likes · 15 min read

ITPUB

Mar 28, 2023 · Big Data

How We Turned a Hive Data Warehouse into a Real‑Time Lakehouse with Apache Hudi

This article details the migration from a traditional Hive‑based data warehouse to a lakehouse architecture using Apache Hudi, covering the original Lambda setup, its pain points, lake‑vs‑warehouse differences, Hudi features, integration challenges, practical solutions, and future roadmap.

Apache Hudi · Big Data · Data Warehouse

0 likes · 11 min read

Data Thinking Notes

Mar 22, 2023 · Big Data

How to Optimize Compute Resource Governance in Data Warehouses with Spark & Hive

This article walks through practical steps for governing compute resources in a data warehouse, covering problem identification, strategic thinking, Spark and Hive tuning, small‑file handling, DQC improvement, high‑consumption task optimization, scheduling adjustments, and measurable performance gains.

Compute Governance · Hive · Spark

0 likes · 13 min read

DataFunSummit

Mar 20, 2023 · Backend Development

Unified UDF Implementation on Cloud Platform: Architecture, Features, and Open‑Source Contributions

This article introduces a unified User‑Defined Function (UDF) solution on a cloud data platform, detailing its remote execution architecture, compatibility with Hive UDFs, resource isolation, hot‑update capabilities, internal platform implementation, open‑source contributions to PrestoDB, and future work plans.

Hive · Serverless · UDF

0 likes · 11 min read

Bilibili Tech

Mar 10, 2023 · Information Security

Data Security Construction in Berserker Platform

The article outlines Berserker’s comprehensive data‑security framework, built on the CIA triad and the 5A methodology, which unifies authentication, authorization, access control, asset protection, and auditing across Hive, Kafka, ClickHouse, and ETL tasks. It then describes the migration from version 1.0 to 2.0, with a redesigned permission system, workspaces, and Casbin performance tweaks, and previews future fine‑grained, lifecycle‑wide security enhancements.

Access Control · Authorization · Berserker platform

0 likes · 15 min read

Su San Talks Tech

Feb 27, 2023 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains how to construct near real-time Elasticsearch indexes for petabyte‑scale datasets by comparing MySQL limitations, introducing ES fundamentals, leveraging Hive and wide tables, and employing binlog‑based tools like Canal and Otter for low‑latency data synchronization.

Canal · Elasticsearch · Hive

0 likes · 22 min read

NetEase Yanxuan Technology Product Team

Feb 20, 2023 · Big Data

Data Task Optimization Techniques and Practices

The article surveys unconventional optimizations for offline data tasks, such as DISTRIBUTE BY, seeded random shuffling, explode‑based skew mitigation, hash bucketing, task‑parallelism tuning, and multi‑insert materialization, and organizes them by point, line, and surface perspectives. It stresses that effective performance gains require both technical tricks and business‑driven pipeline adjustments.

Distributed Computing · Hive · SQL tuning

0 likes · 16 min read

Big Data Technology & Architecture

Feb 10, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook

This article presents a curated collection of big‑data learning resources, including interview guides, in‑depth articles on Flink, Spark, Hive, ClickHouse, data governance, and personal growth, offering readers a one‑stop reference to boost their big‑data expertise and interview readiness.

Big Data · Data Governance · Flink

0 likes · 5 min read

Big Data Technology & Architecture

Feb 9, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook and Resource Collection

This article presents a curated collection of the most comprehensive big‑data interview preparation resources, including expert guides, tutorials, and deep‑dive articles on Flink, Spark, Hive, ClickHouse, data governance, and related topics, accompanied by a call to engage with the content.

Big Data · ClickHouse · Data Governance

0 likes · 4 min read

Java High-Performance Architecture

Jan 5, 2023 · Databases

Scaling Billions of Orders: MySQL Sharding, ES & Hive Strategies

This article explains how to handle massive order volumes by classifying data into hot and cold tiers, storing them in MySQL, Elasticsearch, and Hive, and implementing sharding and partitioning strategies (shard keys, modulo routing, and combined database‑table distribution) to achieve high throughput and low cost.

Elasticsearch · Hive · MySQL

0 likes · 8 min read

JD Tech

Jan 4, 2023 · Big Data

Implementing Data Cubes in Hive Using WITH CUBE, GROUPING SETS, and WITH ROLLUP

This article demonstrates how to build multi‑dimensional data cubes on JD's big‑data platform using Hive, comparing UNION ALL with the more concise WITH CUBE, GROUPING SETS, and WITH ROLLUP functions, and discusses practical pitfalls and optimization tips.

Big Data · Grouping Sets · Hive

0 likes · 10 min read

Big Data Technology & Architecture

Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Big Data · Flink · Hive

0 likes · 19 min read

Architecture Digest

Jan 2, 2023 · Databases

Database Sharding and Partitioning Strategy for High‑Volume Order Systems

This article explains how to classify massive order data into hot and cold segments, store them in MySQL, Elasticsearch and Hive respectively, and implement sharding and partitioning at both table and database levels using modulo and hash calculations to achieve scalable performance for billions of orders.

Hive · Partitioning · architecture

0 likes · 8 min read

Architect

Dec 30, 2022 · Databases

Database Sharding and Partitioning Strategy for High‑Volume Order Systems

The article explains how to handle billions of daily orders by classifying data into hot and cold segments, storing them in MySQL, Elasticsearch, and Hive, and applying sharding and partitioning techniques at both table and database levels to achieve scalable performance.

Data Partitioning · Elasticsearch · Hive

0 likes · 9 min read

Ziru Technology

Dec 16, 2022 · Big Data

How to Effectively Test Offline Data Metrics and Data Warehouse Pipelines

This article explains what data metrics are, compares offline metric testing with traditional testing, and provides a comprehensive step‑by‑step guide for testing data collection, ETL, warehouse models, metric calculations, scheduling, security, and API outputs in a Hive‑based data warehouse.

Data Validation · Data Warehouse · ETL

0 likes · 9 min read

Big Data Technology & Architecture

Dec 15, 2022 · Big Data

Migrating Hive SQL to Flink SQL: Motivation, Challenges, Practice, Demo, and Future Plans

This technical article presents a comprehensive overview of migrating Hive SQL to Flink SQL, covering the motivations behind the migration, key challenges such as compatibility, stability and performance, practical implementation steps, a detailed demo, future development directions, and a Q&A session addressing common concerns.

Big Data · Data Lake · Flink

0 likes · 13 min read

Zhuanzhuan Tech

Dec 15, 2022 · Big Data

Zhuanzhuan User Profile Platform: Architecture, Tag Construction, Storage, and User Segmentation Practices

This article details Zhuanzhuan's user profile platform, covering its business-driven motivation, tag taxonomy, system architecture, data pipelines using Hive, ClickHouse and Spark, storage design, per‑user insight, segmentation techniques, ID‑mapping, and future plans for real‑time tagging.

Big Data · Data Engineering · Hive

0 likes · 17 min read

DeWu Technology

Nov 30, 2022 · Big Data

Fundamentals and Implementation of Data Lineage in Big Data Environments

Data lineage in big‑data environments tracks how data moves and transforms, from source tables through SQL processing to final storage. It supports management tasks such as domain segmentation, performance tuning, anomaly detection, and dependency verification. Implementations range from simple regex extraction to robust AST parsing and optimization, as used by tools like Alibaba DataWorks and Apache Atlas.

AST · Big Data · Hive

0 likes · 7 min read

Data Thinking Notes

Nov 22, 2022 · Big Data

Why Sqoop Sync from RDS to Hive Stalls Over 8 Hours and How to Fix It

A Sqoop job that normally finishes within 2.5 hours occasionally takes more than 8 hours due to data skew caused by an unsuitable split column, and the article details the investigation, root‑cause analysis, and a practical solution using a better split column and adjusted parallelism.

Big Data · Data Skew · Hive

0 likes · 5 min read

vivo Internet Technology

Nov 16, 2022 · Big Data

Vivo Hawking A/B Experiment Platform: Architecture, Practices, and Solutions

The Vivo Hawking platform provides a company‑wide, one‑stop A/B testing solution with a layered architecture, covariate‑balanced split algorithms, real‑time monitoring, and unified SDKs for Android, Java and H5, enabling thousands of daily experiments, automated analysis, and rapid product iteration across multiple departments.

Covariate balancing · Experiment Platform · Hive

0 likes · 22 min read

dbaplus Community

Oct 30, 2022 · Big Data

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

ETL · Hive · Spark

0 likes · 38 min read

Bilibili Tech

Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves on the traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers. This enables fast, exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili’s user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · ClickHouse

0 likes · 16 min read

DataFunSummit

Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP

0 likes · 10 min read

DataFunTalk

Sep 15, 2022 · Big Data

Bilibili Offline Platform: Migration from Hive to Spark and Large‑Scale Optimizations

This article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing the migration process, automated SQL conversion, result verification, stability and performance enhancements, meta‑store optimizations, and future work on remote shuffle and vectorized execution.

Data Skipping · Hive · MetaStore

0 likes · 28 min read

DaTaobao Tech

Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies, including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters. Three real‑world cases illustrate how they dramatically cut execution time and resource usage.

Big Data · Data Skew · Hive

0 likes · 23 min read

DataFunTalk

Aug 14, 2022 · Big Data

NetEase Yanxuan DMP Tag System Construction Practice

This article details NetEase Yanxuan’s DMP tag system, covering its platform overview, tag production workflow, storage architecture, high‑performance query techniques, and future plans, illustrating how data from multiple sources is processed through ODS, DWD, DM layers and leveraged via Spark, Hive, and Apache Doris for real‑time and offline analytics.

Apache Doris · DMP · Hive

0 likes · 11 min read

ITPUB

Aug 1, 2022 · Big Data

How Bilibili Scaled Offline Computing: Migrating from Hive to Spark and Boosting Performance

This article details Bilibili's evolution from a Hadoop‑based offline platform to a Spark‑driven architecture, covering the Hive‑to‑Spark migration, automated SQL conversion, result validation, stability enhancements, performance tuning, meta‑store federation, and future directions for large‑scale data processing.

Big Data · Data Skipping · Hive

0 likes · 31 min read

ITPUB

Jul 23, 2022 · Information Security

How Bilibili Secured Hadoop: Ranger‑Based HDFS and Hive Access Control Deep Dive

This article details Bilibili's implementation of Apache Ranger for fine‑grained access control across Hadoop, HDFS, Hive, Spark, and Presto, covering architecture, API redesign, admin optimizations, gray‑release strategies, permission pre‑checks, data masking, and future plans for incremental policy loading.

Access Control · Data Security · HDFS

0 likes · 16 min read

Bilibili Tech

Jul 22, 2022 · Information Security

Design and Optimization of Ranger‑Based Access Control for HDFS and Hive in Bilibili's Data Platform

Bilibili’s data platform redesigns Ranger‑based access control by simplifying HDFS and Hive policy APIs, parallelizing policy loading, adding gray‑release and pre‑check mechanisms, integrating fine‑grained Hive authorization with data masking, and extending support to Spark and Presto. Planned work includes incremental loading, policy fusion, and a NameNode proxy to boost security and performance.

Access Control · HDFS · Hive

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jul 21, 2022 · Big Data

Boosting Offline Data Warehouse Performance with DeltaLake: Key Strategies

This article details how Zuoyebang migrated its Hive‑based offline data warehouse to DeltaLake, addressing latency, incremental updates, and query performance through stream‑to‑batch processing, dynamic partition pruning, and Z‑order optimization, resulting in faster data readiness and analyst queries.

Big Data · DeltaLake · Hive

0 likes · 17 min read

Big Data Technology Architecture

Jun 8, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Comprehensive Performance Optimizations

The article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing migration tools, SQL conversion, result and resource comparison, shuffle stability, small‑file handling, runtime filters, data skipping, ZSTD support, Hive Metastore federation, traffic control, and future optimization directions.

Data Migration · Hive · Resource Management

0 likes · 29 min read

Bilibili Tech

May 31, 2022 · Big Data

Bilibili Offline Computing Platform: Migration from Hive to Spark and Operational Practices

Bilibili migrated its massive offline platform from Hive to Spark using automated SQL rewriting and dual‑run verification, cutting execution time by more than 40% and resource use by 30%. It also introduced small‑file merging, shuffle stability improvements, runtime filters, data skipping, lineage tracking, automatic parameter tuning, and metastore federation for robust large‑scale processing.

Big Data · Data Engineering · Hive

0 likes · 30 min read

Big Data Technology & Architecture

May 17, 2022 · Big Data

Apache Hudi: Core Concepts, Architecture, Storage Types, Write Operations, Querying, and Management

This article provides a comprehensive guide to Apache Hudi, covering its basic concepts, timeline architecture, storage types (Copy‑On‑Write and Merge‑On‑Read), write operations, DeltaStreamer usage, Hive/Spark/Presto query integration, data management, indexing, compaction, and best‑practice recommendations for big‑data lake workloads.

Apache Hudi · Big Data · Copy-on-Write

0 likes · 43 min read

ByteDance Data Platform

May 11, 2022 · Big Data

How to Build a High‑Performance SparkSQL Server with Hive JDBC Compatibility

This article explains how to design and implement a SparkSQL server that lowers usage barriers and boosts efficiency by supporting standard JDBC interfaces, integrating Hive Server2 protocols, handling multi‑tenant authentication, managing Spark job lifecycles, and providing high‑availability through Zookeeper coordination.

Hive · JDBC · Server Architecture

0 likes · 15 min read

Snowball Engineer Team

Apr 21, 2022 · Big Data

Migrating from Hive3 on Tez to Spark SQL: Practices, Challenges, and Performance Evaluation

This article details the Snowball data team's migration from Hive3 on Tez to Spark SQL, covering the motivations, comparative performance tests, encountered compatibility issues, configuration work‑arounds, and future plans for consolidating ETL workloads on Spark.

Big Data · Data Warehouse · ETL

0 likes · 13 min read

IEG Growth Platform Technology Team

Apr 18, 2022 · Big Data

Big Data Overview: Definitions, Applications, Technology Stack, and Core Components (Hadoop, HDFS, MapReduce, YARN, Hive, HBase)

This comprehensive article explains big data concepts, definitions from Gartner and IBM, real‑world use cases, the Hadoop ecosystem architecture, and detailed introductions to HDFS, MapReduce, YARN, Hive, and HBase, including practical examples and shell commands.

HBase · HDFS · Hadoop

0 likes · 42 min read

Big Data Technology & Architecture

Apr 15, 2022 · Big Data

Configuring Flink SQL Client with Iceberg: Catalogs, DDL, Data Insertion and Query

This guide explains how to set up the Flink SQL client to work with Apache Iceberg, covering Scala version requirements, downloading and deploying Iceberg jars, configuring Hive and HDFS catalogs, creating databases and tables, performing insert and overwrite operations, and querying data in both batch and streaming modes.

Big Data · Catalog · Flink

0 likes · 18 min read

Zuoyebang Tech Team

Apr 13, 2022 · Big Data

How Delta Lake Transformed Our Offline Data Warehouse Performance

This article details how Zuoyebang's engineering team migrated their Hive‑based offline data warehouse to Delta Lake, tackling latency, scalability, and query‑performance challenges through stream‑to‑batch processing, data‑lake architecture, and optimizations like dynamic partition pruning (DPP) and Z‑ordering.

Big Data · Delta Lake · Hive

0 likes · 15 min read

Big Data Technology & Architecture

Mar 22, 2022 · Big Data

Integrating Hive Data Warehouse with ClickHouse Using Seatunnel: A Step‑by‑Step Guide

This article provides a comprehensive, hands‑on tutorial for connecting a Hive data warehouse to ClickHouse via Seatunnel, covering environment setup, Hive and ClickHouse table creation, full and incremental data import scripts, execution examples, and practical troubleshooting tips.

Big Data · ClickHouse · Data Integration

0 likes · 10 min read

Big Data Technology & Architecture

Mar 7, 2022 · Big Data

Apache Griffin: An Overview of the Big Data Data‑Quality Monitoring Tool

This article introduces Apache Griffin, a model‑driven big‑data data‑quality monitoring platform, explains its key features, architecture, installation requirements, and provides step‑by‑step usage examples with Hive, Kafka and Spark integration.

Apache Griffin · Big Data · Data Quality

0 likes · 9 min read

Big Data Technology & Architecture

Feb 28, 2022 · Big Data

Integrating Apache Hudi with Hive, Presto, and Spark SQL: Installation, Operations, and Query Examples

This article provides a step‑by‑step guide on integrating Apache Hudi with Hive and Presto, demonstrates core Hudi operations such as insert, upsert, delete, query, and Hive synchronization using Scala code, and shows how to manage Hudi tables through Spark SQL DDL/DML commands.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

ByteDance Data Platform

Feb 21, 2022 · Big Data

Choosing the Right Components for Enterprise Data Warehouses: Hive vs SparkSQL

This article examines how to design enterprise‑grade data warehouses by evaluating development convenience, ecosystem, decoupling, performance and security, compares Hive and SparkSQL along with other engines such as Presto, Doris and ClickHouse, and outlines best‑practice component selections for long‑running batch and interactive analytics.

Big Data · Data Warehouse · ETL

0 likes · 19 min read

DataFunTalk

Feb 15, 2022 · Big Data

SeaTunnel Multi‑Dimensional Practice at Vipshop: ClickHouse‑Hive Integration and Data Platform Integration

The article details Vipshop's multi‑dimensional use of SeaTunnel to integrate Hive and ClickHouse, describing data import/export challenges, tool selection among DataX, SeaTunnel and Spark, custom configurations, platform integration, and future improvements for high‑performance OLAP pipelines.

Big Data · ClickHouse · Data Integration

0 likes · 15 min read

IT Architects Alliance

Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Big Data · Hive · Spark

0 likes · 32 min read

IT Xianyu

Jan 28, 2022 · Big Data

Step-by-Step Guide to Installing and Configuring Hue on CentOS 7 with Hadoop, Hive, and YARN

This tutorial explains how to set up the Hue web UI on a CentOS 7 machine by installing required dependencies, compiling Hue, configuring HDFS, YARN and Hive integration files, starting Hive services, launching Hue, and accessing the interface, with all commands and configuration snippets provided.

Big Data · CentOS · Hadoop

0 likes · 6 min read

IT Xianyu

Jan 27, 2022 · Big Data

Installing Apache Hive on macOS with Hadoop and MySQL Metastore

This tutorial provides step‑by‑step instructions for installing Hadoop 3.1.1, Homebrew, Hive, and configuring MySQL as Hive's metastore on macOS, including environment variable setup, hive‑site.xml configuration, MySQL connector placement, schema initialization, and verification commands.

Big Data · Hadoop · Hive

0 likes · 6 min read
