Tagged articles
232 articles
Page 2 of 3
Top Architect
Top Architect
Dec 13, 2021 · Big Data

Design and Implementation of BanYu's Big Data Access Control System

This article describes the evolution from an unsecured data warehouse to a comprehensive big‑data access control system at BanYu, detailing the background, data access methods, design goals, authentication and authorization mechanisms, policy configuration, integration with Metabase, and the overall workflow that balances security with efficiency.

Big DataLDAPPresto
0 likes · 15 min read
Design and Implementation of BanYu's Big Data Access Control System
Architecture Digest
Architecture Digest
Dec 11, 2021 · Big Data

Design and Implementation of BanYu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of BanYu's big data permission system, highlighting how it balances security and efficiency across Hive, Presto, HDFS, and other components.

Apache RangerPrestoaccess control
0 likes · 16 min read
Design and Implementation of BanYu's Big Data Permission System
IT Architects Alliance
IT Architects Alliance
Dec 11, 2021 · Big Data

Design and Implementation of Banyu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of Banyu's big data permission system, which secures Hive, Presto, HDFS and other data access components using Apache Ranger and LDAP.

Apache RangerBig DataLDAP
0 likes · 14 min read
Design and Implementation of Banyu's Big Data Permission System
21CTO
21CTO
Dec 9, 2021 · Big Data

Designing a Scalable Big Data Permission System: From Hive to Metabase

BanYu’s early data warehouse lacked any access controls, prompting the creation of a comprehensive big‑data permission system that integrates authentication and authorization across Hive, Presto, HDFS, and Metabase using LDAP, Ranger policies, workflow automation, and both synchronous and asynchronous policy initialization.

AuthorizationBig DataLDAP
0 likes · 16 min read
Designing a Scalable Big Data Permission System: From Hive to Metabase
Big Data Technology Architecture
Big Data Technology Architecture
Nov 23, 2021 · Big Data

Step-by-Step Guide to Setting Up Flink CDC with MySQL, Hudi, and Hive Integration on a Hadoop Cluster

This comprehensive tutorial walks through configuring a Hadoop‑based environment (Flink 1.13.1, Scala 2.11, CDH 6.2.0, Hive 2.1.1, Hudi 0.10), compiling Hudi, setting up Flink and MySQL binlog, creating CDC source and Hudi sink tables, running Flink jobs, and synchronizing the results to Hive partitions for query via Hive and Presto.

CDCFlinkHudi
0 likes · 15 min read
Step-by-Step Guide to Setting Up Flink CDC with MySQL, Hudi, and Hive Integration on a Hadoop Cluster
MaGe Linux Operations
MaGe Linux Operations
Nov 13, 2021 · Information Security

Hive Ransomware Targets Linux: Bugs, New Features, and Industry Shift

Security researchers at ESET reveal that the Hive ransomware group has expanded its attacks to Linux and FreeBSD systems, releasing a buggy yet feature‑rich Linux variant written in Go, while noting a broader industry trend of ransomware operators developing Linux encryptors to compromise virtualized server environments.

GoInformation SecurityVirtualization
0 likes · 4 min read
Hive Ransomware Targets Linux: Bugs, New Features, and Industry Shift
Big Data Technology & Architecture
Big Data Technology & Architecture
Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

MapReduceSQL OptimizationSpark
0 likes · 48 min read
Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage
Tongcheng Travel Technology Center
Tongcheng Travel Technology Center
Oct 18, 2021 · Big Data

Applying and Practicing Apache Hudi on Tongcheng Elong: Architecture, Challenges, and Solutions

This article describes the background, design choices, and practical challenges of using Apache Hudi for data updates on the Tongcheng Elong platform, analyzes three architectural alternatives, details Hudi's core configurations and write strategies, and presents concrete solutions to version compatibility, upsert semantics, insert behavior, partition management, streaming backlog monitoring, and business‑specific requirements, culminating in a productized Hudi service and future roadmap.

HudiUpserthive
0 likes · 18 min read
Applying and Practicing Apache Hudi on Tongcheng Elong: Architecture, Challenges, and Solutions
ITPUB
ITPUB
Sep 13, 2021 · Big Data

MapReduce vs MPP: Choosing the Right Engine for Global Data Warehousing

A team of engineers at MBI debates the merits of MapReduce, MPP, and Hive for their KeepS global data‑warehouse, discussing technical differences, scalability, concurrency, and the feasibility of mixed batch engines while navigating budget and operational constraints.

Cluster ComputingGrid ComputingMPP
0 likes · 20 min read
MapReduce vs MPP: Choosing the Right Engine for Global Data Warehousing
Ctrip Technology
Ctrip Technology
Sep 9, 2021 · Big Data

Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Big DataJanusGraphKafka
0 likes · 16 min read
Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications
Big Data Technology Architecture
Big Data Technology Architecture
Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big DataElasticsearchHBase
0 likes · 9 min read
Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch
DataFunTalk
DataFunTalk
Jul 26, 2021 · Big Data

Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

AWSBig DataFlink
0 likes · 12 min read
Accelerating Hive Daily Tables with Flink: A SmartNews Case Study
UCloud Tech
UCloud Tech
Jul 13, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP Big Data Platform on CentOS

This article walks you through the complete installation and configuration of UCloud's free USDP (UCloud Data Platform) on a three‑node CentOS 7.2‑7.6 cluster, covering environment preparation, package download, repair scripts, MySQL setup, service startup, web UI activation, monitoring, and a quick Hive query example.

CentOSCluster DeploymentHadoop
0 likes · 19 min read
Step‑by‑Step Guide to Deploy UCloud’s Free USDP Big Data Platform on CentOS
dbaplus Community
dbaplus Community
Jul 4, 2021 · Big Data

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

This article explains Didi's end‑to‑end architecture for ingesting MySQL data into Hive using real‑time Binlog collection, a customized Canal component, message queues, HDFS storage, Dquality monitoring, and strategies for handling data drift and sharding in large‑scale big‑data environments.

Big DataCanalhive
0 likes · 13 min read
How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture
Youzan Coder
Youzan Coder
Jun 30, 2021 · Big Data

Online Monitoring Practices for Offline and Real-Time Data at Youzan

Youzan Data Report Center monitors offline batch and real‑time data pipelines using accuracy and timeliness rules, cross‑table checks, upstream‑downstream comparisons, and scheduled alerts to detect anomalies early; since 2021 it has generated over 25 alerts, and plans a unified data‑quality dashboard.

Big DataData QualityFlink
0 likes · 12 min read
Online Monitoring Practices for Offline and Real-Time Data at Youzan
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 21, 2021 · Big Data

Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases

This article provides an in‑depth overview of Apache Kylin, covering its history, mission, core MOLAP principles, technical architecture, step‑by‑step installation (Docker and Hadoop), performance tuning, advanced cube settings, and detailed case studies from major companies such as Baidu, Lianjia, and Didi.

Apache KylinCubeDocker
0 likes · 53 min read
Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases
DataFunTalk
DataFunTalk
Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big DataData LakeFlink
0 likes · 13 min read
Flink + Iceberg 0.11 Practices in Qunar Data Platform
dbaplus Community
dbaplus Community
Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, achieving 85% Spark task share, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively, through a systematic migration process that addressed syntax, UDF, performance, and functional differences between the two engines.

Big DataSQL MigrationSpark
0 likes · 20 min read
How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark
HelloTech
HelloTech
May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

AnalyticsBig Dataclickhouse
0 likes · 20 min read
User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques
Big Data Technology & Architecture
Big Data Technology & Architecture
Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

HadoopMapReducedata-warehouse
0 likes · 41 min read
Hive and Hadoop Interview Questions and Answers
Big Data Technology & Architecture
Big Data Technology & Architecture
Mar 23, 2021 · Big Data

Practical Implementations of Data Lakes: Huawei Production Scenario, Real-Time Financial Data Lake, and Soul's Delta Lake

This article presents a comprehensive overview of data lake implementations, detailing Huawei's production‑scene platform, a real‑time financial data lake architecture using Kafka, Flink and Iceberg, and Soul's Delta Lake practice with Spark, Hive, and custom ETL tools, highlighting design choices, processing flows, and operational considerations.

Data LakeDelta LakeFlink
0 likes · 8 min read
Practical Implementations of Data Lakes: Huawei Production Scenario, Real-Time Financial Data Lake, and Soul's Delta Lake
Big Data Technology Architecture
Big Data Technology Architecture
Mar 11, 2021 · Big Data

Challenges and Optimizations of Hive MetaStore at Kuaishou

This article details how Kuaishou tackled performance, scalability, and stability challenges of Hive MetaStore by introducing a BeaconServer hook architecture, read‑write separation, API refinements, traffic control, and federation designs, resulting in significant query efficiency and service reliability improvements.

FederationRead-Write Separationhive
0 likes · 14 min read
Challenges and Optimizations of Hive MetaStore at Kuaishou
DataFunTalk
DataFunTalk
Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big DataKuaishouMetaStore
0 likes · 15 min read
Hive MetaStore Challenges and Optimizations at Kuaishou
Big Data Technology Architecture
Big Data Technology Architecture
Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta LakeSparkhive
0 likes · 14 min read
Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions
DataFunTalk
DataFunTalk
Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big DataDMPReal-time analytics
0 likes · 20 min read
Design and Implementation of Beike's Data Management Platform (DMP)
Big Data Technology & Architecture
Big Data Technology & Architecture
Feb 1, 2021 · Big Data

Flink 1.12 Enhancements: Full SQL Support, Hive Integration, and Streaming Write to Hive

The article reviews Flink 1.12's major enhancements, including comprehensive SQL capabilities, deep integration with Hive via catalog and streaming support, and a practical code example that demonstrates how to write streaming data into Hive tables while handling partition commits and small‑file merging.

Data IntegrationFlinkStreaming
0 likes · 7 min read
Flink 1.12 Enhancements: Full SQL Support, Hive Integration, and Streaming Write to Hive
Didi Tech
Didi Tech
Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85 % of workloads with 40 % faster execution, 21 % lower CPU and 49 % lower memory usage.

DataMigrationSQLOptimizationSparkSQL
0 likes · 18 min read
Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi
dbaplus Community
dbaplus Community
Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big DataCamusHadoop
0 likes · 16 min read
How Ctrip Built a Scalable Unified Log Framework for Payment Data
Architect
Architect
Dec 27, 2020 · Big Data

Optimizing Billion‑Scale Hive Queries: Partitioning, Indexing, Bucketing, Active‑User Segmentation, and Data Structure Refactoring

This article walks through the challenges of querying a 300‑billion‑row Hive table, analyzes why traditional partitioning, indexing, and bucketing fall short, and presents a practical solution that combines active‑user segmentation and a redesigned array‑based data model to cut query time from hours to minutes.

Big DataData Partitioningdata modeling
0 likes · 10 min read
Optimizing Billion‑Scale Hive Queries: Partitioning, Indexing, Bucketing, Active‑User Segmentation, and Data Structure Refactoring
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

HadoopSmall Filescompression
0 likes · 19 min read
Handling Small Files in Hive: Configuration, Compression, and File Format Optimization
Big Data Technology & Architecture
Big Data Technology & Architecture
Dec 3, 2020 · Big Data

Hive Query Optimization Techniques and Best Practices

This article presents a comprehensive guide to optimizing Hive queries, covering limit adjustments, join strategies, local mode execution, parallelism, strict mode, mapper and reducer tuning, JVM reuse, dynamic partitioning, speculative execution, data skew handling, and small‑file mitigation techniques.

MapReduceSQL Optimizationhive
0 likes · 20 min read
Hive Query Optimization Techniques and Best Practices
Sohu Tech Products
Sohu Tech Products
Dec 2, 2020 · Big Data

Optimizing Hive SQL Lineage Parsing: Techniques, Implementation, and Practical Insights

This article presents a comprehensive overview of Hive SQL lineage parsing, detailing the challenges of data provenance in large‑scale data warehouses, introducing ANTLR‑based parsing techniques, and describing a series of optimizations—including AST pruning, CTE handling, UDF registration, and metadata service integration—to improve both table‑level and column‑level lineage extraction and visualization.

ANTLRSQL lineagedata-warehouse
0 likes · 18 min read
Optimizing Hive SQL Lineage Parsing: Techniques, Implementation, and Practical Insights
360 Tech Engineering
360 Tech Engineering
Nov 6, 2020 · Big Data

Guide to Flink SQL: Features, Scenarios, and Productization

Flink SQL, the high‑level SQL interface for Apache Flink, offers language‑independent, dependency‑free, easy‑to‑use stream processing with advanced features such as DDL, UDFs, time semantics, windowing, pattern matching, and built‑in connectors, supporting data synchronization, batch‑stream fusion, Hive integration, and various product enhancements.

Data IntegrationFlinkReal-Time
0 likes · 11 min read
Guide to Flink SQL: Features, Scenarios, and Productization
DataFunTalk
DataFunTalk
Nov 1, 2020 · Big Data

Flink 1.11 Integration with Hive: New Features and Real‑time Data Warehouse

The article explains how Flink 1.11 deepens its integration with Hive, covering background, new connector features, simplified dependency management, enhanced Hive dialect, streaming writes and reads, temporal table joins, and how these capabilities enable a unified batch‑streaming data warehouse.

Batch‑Streaming IntegrationFlinkStreaming
0 likes · 16 min read
Flink 1.11 Integration with Hive: New Features and Real‑time Data Warehouse
DataFunTalk
DataFunTalk
Oct 9, 2020 · Big Data

NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation

This article examines the pain points of traditional data warehouse platforms, explains the core concepts and advantages of the Iceberg data lake table format, compares it with Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integration with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.

Data LakeETLFlink
0 likes · 13 min read
NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation
Big Data Technology & Architecture
Big Data Technology & Architecture
Sep 24, 2020 · Big Data

HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements

This article presents three HiveSQL optimization case studies—restructuring a large‑scale query with partitioned tables, breaking a complex window‑function query into smaller subsets with joins, and refactoring excessive PERCENTILE_APPROX calls—demonstrating how each change reduces execution time from hours to minutes and improves overall performance.

Big DataHiveSQLPartitioning
0 likes · 9 min read
HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements
Big Data Technology & Architecture
Big Data Technology & Architecture
Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache HudiBig DataData Lake
0 likes · 12 min read
An Overview of Apache Hudi: Architecture, Features, and Query Types
dbaplus Community
dbaplus Community
Sep 1, 2020 · Big Data

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

This article presents a systematic guide to real‑time MySQL binlog ingestion, outlining three core principles—decoupling from business data, handling schema changes, and ensuring traceability—followed by concrete Debezium‑Kafka‑Hive solutions, scenario‑specific tactics, and practical tips for reliable data pipelines.

DebeziumKafkadata ingestion
0 likes · 15 min read
Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive
Huolala Tech
Huolala Tech
Aug 4, 2020 · Big Data

How to Accelerate Hive UDFs by Caching Large Geo Data: A 140× Speed Boost

To dramatically improve Hive UDF performance when converting coordinates to administrative districts, this article compares two implementation strategies, details the technical challenges of repeatedly loading a 157 MB Geo data file, and presents a static‑cached solution that reduces query time from seconds to milliseconds, achieving roughly a 140‑fold speed increase.

Static CachingUDFhive
0 likes · 15 min read
How to Accelerate Hive UDFs by Caching Large Geo Data: A 140× Speed Boost
Big Data Technology & Architecture
Big Data Technology & Architecture
Jul 30, 2020 · Big Data

Understanding Bucket Sampling Queries in Hive

This article explains Hive's bucket sampling syntax, demonstrates how to use the TABLESAMPLE clause with various bucket parameters, provides concrete SQL examples, and clarifies the underlying hash‑based mechanism that determines which rows are returned.

Big DataBucket SamplingTablesample
0 likes · 4 min read
Understanding Bucket Sampling Queries in Hive
Big Data Technology & Architecture
Big Data Technology & Architecture
Jul 19, 2020 · Big Data

An Overview of Hive, HBase Integration, Apache Phoenix, and Lealone in the Big Data Ecosystem

This article explains Hive's role as a Hadoop‑based data warehouse, its integration with HBase, the advantages and drawbacks of that combination, introduces Apache Phoenix as a high‑performance SQL layer on HBase, and describes the open‑source NewSQL database Lealone, providing practical usage scenarios and performance comparisons.

Big DataHBaseLealone
0 likes · 9 min read
An Overview of Hive, HBase Integration, Apache Phoenix, and Lealone in the Big Data Ecosystem
Youzan Coder
Youzan Coder
Jul 1, 2020 · Big Data

Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets

This article explains how HiveCube can replace traditional development for multi‑dimensional aggregation in a data‑warehouse, covering background, theory of cube, with‑cube/rollup/grouping‑sets syntax, grouping_id handling, practical implementation tips, performance tuning, and a comparison with conventional methods.

Big DataCubeGrouping Sets
0 likes · 19 min read
Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big DataData SkewHiveQL
0 likes · 19 min read
Hive Optimization Techniques and Best Practices for Big Data Processing
Big Data Technology Architecture
Big Data Technology Architecture
May 15, 2020 · Big Data

Performance Tuning of Hive on Spark in YARN Mode

This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.

Cluster ConfigurationSparkYARN
0 likes · 11 min read
Performance Tuning of Hive on Spark in YARN Mode
Big Data Technology & Architecture
Big Data Technology & Architecture
Apr 2, 2020 · Big Data

Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets

This article demonstrates how to create Hive tables for student, course, teacher, and score data, generate CSV files, load them into Hive, and provides a comprehensive set of Hive SQL queries covering data retrieval, aggregation, ranking, and statistical analysis for educational datasets.

Big DataQuery Examplesdata-warehouse
0 likes · 21 min read
Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets
Big Data Technology Architecture
Big Data Technology Architecture
Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs, caused by uneven key distribution during joins, group‑by, or COUNT(DISTINCT) operations, can severely slow tasks, and the article explains common scenarios and practical solutions such as using MapJoin, enabling map‑side aggregation, load‑balancing, and rewriting queries to mitigate skew.

Data SkewMapJoinMapReduce
0 likes · 7 min read
Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations
Youzan Coder
Youzan Coder
Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

AirflowBig DataData Governance
0 likes · 20 min read
The Evolution of Youzan’s Data Warehouse in a Big Data Environment
ITPUB
ITPUB
Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big DataReliabilitySpark
0 likes · 16 min read
How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains
ITPUB
ITPUB
Dec 17, 2019 · Databases

Mastering LEFT JOIN: Common Pitfalls and Practical Solutions

This article explains the fundamentals of LEFT JOIN in SQL, illustrates one‑to‑one, one‑to‑many, and many‑to‑many scenarios, compares ON versus WHERE conditions, and provides concrete MySQL and Hive examples with code snippets and visual diagrams to avoid common mistakes.

Join TypesLEFT JOINdatabase
0 likes · 14 min read
Mastering LEFT JOIN: Common Pitfalls and Practical Solutions
ITPUB
ITPUB
Dec 9, 2019 · Fundamentals

Master Date Operations in pandas and SQL: Retrieval, Conversion, and Calculation

This tutorial walks through loading order data into pandas and SQL, then demonstrates how to retrieve current dates, extract date components, convert between readable dates and Unix timestamps, transform between 10‑digit and 8‑digit date formats, and perform date arithmetic using pandas, MySQL, and Hive.

data-processingdate handlingdatetime
0 likes · 16 min read
Master Date Operations in pandas and SQL: Retrieval, Conversion, and Calculation
Big Data Technology & Architecture
Big Data Technology & Architecture
Oct 22, 2019 · Big Data

Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real‑time data verification framework using Flink to generate wide tables, storing detailed records in Elasticsearch or HDFS with Hive for cross‑checking against offline data, ensuring trustworthy metrics for dashboards and stakeholders.

Big DataData verificationElasticsearch
0 likes · 7 min read
Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive
dbaplus Community
dbaplus Community
Jul 10, 2019 · Big Data

How Kuaishou Scales SQL on Hadoop: Architecture, Optimizations, and Lessons Learned

This article explains the SQL‑on‑Hadoop ecosystem—including Hive, Spark, SparkSQL, Presto and other solutions—then details Kuaishou's large‑scale platform architecture, performance bottlenecks, routing logic, high‑availability mechanisms, and a series of concrete optimizations that improve query speed, resource utilization, and operational stability.

SQL on HadoopSparkhigh availability
0 likes · 19 min read
How Kuaishou Scales SQL on Hadoop: Architecture, Optimizations, and Lessons Learned
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 24, 2019 · Big Data

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big DataJOIN optimizationMapReduce
0 likes · 19 min read
Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 17, 2019 · Big Data

Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples

This article introduces Spark SQL fundamentals, including its architecture, DataFrame and Dataset abstractions, query methods, interoperability with RDD, user-defined functions, integration with Hive, data source handling, and provides step‑by‑step Scala code examples for loading data, performing aggregations, and solving common analytical tasks.

DataFramesScalaSparkSQL
0 likes · 15 min read
Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples
Tencent Cloud Developer
Tencent Cloud Developer
May 21, 2019 · Information Security

Design and Implementation of a Cloud Audit Solution for Tencent Cloud Accounts

The article details a scalable, extensible cloud‑audit architecture for Tencent Cloud accounts that stores API logs in a Shanghai‑region COS bucket, processes them with EMR‑based Hive tables and hourly partition scripts, aggregates results into a hot MySQL store, and enables administrators to monitor all sub‑accounts with a real‑time “god view.”

COSEMRPython
0 likes · 13 min read
Design and Implementation of a Cloud Audit Solution for Tencent Cloud Accounts
Big Data Technology & Architecture
Big Data Technology & Architecture
Apr 24, 2019 · Big Data

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big DataHadoophive
0 likes · 11 min read
Hive SQL Optimization Techniques and Best Practices
Big Data Technology & Architecture
Big Data Technology & Architecture
Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big DataHadoopInstallation
0 likes · 8 min read
Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)