Tagged articles

Hive

236 articles · Page 2 of 3

Jan 24, 2022 · Big Data

How to Build a Scalable Big Data Access Control System with Hive, Presto, and Ranger

This article details the design and implementation of a comprehensive big data permission system that integrates Hive, Presto, Hadoop, and Metabase, covering data access methods, authentication choices, Ranger-based authorization, policy management, and automated workflow integration to balance security and efficiency.

Access Control · Apache Ranger · Big Data

0 likes · 16 min read


Big Data Technology & Architecture

Dec 28, 2021 · Big Data

Comprehensive Guide to Spark SQL: Concepts, DataSet/DataFrame, Functions, Optimization and Common Pitfalls

This article provides an in‑depth overview of Spark SQL, covering its architecture, DataSet/DataFrame creation, DSL and SQL usage, integration with Hive, custom UDF/UDAF/Aggregator implementations, handling of small files, Cartesian product detection, and a catalog of useful built‑in functions and window operations.

Big Data · Hive · Spark SQL

0 likes · 29 min read


DataFunTalk

Dec 27, 2021 · Big Data

Comprehensive Big Data Interview Q&A: Hadoop, Spark, Kafka, Hive, and Related Technologies

This article presents a detailed interview-style walkthrough covering Hadoop cluster setup, HDFS components, MapReduce workflow, YARN advantages, Spark fundamentals, Kafka replication, Hive table types, and related big‑data concepts, providing concise explanations and practical insights for data engineers.

Big Data · Hadoop · Hive

0 likes · 20 min read


Big Data Technology & Architecture

Dec 18, 2021 · Big Data

Slowly Changing Dimensions (SCD) – Design Principles, Challenges, and Hive Implementation

This article explains the concept of Slowly Changing Dimensions (SCD), discusses practical design questions, compares three change‑tracking requirements, presents three implementation patterns, and provides detailed Hive/SQL examples for historical data initialization and incremental updates in large‑scale data warehouses.

Big Data · Data Warehouse · Hive

0 likes · 20 min read


Top Architect

Dec 13, 2021 · Big Data

Design and Implementation of BanYu's Big Data Access Control System

This article describes the evolution from an unsecured data warehouse to a comprehensive big‑data access control system at BanYu, detailing the background, data access methods, design goals, authentication and authorization mechanisms, policy configuration, integration with Metabase, and the overall workflow that balances security with efficiency.

Access Control · Big Data · Hive

0 likes · 15 min read


Architecture Digest

Dec 11, 2021 · Big Data

Design and Implementation of BanYu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of BanYu's big data permission system, highlighting how it balances security and efficiency across Hive, Presto, HDFS, and other components.

Access Control · Apache Ranger · Data Security

0 likes · 16 min read


IT Architects Alliance

Dec 11, 2021 · Big Data

Design and Implementation of Banyu's Big Data Permission System

This article describes the background, design goals, authentication and authorization mechanisms, system architecture, policy configuration, and Metabase integration of Banyu's big data permission system, which secures Hive, Presto, HDFS and other data access components using Apache Ranger and LDAP.

Access Control · Apache Ranger · Big Data

0 likes · 14 min read

21CTO

Dec 9, 2021 · Big Data

Designing a Scalable Big Data Permission System: From Hive to Metabase

BanYu’s early data warehouse lacked any access controls, prompting the creation of a comprehensive big‑data permission system that integrates authentication and authorization across Hive, Presto, HDFS, and Metabase using LDAP, Ranger policies, workflow automation, and both synchronous and asynchronous policy initialization.

Authorization · Big Data · Data Security

0 likes · 16 min read


DataFunSummit

Dec 4, 2021 · Big Data

Building a Real-Time Data Warehouse with Flink: Hive Integration, Upsert‑Kafka, and CDC Connectors

This tutorial explains how to use Apache Flink 1.12 to construct a unified streaming‑batch data warehouse by integrating Hive via HiveCatalog and HiveDialect, performing read/write operations, configuring upsert‑Kafka sinks, and leveraging Flink CDC connectors for change data capture from MySQL and other sources.

CDC · Flink · Hive

0 likes · 46 min read


Big Data Technology & Architecture

Nov 30, 2021 · Big Data

Curated Learning Resources for Big Data and Data Engineering

This article compiles a comprehensive list of Chinese-language articles and tutorials covering big‑data technologies such as Flink, Spark, Hive, ClickHouse, data governance, and related interview preparation resources, providing a structured learning path for aspiring data engineers.

Big Data · ClickHouse · Data Governance

0 likes · 4 min read


Big Data Technology & Architecture

Nov 28, 2021 · Big Data

Designing Hive Data Warehouse Schemas: Fact & Dimension Tables, Partitioning, Tag Aggregation, and ID Mapping

This article explains how to design Hive data warehouse schemas, covering fact and dimension table modeling, partitioned storage strategies, tag aggregation techniques, and ID‑mapping implementations using Hive SQL and UDFs to support user profiling and analytics.

Big Data · Data Warehouse · ETL

0 likes · 15 min read


Big Data Technology Architecture

Nov 23, 2021 · Big Data

Step-by-Step Guide to Setting Up Flink CDC with MySQL, Hudi, and Hive Integration on a Hadoop Cluster

This comprehensive tutorial walks through configuring a Hadoop‑based environment (Flink 1.13.1, Scala 2.11, CDH 6.2.0, Hive 2.1.1, Hudi 0.10), compiling Hudi, setting up Flink and MySQL binlog, creating CDC source and Hudi sink tables, running Flink jobs, and synchronizing the results to Hive partitions for query via Hive and Presto.

CDC · Flink · Hive

0 likes · 15 min read


MaGe Linux Operations

Nov 13, 2021 · Information Security

Hive Ransomware Targets Linux: Bugs, New Features, and Industry Shift

Security researchers at ESET reveal that the Hive ransomware group has expanded its attacks to Linux and FreeBSD systems, releasing a buggy yet feature‑rich Linux variant written in Go, while noting a broader industry trend of ransomware operators developing Linux encryptors to compromise virtualized server environments.

Hive · Information Security · Malware Analysis

0 likes · 4 min read


Big Data Technology & Architecture

Oct 23, 2021 · Big Data

Understanding Hive Execution Engines: MapReduce, Tez, and Spark – Principles, Optimization, and Explain Usage

This article provides a comprehensive overview of Hive's execution engines—including MapReduce, Tez, and Spark—detailing their architectures, the six-stage Hive SQL compilation process, practical Explain syntax examples, and extensive tuning parameters for each engine to improve performance in big‑data environments.

EXPLAIN · Hive · MapReduce

0 likes · 48 min read


Tongcheng Travel Technology Center

Oct 18, 2021 · Big Data

Applying and Practicing Apache Hudi on Tongcheng Elong: Architecture, Challenges, and Solutions

This article describes the background, design choices, and practical challenges of using Apache Hudi for data updates on the Tongcheng Elong platform, analyzes three architectural alternatives, details Hudi's core configurations and write strategies, and presents concrete solutions to version compatibility, upsert semantics, insert behavior, partition management, streaming backlog monitoring, and business‑specific requirements, culminating in a productized Hudi service and future roadmap.

Hive · Hudi · Upsert

0 likes · 18 min read


StarRocks

Sep 24, 2021 · Big Data

How Didi Scaled Real‑Time Funnel Analysis with StarRocks: Architecture, Design, and Performance Tips

Didi's data architecture team migrated high‑volume, real‑time funnel analysis from ClickHouse to StarRocks, built a multi‑layer pipeline with Kafka, Flink/Spark, Hive, and materialized views, and achieved sub‑3‑second query times on billions of rows, while outlining future enhancements.

Big Data · Funnel Analysis · Hive

0 likes · 14 min read


DataFunTalk

Sep 20, 2021 · Databases

Using GPLoad to Batch Load HDFS Data into Greenplum: Comparison with Hive and MPP Database Options

The article compares Hive and Greenplum as offline and MPP data‑warehouse solutions, reviews Hive query engine alternatives, and provides a detailed tutorial—including YAML configuration and a shell script—for using GPLoad to import HDFS data into Greenplum.

Big Data · GPLoad · Greenplum

0 likes · 8 min read


ITPUB

Sep 13, 2021 · Big Data

MapReduce vs MPP: Choosing the Right Engine for Global Data Warehousing

A team of engineers at MBI debates the merits of MapReduce, MPP, and Hive for their KeepS global data‑warehouse, discussing technical differences, scalability, concurrency, and the feasibility of mixed batch engines while navigating budget and operational constraints.

Cluster Computing · Grid Computing · Hive

0 likes · 20 min read


Ctrip Technology

Sep 9, 2021 · Big Data

Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Big Data · Hive · JanusGraph

0 likes · 16 min read


Big Data Technology Architecture

Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase

0 likes · 9 min read


DataFunTalk

Jul 26, 2021 · Big Data

Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

AWS · Big Data · Flink

0 likes · 12 min read


Big Data Technology & Architecture

Jul 15, 2021 · Big Data

Understanding Hive Architecture, Execution Flow, and the Shift to Tez and Spark

This article explains Hive's core components, execution architecture, how HiveQL is transformed into MapReduce jobs, the advantages of Tez over MapReduce in Hive 3.0+, and the integration of Spark with Hive for modern big‑data processing.

Data Warehouse · Hive · MapReduce

0 likes · 9 min read


UCloud Tech

Jul 13, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP Big Data Platform on CentOS

This article walks you through the complete installation and configuration of UCloud's free USDP (UCloud Data Platform) on a three‑node CentOS 7.2‑7.6 cluster, covering environment preparation, package download, repair scripts, MySQL setup, service startup, web UI activation, monitoring, and a quick Hive query example.

CentOS · Cluster Deployment · Hadoop

0 likes · 19 min read


Big Data Technology & Architecture

Jul 8, 2021 · Big Data

Using Flink CDC to Write Data into Apache Hudi and Query with Hive and Spark SQL

This guide walks through preparing the environment, creating a MySQL source table, configuring Flink CDC to ingest data into an Apache Hudi table, and then querying the Hudi data using both Hive and Spark‑SQL, including handling of partitions, realtime input formats, and required configuration settings.

CDC · DataPipeline · Flink

0 likes · 10 min read


dbaplus Community

Jul 4, 2021 · Big Data

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

This article explains Didi's end‑to‑end architecture for ingesting MySQL data into Hive using real‑time Binlog collection, a customized Canal component, message queues, HDFS storage, Dquality monitoring, and strategies for handling data drift and sharding in large‑scale big‑data environments.

Big Data · Canal · Hive

0 likes · 13 min read


Youzan Coder

Jun 30, 2021 · Big Data

Online Monitoring Practices for Offline and Real-Time Data at Youzan

Youzan's Data Report Center monitors offline batch and real‑time data pipelines using accuracy and timeliness rules, cross‑table checks, upstream‑downstream comparisons, and scheduled alerts to detect anomalies early. Since 2021 it has generated over 25 alerts, and a unified data‑quality dashboard is planned.

Big Data · Data Quality · Flink

0 likes · 12 min read


dbaplus Community

Jun 23, 2021 · Big Data

How Ctrip Finance Built a Real‑Time Binlog‑Based Data Lake with MySQL‑Hive Sync

This article details Ctrip Finance's end‑to‑end data‑foundation architecture that uses MySQL binlog collection via Canal, Kafka streaming, Spark‑Streaming persistence to HDFS, and a merge process to produce timely MySQL‑Hive snapshots, addressing performance, consistency, and delete‑handling challenges.

Binlog · Hive · Kafka

0 likes · 17 min read


Didi Tech

Jun 22, 2021 · Big Data

MySQL Binlog Real‑time Collection and Hive Ingestion at DiDi: Architecture and Practices

DiDi’s real‑time MySQL‑to‑Hive pipeline captures row‑mode binlog with a custom Canal component, converts it to JSON, streams it via Kafka to HDFS, restores it into Hive tables, and uses Dquality for integrity, achieving millisecond latency for over 19,000 daily sync tasks handling roughly 50 TB of data.

Big Data · Binlog · Canal

0 likes · 13 min read


Big Data Technology & Architecture

Jun 21, 2021 · Big Data

Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases

This article provides an in‑depth overview of Apache Kylin, covering its history, mission, core MOLAP principles, technical architecture, step‑by‑step installation (Docker and Hadoop), performance tuning, advanced cube settings, and detailed case studies from major companies such as Baidu, Lianjia, and Didi.

Apache Kylin · Cube · Docker

0 likes · 53 min read


DataFunTalk

Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big Data · Data Lake · Flink

0 likes · 13 min read


DataFunTalk

Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data · ETL · HBase

0 likes · 17 min read


dbaplus Community

Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, achieving 85% Spark task share, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively, through a systematic migration process that addressed syntax, UDF, performance, and functional differences between the two engines.

Big Data · Hive · Performance Optimization

0 likes · 20 min read


Big Data Technology & Architecture

May 23, 2021 · Big Data

Comprehensive Guide to Hive: Fundamentals, SQL Syntax, Performance Tuning, and Interview Preparation

This extensive article introduces Hive as a Hadoop‑based data warehouse, explains its architecture, core concepts, DDL/DML syntax, functions, performance‑optimization techniques, data‑skew handling, and provides a collection of common interview questions for Hive practitioners.

Data Warehouse · Hadoop · Hive

0 likes · 66 min read


HelloTech

May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

Analytics · Big Data · ClickHouse

0 likes · 20 min read


DataFunTalk

Apr 26, 2021 · Big Data

Detailed Design and Practical Application of Apache Iceberg at NetEase Cloud Music

This article explains the motivations behind Apache Iceberg, its design principles such as snapshot and MVCC, compares it with Hive, and describes how NetEase Cloud Music adopted Iceberg to improve metadata handling, query performance, and operational stability for massive daily log data.

Apache Iceberg · Big Data · Data Lake

0 likes · 13 min read


Big Data Technology & Architecture

Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Data Warehouse · Hadoop · Hive

0 likes · 41 min read


Big Data Technology Architecture

Apr 13, 2021 · Big Data

Hive Metadata Migration and Merging Tool for Consolidating Multiple Hive Metastores

This article describes how NetEase developed a Hive metadata migration and merging tool that consolidates metadata from multiple independent Hive clusters into a single Hive metastore without moving HDFS data, detailing the challenges, ID handling, database operations, and step‑by‑step migration process.

Data Migration · Hive · MetaStore

0 likes · 12 min read


Big Data Technology & Architecture

Mar 23, 2021 · Big Data

Practical Implementations of Data Lakes: Huawei Production Scenario, Real-Time Financial Data Lake, and Soul's Delta Lake

This article presents a comprehensive overview of data lake implementations, detailing Huawei's production‑scene platform, a real‑time financial data lake architecture using Kafka, Flink and Iceberg, and Soul's Delta Lake practice with Spark, Hive, and custom ETL tools, highlighting design choices, processing flows, and operational considerations.

Data Lake · Delta Lake · Flink

0 likes · 8 min read


Big Data Technology Architecture

Mar 11, 2021 · Big Data

Challenges and Optimizations of Hive MetaStore at Kuaishou

This article details how Kuaishou tackled performance, scalability, and stability challenges of Hive MetaStore by introducing a BeaconServer hook architecture, read‑write separation, API refinements, traffic control, and federation designs, resulting in significant query efficiency and service reliability improvements.

Federation · Hive · Read‑Write Separation

0 likes · 14 min read


DataFunTalk

Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big Data · Hive · Kuaishou

0 likes · 15 min read


Big Data Technology Architecture

Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta Lake · Hive · Spark

0 likes · 14 min read


dbaplus Community

Feb 23, 2021 · Big Data

How NetEase Game Teams Built a Scalable Flink‑Based Streaming ETL Platform

This article explains how NetEase games collect heterogeneous logs, design a Flink‑driven streaming ETL pipeline, handle schema‑free sources, implement Python UDFs with Jython, optimize HDFS writes, manage real‑time and offline warehouses, and share practical tuning and fault‑tolerance techniques.

ETL · Flink · Hive

0 likes · 22 min read


DataFunTalk

Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big Data · DMP · Data Engineering

0 likes · 20 min read


Big Data Technology & Architecture

Feb 1, 2021 · Big Data

Flink 1.12 Enhancements: Full SQL Support, Hive Integration, and Streaming Write to Hive

The article reviews Flink 1.12's major enhancements, including comprehensive SQL capabilities, deep integration with Hive via catalog and streaming support, and a practical code example that demonstrates how to write streaming data into Hive tables while handling partition commits and small‑file merging.

Data Integration · Flink · Hive

0 likes · 7 min read


Didi Tech

Jan 25, 2021 · Big Data

Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated over 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks, fixing syntax and UDF differences while adding features such as small‑file merging and enhanced partition pruning, resulting in Spark handling 85 % of workloads with 40 % faster execution, 21 % lower CPU and 49 % lower memory usage.

DataMigration · Hive · Performance

0 likes · 18 min read


Big Data Technology & Architecture

Jan 10, 2021 · Big Data

Integrating Apache Flink 1.12 with Hive: Configuration, Catalog, Planner, and UDF Usage

This guide explains how to integrate Flink 1.12 with Hive using HiveCatalog, covering required dependencies, Blink planner configuration, SQL dialect switching, Hive UDF support, temporal table joins, and provides complete code snippets for a streaming‑batch unified data warehouse solution.

Blink Planner · Flink · Hive

0 likes · 16 min read


dbaplus Community

Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big Data · Camus · Hadoop

0 likes · 16 min read


Architect

Dec 27, 2020 · Big Data

Optimizing Billion‑Scale Hive Queries: Partitioning, Indexing, Bucketing, Active‑User Segmentation, and Data Structure Refactoring

This article walks through the challenges of querying a 300‑billion‑row Hive table, analyzes why traditional partitioning, indexing, and bucketing fall short, and presents a practical solution that combines active‑user segmentation and a redesigned array‑based data model to cut query time from hours to minutes.

Big Data · Data Partitioning · Hive

0 likes · 10 min read


Big Data Technology & Architecture

Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

Hadoop · Hive · Small Files

0 likes · 19 min read


Big Data Technology & Architecture

Dec 3, 2020 · Big Data

Hive Query Optimization Techniques and Best Practices

This article presents a comprehensive guide to optimizing Hive queries, covering limit adjustments, join strategies, local mode execution, parallelism, strict mode, mapper and reducer tuning, JVM reuse, dynamic partitioning, speculative execution, data skew handling, and small‑file mitigation techniques.

Hive · MapReduce · Performance Tuning

0 likes · 20 min read


Sohu Tech Products

Dec 2, 2020 · Big Data

Optimizing Hive SQL Lineage Parsing: Techniques, Implementation, and Practical Insights

This article presents a comprehensive overview of Hive SQL lineage parsing, detailing the challenges of data provenance in large‑scale data warehouses, introducing ANTLR‑based parsing techniques, and describing a series of optimizations—including AST pruning, CTE handling, UDF registration, and metadata service integration—to improve both table‑level and column‑level lineage extraction and visualization.

ANTLR · Data Warehouse · Hive

0 likes · 18 min read


Big Data Technology & Architecture

Nov 25, 2020 · Big Data

Understanding ORC File Format and Its Use in Hive and Java

This article explains the ORC (Optimized Row Columnar) file format, its advantages, internal structure, data model, compression mechanisms, and how to create Hive tables and write ORC files using Java, providing practical code examples and reference resources.

Columnar Storage · Data Warehouse · Hive

0 likes · 15 min read


Big Data Technology & Architecture

Nov 10, 2020 · Big Data

Implementing CDC‑Based Data Warehouse Synchronization with Canal and Camus

This article explains how to replace daily offline MySQL‑to‑Hive sync with a CDC pipeline using Alibaba’s Canal to capture binlog events, Kafka for transport, and LinkedIn’s Camus (via a custom writer) to load data into Hive, detailing configuration and deployment steps.

CDC · Camus · Canal

0 likes · 14 min read


360 Tech Engineering

Nov 6, 2020 · Big Data

Guide to Flink SQL: Features, Scenarios, and Productization

Flink SQL, the high‑level SQL interface for Apache Flink, offers language‑independent, dependency‑free, easy‑to‑use stream processing with advanced features such as DDL, UDFs, time semantics, windowing, pattern matching, and built‑in connectors, supporting data synchronization, batch‑stream fusion, Hive integration, and various product enhancements.

Data Integration · Flink · Hive

0 likes · 11 min read


Big Data Technology & Architecture

Nov 6, 2020 · Big Data

Integrating Flink SQL with Apache Zeppelin: Installation, Configuration, and Usage

This guide explains how to set up Apache Zeppelin as an interactive notebook for Flink SQL, covering download, environment configuration, Zeppelin and Flink interpreter settings on YARN, Hive integration, and step‑by‑step testing of streaming SQL queries.

Flink · Hive · SQL

0 likes · 11 min read

DataFunTalk

Nov 1, 2020 · Big Data

Flink 1.11 Integration with Hive: New Features and Real‑time Data Warehouse

The article explains how Flink 1.11 deepens its integration with Hive, covering background, new connector features, simplified dependency management, enhanced Hive dialect, streaming writes and reads, temporal table joins, and how these capabilities enable a unified batch‑streaming data warehouse.

Batch‑Streaming Integration · Data Warehouse · Flink

0 likes · 16 min read

Big Data Technology & Architecture

Nov 1, 2020 · Big Data

Hive Performance Tuning: Parallel Execution, Strict Mode, JVM Reuse, and Speculative Execution

This article explains Hive performance tuning techniques, including enabling parallel execution, configuring strict mode to prevent risky queries, reusing JVMs to reduce overhead, and using speculative execution to mitigate slow tasks, with configuration examples and practical considerations.

Big Data · Hive · JVM Reuse

0 likes · 8 min read

Big Data Technology & Architecture

Oct 31, 2020 · Big Data

Hive Performance Tuning: Understanding Map and Reduce Counts

This article explains how Hive determines the number of map and reduce tasks based on input file size and block configuration, discusses when to increase or decrease map counts, and provides practical commands for adjusting reducer settings to optimize large‑scale data processing.
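As a rough illustration of the reducer side of this tuning, the sketch below mirrors the commonly documented estimate: input size divided by `hive.exec.reducers.bytes.per.reducer`, capped by `hive.exec.reducers.max`. The function name and the default values used here (256 MB and 1009, the defaults in recent Hive releases) are assumptions for illustration and are not taken from the article.

```python
import math

def estimate_reducers(input_bytes,
                      bytes_per_reducer=256 * 1024**2,  # assumed default of hive.exec.reducers.bytes.per.reducer
                      max_reducers=1009):               # assumed default of hive.exec.reducers.max
    """Approximate how many reducers Hive would pick for a given input size."""
    if input_bytes <= 0:
        return 1
    wanted = math.ceil(input_bytes / bytes_per_reducer)
    # Never fewer than one reducer, never more than the configured cap.
    return max(1, min(max_reducers, wanted))

# 10 GiB of input at 256 MiB per reducer
print(estimate_reducers(10 * 1024**3))
```

Lowering the bytes-per-reducer value raises the reducer count for the same input, which is the usual lever when reducers are too few for a large job.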

Big Data · Hive · MapReduce

0 likes · 6 min read

DataFunTalk

Oct 9, 2020 · Big Data

NetEase’s Data Lake Iceberg: Challenges, Core Principles, and Practical Implementation

This article examines the pain points of traditional data warehouse platforms, explains the core concepts and advantages of the Iceberg data lake table format, compares it with Metastore, reviews the current Iceberg community ecosystem, and details NetEase’s practical integration with Hive, Impala, and Flink to improve ETL efficiency and support unified batch‑stream processing.

Data Lake · ETL · Flink

0 likes · 13 min read

Big Data Technology & Architecture

Sep 24, 2020 · Big Data

HiveSQL Classic Optimization Cases: Partitioning, Subset Decomposition, and Percentile Approximation Improvements

This article presents three HiveSQL optimization case studies—restructuring a large‑scale query with partitioned tables, breaking a complex window‑function query into smaller subsets with joins, and refactoring excessive PERCENTILE_APPROX calls—demonstrating how each change reduces execution time from hours to minutes and improves overall performance.

Big Data · Hive · HiveSQL

0 likes · 9 min read

Big Data Technology & Architecture

Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

dbaplus Community

Sep 1, 2020 · Big Data

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

This article presents a systematic guide to real‑time MySQL binlog ingestion, outlining three core principles—decoupling from business data, handling schema changes, and ensuring traceability—followed by concrete Debezium‑Kafka‑Hive solutions, scenario‑specific tactics, and practical tips for reliable data pipelines.

Debezium · Hive · Kafka

0 likes · 15 min read

Big Data Technology & Architecture

Aug 31, 2020 · Big Data

Integration Methods of Hive and Spark SQL (Potential Interview Topics)

This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.

Big Data · Data Integration · Hive

0 likes · 7 min read

Big Data Technology & Architecture

Aug 23, 2020 · Big Data

Integrating Flink 1.11 with Hive Streaming, Kafka, and Table API

This article demonstrates how to use Flink 1.11's enhanced Hive integration to stream data from a Kafka source, write it into partitioned Hive tables with checkpoint‑driven commits, and read Hive tables as a continuous source using dynamic table options and table hints.

Big Data · Flink · Hive

0 likes · 13 min read

Big Data Technology Architecture

Aug 13, 2020 · Big Data

iQIYI’s Adoption of Apache Kylin for OLAP: Architecture, Optimizations, and Future Plans

The article details iQIYI’s migration from a Hive + MySQL OLAP stack to Apache Kylin, describing the system’s architecture, typical use cases, performance gains, independent HBase deployment, service platform for monitoring, and future plans such as automated cube building and clustering.

Apache Kylin · Cube · HBase

0 likes · 13 min read

Huolala Tech

Aug 4, 2020 · Big Data

How to Accelerate Hive UDFs by Caching Large Geo Data: A 140× Speed Boost

To dramatically improve Hive UDF performance when converting coordinates to administrative districts, this article compares two implementation strategies, details the technical challenges of repeatedly loading a 157 MB Geo data file, and presents a static‑cached solution that reduces query time from seconds to milliseconds, achieving roughly a 140‑fold speed increase.

Hive · Performance Optimization · Static Caching

0 likes · 15 min read

Big Data Technology & Architecture

Jul 30, 2020 · Big Data

Understanding Bucket Sampling Queries in Hive

This article explains Hive's bucket sampling syntax, demonstrates how to use the TABLESAMPLE clause with various bucket parameters, provides concrete SQL examples, and clarifies the underlying hash‑based mechanism that determines which rows are returned.
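The hash‑based selection can be sketched in a few lines. The example below assumes an integer sampling column, for which the hash is taken to be the value itself, and keeps a row for `TABLESAMPLE(BUCKET x OUT OF y ON col)` when the non‑negative hash modulo `y` equals `x - 1`. The function name `bucket_sample` and the sample rows are hypothetical, and this is a simplified model rather than Hive's actual implementation.

```python
def bucket_sample(rows, key, x, y):
    """Simplified model of TABLESAMPLE(BUCKET x OUT OF y ON key) for integer keys."""
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    # Mask to a non-negative 31-bit value, then pick the rows that fall in bucket x (1-based).
    return [r for r in rows if (r[key] & 0x7FFFFFFF) % y == x - 1]

rows = [{"id": i} for i in range(1, 11)]   # hypothetical table with ids 1..10
sample = bucket_sample(rows, "id", 1, 4)   # BUCKET 1 OUT OF 4 ON id
print([r["id"] for r in sample])
```

Under this model, each of the `y` buckets receives roughly `1/y` of the rows when the key values are evenly spread.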

Big Data · Bucket Sampling · Hive

0 likes · 4 min read

Big Data Technology & Architecture

Jul 29, 2020 · Big Data

Sqoop Tutorial: Importing and Exporting Data between Relational Databases, HDFS, Hive, and HBase

This article provides a comprehensive guide to using Sqoop for importing data from relational databases into HDFS, Hive, and HBase, as well as exporting data back to databases, covering command syntax, options, and practical examples for big‑data workflows.

Big Data · HBase · HDFS

0 likes · 8 min read

Big Data Technology & Architecture

Jul 19, 2020 · Big Data

An Overview of Hive, HBase Integration, Apache Phoenix, and Lealone in the Big Data Ecosystem

This article explains Hive's role as a Hadoop‑based data warehouse, its integration with HBase, the advantages and drawbacks of that combination, introduces Apache Phoenix as a high‑performance SQL layer on HBase, and describes the open‑source NewSQL database Lealone, providing practical usage scenarios and performance comparisons.

Big Data · Data Warehouse · HBase

0 likes · 9 min read

Youzan Coder

Jul 1, 2020 · Big Data

Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets

This article explains how HiveCube can replace traditional development for multi‑dimensional aggregation in a data‑warehouse, covering background, theory of cube, with‑cube/rollup/grouping‑sets syntax, grouping_id handling, practical implementation tips, performance tuning, and a comparison with conventional methods.
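To make the relationship between the three syntaxes concrete, the sketch below enumerates the grouping sets that `WITH CUBE` and `WITH ROLLUP` expand to for a list of dimensions. The function names and the dimension names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def cube_grouping_sets(dims):
    """WITH CUBE: every subset of the dimensions, 2**n grouping sets in total."""
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]

def rollup_grouping_sets(dims):
    """WITH ROLLUP: each leading prefix of the dimensions, n + 1 grouping sets."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]

dims = ["province", "city", "shop"]   # hypothetical dimension columns
print(len(cube_grouping_sets(dims)), len(rollup_grouping_sets(dims)))
```

An explicit `GROUPING SETS` clause corresponds to listing only the tuples that are actually needed, which keeps the amount of aggregated output down compared with a full cube.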

Big Data · Cube · Data Warehouse

0 likes · 19 min read

Big Data Technology & Architecture

Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · Hive

0 likes · 19 min read

Big Data Technology & Architecture

Jun 13, 2020 · Big Data

SQL Approach to Identify Continuous User Activity Periods in Big Data

This article demonstrates how to use SQL, including dense_rank and date arithmetic, to detect users who have recorded a specific event for at least seven consecutive days within a month, providing step‑by‑step queries and a complete combined statement.
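The idea behind the rank‑and‑subtract trick can be shown outside SQL as well. In the minimal sketch below, each distinct activity date is ranked per user (the role `dense_rank` plays in the query), and the date minus its rank stays constant across a run of consecutive days, so grouping on that difference yields the streaks. The function names and the sample events are hypothetical.

```python
from datetime import date, timedelta
from itertools import groupby

def longest_streak(days):
    """Length of the longest run of consecutive calendar days."""
    unique = sorted(set(days))
    # date - rank is identical for all dates inside one consecutive run
    keyed = [(d - timedelta(days=rank), d) for rank, d in enumerate(unique, start=1)]
    runs = (len(list(group)) for _, group in groupby(keyed, key=lambda pair: pair[0]))
    return max(runs, default=0)

def users_with_streak(events, min_days=7):
    """Users whose activity covers at least min_days consecutive days."""
    by_user = {}
    for user, day in events:
        by_user.setdefault(user, []).append(day)
    return sorted(u for u, ds in by_user.items() if longest_streak(ds) >= min_days)

# Hypothetical events: user "a" is active 7 days in a row, user "b" is not.
events = [("a", date(2020, 6, 1) + timedelta(days=i)) for i in range(7)]
events += [("b", date(2020, 6, 1)), ("b", date(2020, 6, 3))]
print(users_with_streak(events))
```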

Hive · SQL · continuous days

0 likes · 5 min read

Big Data Technology Architecture

May 15, 2020 · Big Data

Performance Tuning of Hive on Spark in YARN Mode

This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.

Cluster Configuration · Hive · Performance Tuning

0 likes · 11 min read

Big Data Technology & Architecture

Apr 25, 2020 · Big Data

Integrating SparkSQL with Hive: Configuration, MetaStore Setup, and Example Scala Code

This article explains the differences between Spark on Hive and Hive on Spark, then provides step‑by‑step instructions for configuring Hive MetaStore, setting up SparkSQL to use Hive, and demonstrates a complete Scala program that creates a Hive table, loads data, and queries it.

Big Data · Data Integration · Hive

0 likes · 7 min read

Big Data Technology & Architecture

Apr 9, 2020 · Big Data

Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data · Hadoop · Hive

0 likes · 4 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Spark Job Execution Principles and Parameter Tuning for Hive on Spark

This article explains how Spark jobs run on YARN, describes the impact of stages, shuffle and task parallelism, and provides detailed recommendations for tuning Spark executor, memory, core, and parallelism settings to dramatically improve Hive‑on‑Spark TPCx‑BB benchmark performance on large datasets.

Big Data · Hive · Spark

0 likes · 12 min read

Big Data Technology & Architecture

Apr 2, 2020 · Big Data

Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets

This article demonstrates how to create Hive tables for student, course, teacher, and score data, generate CSV files, load them into Hive, and provides a comprehensive set of Hive SQL queries covering data retrieval, aggregation, ranking, and statistical analysis for educational datasets.

Big Data · Data Warehouse · Hive

0 likes · 21 min read

Big Data Technology Architecture

Mar 19, 2020 · Big Data

Handling Data Skew in Hive: Join, Group By, and COUNT(DISTINCT) Optimizations

Data skew in Hive MapReduce jobs arises when keys are unevenly distributed during joins, GROUP BY, or COUNT(DISTINCT) operations, and it can severely slow tasks. The article walks through common scenarios and practical remedies such as using MapJoin, enabling map‑side aggregation, load‑balancing, and rewriting queries to mitigate skew.

Data Skew · Hive · MapJoin

0 likes · 7 min read

Youzan Coder

Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Governance

0 likes · 20 min read

Big Data Technology & Architecture

Mar 8, 2020 · Big Data

Hive on Spark Tuning Parameters and Best Practices

This article explains how to tune Hive on Spark by adjusting driver, executor, and Hive configuration parameters—including CPU cores, memory allocations, dynamic allocation, and join thresholds—to achieve optimal performance when running on YARN.

Big Data · Hive · Performance Tuning

0 likes · 7 min read

Ctrip Technology

Feb 20, 2020 · Big Data

Ctrip Flight Ticket Data Warehouse: Architecture, Technology Stack, and Practical Practices

This article outlines Ctrip's flight ticket data warehouse evolution, current big‑data technology stack, data synchronization methods, layered architecture, quality monitoring system, and a real‑time price anomaly detection case, providing practical insights for building scalable, reliable data warehousing solutions.

Ctrip · Data Quality · Data Warehouse

0 likes · 20 min read

DataFunTalk

Feb 19, 2020 · Big Data

Design and Integration of Flink Batch Processing with Hive: Architecture, Features, and Performance Evaluation

This article presents the design of Flink's batch processing architecture, its integration with Hive through a unified Catalog API, details the enhancements in Flink 1.10, outlines future work, and reports a performance test showing roughly seven‑fold speedup over Hive on MapReduce.

Big Data · Catalog API · Flink

0 likes · 9 min read

Big Data Technology & Architecture

Jan 13, 2020 · Big Data

Understanding ORC File Format in Hive: Structure, Storage, Indexes, Compression, and Configuration

This article explains the ORC (Optimized Row Columnar) file format used in Hive, covering its architecture, stripe and column storage, handling of complex data types, indexing mechanisms, compression streams, memory management, and key configuration parameters.

Big Data · File Format · Hive

0 likes · 14 min read

dbaplus Community

Jan 1, 2020 · Big Data

How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline

Facebook migrated a massive, multi‑stage Hive‑based entity ranking pipeline to a single Spark job, detailing the challenges of scaling to 20 TB inputs, the reliability fixes, performance optimizations, and the resulting 4‑6× CPU speedup and reduced latency.

Big Data · Hive · Reliability

0 likes · 16 min read

ITPUB

Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization

0 likes · 16 min read

ITPUB

Dec 17, 2019 · Databases

Mastering LEFT JOIN: Common Pitfalls and Practical Solutions

This article explains the fundamentals of LEFT JOIN in SQL, illustrates one‑to‑one, one‑to‑many, and many‑to‑many scenarios, compares ON versus WHERE conditions, and provides concrete MySQL and Hive examples with code snippets and visual diagrams to avoid common mistakes.

Hive · Join Types · LEFT JOIN

0 likes · 14 min read

ITPUB

Dec 9, 2019 · Fundamentals

Master Date Operations in pandas and SQL: Retrieval, Conversion, and Calculation

This tutorial walks through loading order data into pandas and SQL, then demonstrates how to retrieve current dates, extract date components, convert between readable dates and Unix timestamps, transform between 10‑digit and 8‑digit date formats, and perform date arithmetic using pandas, MySQL, and Hive.

Hive · MySQL · Pandas

0 likes · 16 min read

Big Data Technology & Architecture

Oct 22, 2019 · Big Data

Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real‑time data verification framework using Flink to generate wide tables, storing detailed records in Elasticsearch or HDFS with Hive for cross‑checking against offline data, ensuring trustworthy metrics for dashboards and stakeholders.

Big Data · Data verification · Elasticsearch

0 likes · 7 min read

Big Data Technology & Architecture

Oct 8, 2019 · Big Data

Real‑time MySQL Binlog Capture and Offline Hive Restoration for Data Warehouse Production

This article describes a complete solution that uses Alibaba's Canal for real‑time MySQL binlog collection, Kafka for transport, and a customized Camus pipeline to load and merge binlog data into Hive, addressing performance, consistency, and delete‑event challenges in large‑scale data warehousing.

Binlog · Camus · Canal

0 likes · 12 min read

Big Data Technology & Architecture

Sep 13, 2019 · Big Data

Differences and Relationship Between HBase and Hive in Big Data Architecture

The article explains that HBase and Hive occupy distinct roles in big‑data systems—HBase handles real‑time random queries on massive detail data, while Hive provides batch‑oriented SQL‑based processing on HDFS—and describes how they are typically combined in a data pipeline.

Big Data · Data Architecture · HBase

0 likes · 5 min read

Big Data Technology & Architecture

Jul 17, 2019 · Big Data

How to Write Spark DataFrames to Hive Tables and Partitions

This article explains how to persist Spark DataFrames into Hive tables and specific partitions, covering the relevant write APIs, the need to select a database, and providing step‑by‑step Scala code examples for both Spark 1.6 and Spark 2.x versions, along with Hive table creation syntax.

Big Data · Hive · SQL

0 likes · 10 min read

Big Data Technology Architecture

Jul 16, 2019 · Big Data

Optimizing HBase‑to‑Hive Data Transfer with SnapshotScanMR to Reduce RegionServer Load

The article describes how a large‑scale ETL process that previously used HBaseStorageHandler caused severe region server pressure, and how a new HBase‑to‑Hive task based on SnapshotScanMR was designed to bypass region servers, halve execution time, and double scanning performance.

ETL · HBase · Hive

0 likes · 6 min read

dbaplus Community

Jul 10, 2019 · Big Data

How Kuaishou Scales SQL on Hadoop: Architecture, Optimizations, and Lessons Learned

This article explains the SQL‑on‑Hadoop ecosystem—including Hive, Spark, SparkSQL, Presto and other solutions—then details Kuaishou's large‑scale platform architecture, performance bottlenecks, routing logic, high‑availability mechanisms, and a series of concrete optimizations that improve query speed, resource utilization, and operational stability.

High Availability · Hive · SQL on Hadoop

0 likes · 19 min read

Zhongtong Tech

Jul 5, 2019 · Big Data

How SnapshotScanMR Doubles HBase‑to‑Hive ETL Speed and Relieves Cluster Load

This article explains how leveraging HBase's SnapshotScanMR feature to create a custom hbase2hiveBySnapshot task dramatically reduces region server pressure, halves ETL execution time, and improves cluster stability for large‑scale data back‑fill operations.

Big Data · ETL · HBase

0 likes · 6 min read

Big Data Technology & Architecture

Jun 24, 2019 · Big Data

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big Data · Hive · MapReduce

0 likes · 19 min read

Big Data Technology & Architecture

Jun 17, 2019 · Big Data

Understanding Spark SQL: Concepts, Queries, Data Sources, and Practical Examples

This article introduces Spark SQL fundamentals, including its architecture, DataFrame and Dataset abstractions, query methods, interoperability with RDD, user-defined functions, integration with Hive, data source handling, and provides step‑by‑step Scala code examples for loading data, performing aggregations, and solving common analytical tasks.

DataFrames · Hive · SQL

0 likes · 15 min read

Big Data Technology Architecture

Jun 9, 2019 · Big Data

An Introduction to Apache Parquet: Architecture, Data Model, File Format, and Basic Operations

This article provides a comprehensive overview of Apache Parquet, covering its purpose, architectural components, nested data model, file structure, practical Hive commands for creating and inspecting Parquet tables, and a brief introduction to the TPC‑DS benchmark for performance testing.

Columnar Storage · Hive · Parquet

0 likes · 8 min read

Tencent Cloud Developer

May 21, 2019 · Information Security

Design and Implementation of a Cloud Audit Solution for Tencent Cloud Accounts

The article details a scalable, extensible cloud‑audit architecture for Tencent Cloud accounts that stores API logs in a Shanghai‑region COS bucket, processes them with EMR‑based Hive tables and hourly partition scripts, aggregates results into a hot MySQL store, and enables administrators to monitor all sub‑accounts with a real‑time “god view.”

COS · EMR · Hive

0 likes · 13 min read

Big Data Technology & Architecture

Apr 24, 2019 · Big Data

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big Data · Hadoop · Hive

0 likes · 11 min read

Big Data Technology & Architecture

Apr 23, 2019 · Databases

Implementing Row-to-Column Pivot in Hive: Traditional and Map Approaches

This article explains how to perform row-to-column transformations (pivot) in Hive using two methods: a traditional SQL approach mimicking Oracle/SQL Server pivot syntax and a more concise map-based technique, comparing their syntax, performance, and memory considerations.

Big Data · Hive · Pivot

0 likes · 3 min read
