Tagged articles

Hive

236 articles · Page 3 of 3

Apr 21, 2019 · Big Data

Overview of Hive Data Warehouse, Its Architecture, Query Processing, and Comparison with Impala

This article gives an overview of Hive as a Hadoop‑based data warehouse, explains its architecture, query‑to‑MapReduce translation, and high‑availability design, and compares its batch‑oriented processing with Impala's low‑latency SQL engine for big data analytics.

Big Data · Data Warehouse · High Availability

0 likes · 15 min read

Big Data Technology & Architecture

Apr 20, 2019 · Big Data

Weekly Hadoop Knowledge Points: Compression Formats, MapReduce Join, Hive Setup, and YARN Capacity Scheduler

This weekly bulletin summarizes four Hadoop knowledge points—compression formats, MapReduce join techniques, Hive installation, and YARN Capacity Scheduler—while also sharing personal updates about a PhD graduation, the upcoming May Day holiday, and a request for likes and shares.

Big Data · Hadoop · Hive

0 likes · 2 min read

Big Data Technology & Architecture

Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Hive

0 likes · 8 min read

Youzan Coder

Mar 22, 2019 · Big Data

Design and Implementation of a DataX‑Based Data Synchronization Platform at Youzan

Youzan replaced Sqoop with a customized DataX‑based platform that integrates with its offline scheduler to reliably sync MySQL, HBase, Elasticsearch and file sources to Hive. The platform handles schema changes, sharding, rate‑limiting and logging, and processes billions of rows daily with high stability.

DataX · ETL · Hive

0 likes · 15 min read

Beike Product & Technology

Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, and core capabilities (data syncing, metadata automation, and real‑time subscriptions), as well as its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hive

0 likes · 13 min read

dbaplus Community

Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · Data synchronization · DataX

0 likes · 14 min read

360 Quality & Efficiency

Jan 2, 2019 · Big Data

Understanding ETL and Data Warehouses: A Beginner’s Guide

This article introduces the fundamentals of Business Intelligence, explains what ETL and data warehouses are, compares them with traditional databases, and outlines their main characteristics and the popular tools, such as Hive, used in modern big‑data environments.

BI · Big Data · Data Integration

0 likes · 5 min read

Big Data Technology & Architecture

Dec 31, 2018 · Big Data

Overview of the Big Data Ecosystem and Core Technologies

This article provides a comprehensive overview of the big data ecosystem, explaining key components such as Hadoop, HDFS, Spark, Hive, Pig, HBase, and related tools, and describes how they work together to store, process, and analyze massive datasets efficiently.

Big Data · Hadoop · Hive

0 likes · 16 min read

ITPUB

Dec 10, 2018 · Big Data

How Meituan Syncs MySQL to Hive in Real-Time Using Binlog, Canal, and Camus

This article explains Meituan's architecture for accurately and efficiently moving MySQL data into a Hive data warehouse by capturing binlog streams with Canal, transporting them via Kafka, and restoring them offline with Camus and a merge process that handles inserts, updates, and deletes.

Binlog · Hive · Kafka

0 likes · 14 min read

Meituan Technology Team

Dec 6, 2018 · Big Data

Real-time Binlog Collection and Offline MySQL Data Restoration for Data Warehousing

The article presents a CDC solution that combines Alibaba’s Canal for real‑time MySQL binlog capture into Kafka with LinkedIn’s Camus for hourly Kafka‑to‑Hive loading, then merges snapshots and incremental binlog data to accurately and efficiently rebuild ODS tables, supporting sharding and delete events.

Binlog · CDC · Camus

0 likes · 14 min read

Programmer DD

Nov 7, 2018 · Big Data

Choosing the Right SQL Engine for Big Data: A Practical Guide

This article explores various SQL engines and storage options for big‑data workloads, compares their performance and capabilities, shows practical code examples, and offers guidance on writing efficient SQL in complex data environments.

Big Data · Data Engineering · Hive

0 likes · 6 min read

360 Quality & Efficiency

Sep 25, 2018 · Databases

Comparison of Hive, MongoDB, and Redis: Features, Use Cases, and Characteristics

This article provides a concise overview of three data storage solutions—Hive, MongoDB, and Redis—detailing their core concepts, operational principles, typical use cases, and key characteristics to help developers choose the appropriate technology for various workloads.

Hive · NoSQL · Redis

0 likes · 6 min read

Meitu Technology

Aug 14, 2018 · Big Data

Meitu Data Platform Architecture and Practices

Meitu’s data platform, serving dozens of apps with 500 million monthly active users and billions of daily events, combines the Arachnia log‑collection system, Kafka ingestion, multi‑layer storage (HDFS, MongoDB, HBase, Elasticsearch), offline Hive/MapReduce processing and real‑time Storm/Flink/Naix pipelines, supported by data‑workshop tools, staged evolution for scalability, and robust security and query‑validation mechanisms.

Big Data · Data Engineering · Data Platform

0 likes · 16 min read

dbaplus Community

Aug 6, 2018 · Big Data

Understanding RAID, HDFS, and MapReduce: From Storage to Distributed Computing

This article explains the storage challenges of big data, introduces RAID levels and their trade‑offs, describes the HDFS architecture with NameNode and DataNode replication, details the MapReduce programming model and execution flow, and shows how Hive translates SQL queries into MapReduce jobs.

Big Data · Distributed Computing · HDFS

0 likes · 23 min read

Youzan Coder

Aug 3, 2018 · Big Data

Youzan Data Warehouse Metadata System: From Manual Tables to Metadata‑Driven Architecture

Youzan’s data‑warehouse metadata system evolved from manually maintained tables to an automated data dictionary and finally to a metadata‑driven architecture that automatically captures technical, business, and process metadata, visualizes lineage, tracks resource usage, manages synchronization rules and permissions, and now aims to improve novice usability with visual models and impact‑analysis tools.

Big Data · Data Warehouse · Hive

0 likes · 11 min read

360 Quality & Efficiency

Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Data Warehouse · Hadoop · Hive

0 likes · 5 min read

dbaplus Community

Jun 14, 2018 · Big Data

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.

Big Data · Data Architecture · Data Lake

0 likes · 10 min read

ITPUB

Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Gartner · Hadoop

0 likes · 9 min read

dbaplus Community

Mar 20, 2018 · Big Data

How to Upgrade Hive from 0.13 to 2.1 Without Downtime: Tips, Pitfalls, and Best Practices

This article walks through a gray‑scale, controlled upgrade of Hive from version 0.13 to 2.1, covering metadata schema analysis, syntax compatibility, new Hive‑2.1 features, UDF adjustments, performance tweaks, and a step‑by‑step procedure to ensure stability and zero service interruption.

Big Data · Hive · Metadata

0 likes · 20 min read

dbaplus Community

Mar 7, 2018 · Big Data

Taming Massive HDFS Data Growth: Monitoring, Capacity Planning & Hive Optimization

The article outlines a systematic approach for large‑scale Hadoop clusters to monitor daily data growth, identify abnormal paths, manage rapid expansion, clean unused cold data, and implement capacity forecasts, while providing concrete daily and quarterly actions, Hive‑specific strategies, and practical examples to keep storage under control.

Big Data · Data Growth · HDFS

0 likes · 17 min read

37 Interactive Technology Team

Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrades of its Hadoop‑centric big‑data platform. It worked through HA cluster takeover and custom Hive 2.1 integration in a three‑phase rollout (test, gray‑scale, then production), improving management efficiency and reducing O&M costs.

Ambari · HDP · Hadoop

0 likes · 18 min read

21CTO

Sep 25, 2017 · Big Data

How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

Big Data · Data Platform · Hadoop

0 likes · 16 min read

Qunar Tech Salon

Sep 25, 2017 · Big Data

Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Big Data · Data Warehouse · Hive

0 likes · 21 min read

21CTO

Jun 9, 2017 · Big Data

From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

Big Data · Data Engineering · Hive

0 likes · 20 min read

Architecture Digest

Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Hive · Kafka

0 likes · 17 min read

MaGe Linux Operations

May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · Key-Value Store

0 likes · 9 min read

Architects' Tech Alliance

May 7, 2017 · Big Data

Building a Complete Big Data Platform: From Hadoop Basics to Real‑Time Analytics

This guide walks beginners through the entire big‑data ecosystem—explaining the 4V characteristics, listing essential open‑source components, teaching Hadoop setup, Hive and SparkSQL usage, data ingestion with Sqoop, Flume and Kafka, task scheduling with Oozie, and real‑time processing with Storm and Spark Streaming.

Big Data · Hadoop · Hive

0 likes · 20 min read

MaGe Linux Operations

May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive

0 likes · 13 min read

StarRing Big Data Open Lab

Jan 13, 2017 · Databases

How Inceptor’s Cost‑Based Optimizer Boosts SQL Performance and How to Use It

This article explains the concept of Cost‑Based Optimization (CBO) in Inceptor, details how to collect statistics with ANALYZE or the preanalyze script, shows how to enable CBO, and presents performance gains demonstrated on TPC‑DS benchmarks.

CBO · Cost-Based Optimization · Hive

0 likes · 11 min read

Java High-Performance Architecture

Oct 21, 2016 · Big Data

What Is Hive and How Does It Turn SQL into MapReduce?

This article explains Hive as a SQL‑based interface for Hadoop, shows why it simplifies large‑scale data analysis, provides practical command‑line examples for table creation, data loading, and queries, and details how HiveQL is internally converted into MapReduce jobs.

Data Warehouse · Hive · MapReduce

0 likes · 6 min read

Architect

May 6, 2016 · Big Data

Integrating Kylin, Mondrian, and Saiku to Build an OLAP Analysis Tool

This article describes how the Youzan data team combined Apache Kylin, Mondrian, and Saiku into a three‑layer OLAP system, covering background, component overviews, technical architecture, schema integration challenges, count‑distinct handling, Kylin‑specific SQL quirks, and practical solutions.

Big Data · HBase · Hive

0 likes · 12 min read

ITPUB

Apr 24, 2016 · Big Data

12 Essential Hive Performance Tips for Faster Hadoop Queries

This guide presents twelve practical Hive tuning techniques—including avoiding MapReduce, limiting string concatenation, steering clear of subqueries, choosing the right file formats, managing vectorization, sizing containers, enabling statistics, and optimizing joins—to dramatically improve query speed on Hadoop.

Big Data · Hadoop · Hive

0 likes · 7 min read

21CTO

Feb 1, 2016 · Big Data

How Solr Supercharges Real‑Time Queries in Big Data Environments

This article examines a real‑world case from Alibaba’s Taobao Jushita platform, showing how traditional SQL queries struggle with multi‑dimensional, high‑volume data and how integrating Solr’s inverted‑index search engine—combined with Hive‑generated wide tables and custom QParser plugins—delivers millisecond‑level, scalable query performance for buyer analytics.

Big Data · Hive · Real-time Query

0 likes · 11 min read

ITPUB

Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Distributed Computing · Hive

0 likes · 13 min read

Art of Distributed System Architecture Design

Jul 10, 2015 · Big Data

Improving Hive Storage Efficiency: From RCFile to ORCFile at Facebook

Facebook’s data warehouse, storing over 300 PB and growing by 600 TB daily, transitioned from the RCFile format to an optimized ORCFile implementation, achieving 5‑8× better compression and up to three‑fold faster write performance while maintaining high read efficiency.

Big Data · Facebook · Hive

0 likes · 14 min read

ITPUB

May 26, 2015 · Big Data

Step-by-Step Guide to Quickly Install and Configure Hive on Hadoop

This article provides a concise, practical walkthrough for installing and configuring Apache Hive on a Hadoop cluster, covering prerequisite HDFS and MapReduce setup, downloading Hive, extracting files, setting environment variables, configuring XML files, starting Hive, and verifying the installation with simple commands.

ETL · HQL · Hadoop

0 likes · 4 min read
