Tagged articles
232 articles
Beike Product & Technology
Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities (data syncing, metadata automation, and real‑time subscriptions), and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hudi
0 likes · 13 min read
dbaplus Community
Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data · DataX · ETL
0 likes · 14 min read
ITPUB
Dec 10, 2018 · Big Data

How Meituan Syncs MySQL to Hive in Real-Time Using Binlog, Canal, and Camus

This article explains Meituan's architecture for accurately and efficiently moving MySQL data into a Hive data warehouse by capturing binlog streams with Canal, transporting them via Kafka, and restoring them offline with Camus and a merge process that handles inserts, updates, and deletes.

Binlog · Kafka · hive
0 likes · 14 min read
Programmer DD
Nov 7, 2018 · Big Data

Choosing the Right SQL Engine for Big Data: A Practical Guide

This article explores various SQL engines and storage options for big‑data workloads, compares their performance and capabilities, shows practical code examples, and offers guidance on writing efficient SQL in complex data environments.

Big Data · SQL Engines · data engineering
0 likes · 6 min read
Meitu Technology
Aug 14, 2018 · Big Data

Meitu Data Platform Architecture and Practices

Meitu’s data platform, serving dozens of apps with 500 million monthly active users and billions of daily events, combines the Arachnia log‑collection system, Kafka ingestion, multi‑layer storage (HDFS, MongoDB, HBase, Elasticsearch), offline Hive/MapReduce processing and real‑time Storm/Flink/Naix pipelines, supported by data‑workshop tools, staged evolution for scalability, and robust security and query‑validation mechanisms.

Big Data · Data Platform · ETL
0 likes · 16 min read
Youzan Coder
Aug 3, 2018 · Big Data

Youzan Data Warehouse Metadata System: From Manual Tables to Metadata‑Driven Architecture

Youzan’s data‑warehouse metadata system evolved from manually maintained tables to an automated data dictionary and finally to a metadata‑driven architecture that automatically captures technical, business, and process metadata, visualizes lineage, tracks resource usage, manages synchronization rules and permissions, and now aims to improve novice usability with visual models and impact‑analysis tools.

Big Data · Lineage · Resource Monitoring
0 likes · 11 min read
360 Quality & Efficiency
Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Hadoop · MapReduce · data-warehouse
0 likes · 5 min read
ITPUB
Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Ecosystem · Gartner
0 likes · 9 min read
dbaplus Community
Mar 7, 2018 · Big Data

Taming Massive HDFS Data Growth: Monitoring, Capacity Planning & Hive Optimization

The article outlines a systematic approach for large‑scale Hadoop clusters to monitor daily data growth, identify abnormal paths, manage rapid expansion, clean unused cold data, and implement capacity forecasts, while providing concrete daily and quarterly actions, Hive‑specific strategies, and practical examples to keep storage under control.

Big Data · Data Growth · HDFS
0 likes · 17 min read
37 Interactive Technology Team
Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrades of their Hadoop‑centric big‑data platform, resolving difficulties with HA cluster takeover and custom Hive 2.1 integration through a three‑phase rollout (test, gray‑scale, and production), thereby improving management efficiency and reducing O&M costs.

Ambari · Cluster Management · HDP
0 likes · 18 min read
21CTO
Sep 25, 2017 · Big Data

How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

Big Data · Data Platform · Hadoop
0 likes · 16 min read
Qunar Tech Salon
Sep 25, 2017 · Big Data

Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Big Data · Kafka · Spark
0 likes · 21 min read
21CTO
Jun 9, 2017 · Big Data

From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

Big Data · Spark · data engineering
0 likes · 20 min read
Architecture Digest
Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Kafka · Spark
0 likes · 17 min read
MaGe Linux Operations
May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · MapReduce · Spark
0 likes · 9 min read
MaGe Linux Operations
May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · MapReduce
0 likes · 13 min read
Java High-Performance Architecture
Oct 21, 2016 · Big Data

What Is Hive and How Does It Turn SQL into MapReduce?

This article explains Hive as a SQL‑based interface for Hadoop, shows why it simplifies large‑scale data analysis, provides practical command‑line examples for table creation, data loading, and queries, and details how HiveQL is internally converted into MapReduce jobs.

MapReduce · data-warehouse · hive
0 likes · 6 min read
Architect
May 6, 2016 · Big Data

Integrating Kylin, Mondrian, and Saiku to Build an OLAP Analysis Tool

This article describes how the Youzan data team combined Apache Kylin, Mondrian, and Saiku into a three‑layer OLAP system, covering background, component overviews, technical architecture, schema integration challenges, count‑distinct handling, Kylin‑specific SQL quirks, and practical solutions.

Big Data · HBase · Kylin
0 likes · 12 min read
ITPUB
Apr 24, 2016 · Big Data

12 Essential Hive Performance Tips for Faster Hadoop Queries

This guide presents twelve practical Hive tuning techniques—including avoiding MapReduce, limiting string concatenation, steering clear of subqueries, choosing the right file formats, managing vectorization, sizing containers, enabling statistics, and optimizing joins—to dramatically improve query speed on Hadoop.

Big Data · Hadoop · hive
0 likes · 7 min read
21CTO
Feb 1, 2016 · Big Data

How Solr Supercharges Real‑Time Queries in Big Data Environments

This article examines a real‑world case from Alibaba’s Taobao Jushita platform, showing how traditional SQL queries struggle with multi‑dimensional, high‑volume data and how integrating Solr’s inverted‑index search engine—combined with Hive‑generated wide tables and custom QParser plugins—delivers millisecond‑level, scalable query performance for buyer analytics.

Big Data · Real-time Query · Solr
0 likes · 11 min read
ITPUB
Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · SparkSQL · distributed computing
0 likes · 13 min read
ITPUB
May 26, 2015 · Big Data

Step-by-Step Guide to Quickly Install and Configure Hive on Hadoop

This article provides a concise, practical walkthrough for installing and configuring Apache Hive on a Hadoop cluster, covering prerequisite HDFS and MapReduce setup, downloading Hive, extracting files, setting environment variables, configuring XML files, starting Hive, and verifying the installation with simple commands.

Configuration · ETL · HQL
0 likes · 4 min read