Tagged articles
178 articles
JD Cloud Developers
Dec 23, 2020 · Artificial Intelligence

How JD Cloud’s Federated Learning Platform Breaks Data Silos for AI

The article explains how JD Cloud’s federated learning platform enables secure, privacy‑preserving collaborative AI across isolated data sources by using encrypted distributed training, flexible model architectures, and a range of algorithms, while highlighting its architecture, security mechanisms, deployment speed, and real‑world industry successes.

AI · JD Cloud · data privacy
10 min read
Big Data Technology & Architecture
Dec 19, 2020 · Big Data

Apache Kylin Principles, Architecture, and Real-World Applications in Baidu Maps, Lianjia, and Didi

This article explains Apache Kylin’s core principles and technical architecture, then details how major Chinese companies such as Baidu Maps, Lianjia, and Didi have deployed Kylin for large‑scale OLAP, describing their system designs, performance results, and the challenges they encountered.

Apache Kylin · Cube · Data Warehouse
16 min read
Architect
Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Hadoop · Spark · distributed computing
13 min read
Meituan Technology Team
Nov 26, 2020 · Artificial Intelligence

Meituan Autonomous Vehicle Engine: Architecture, Challenges, and Resource Optimization

Meituan’s autonomous vehicle engine abstracts communication, scheduling, data handling, and tooling into three layers to ensure deterministic behavior across on‑vehicle and simulation environments, tackling consistency, scheduling, and resource‑utilization challenges by using unified computation models, distributed graph deployment, caching, and remote model serving, thereby accelerating autonomous delivery vehicle development.

AI · Resource Optimization · Scheduling
14 min read
Tencent Cloud Developer
Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG scheduler
17 min read
JD Tech Talk
Oct 30, 2020 · Cloud Computing

Federated Learning, Edge Computing, and Cloud Computing: Concepts, Applications, and Comparative Analysis

This article introduces federated learning, edge computing, and cloud computing, explains each technology's principles and use cases, and then compares their similarities and differences, highlighting privacy‑preserving collaborative modeling, near‑source processing, and centralized resource provisioning.

Comparison · Edge Computing · Federated Learning
8 min read
Tencent Cloud Developer
Oct 19, 2020 · Big Data

Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR

By parallelizing Spark’s driver‑side commit, trash, and move phases—previously single‑threaded operations that caused costly copy‑on‑rename when writing massive files to object storage—the Tencent Cloud EMR case achieved over a tenfold (1,100 %) speedup, making object storage a viable alternative to HDFS.

Big Data · EMR · Performance Optimization
8 min read
Big Data Technology Architecture
Aug 5, 2020 · Big Data

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins—including inner, outer, semi, and anti joins—by describing the overall query planning flow, the three physical join strategies (sort‑merge, broadcast, and hash), and the specific implementation details for each join type.

DataFrames · JOIN · SQL Optimization
10 min read
Big Data Technology & Architecture
Aug 3, 2020 · Big Data

Understanding Join Implementations in Spark SQL

This article explains the various join types supported by Spark SQL, describes the overall Spark SQL execution flow, and details the physical implementation processes of inner, outer, semi, anti, broadcast, sort‑merge, and hash joins, helping developers grasp how joins are executed in a distributed environment.

JOIN · dataframe · distributed computing
12 min read
Didi Tech
Jul 24, 2020 · Artificial Intelligence

DLFlow: An End-to-End Deep Learning Solution for Big Data Offline Tasks

DLFlow, an end‑to‑end framework from Didi’s user‑profile team, merges Spark and TensorFlow to automate feature preprocessing, large‑scale distributed training, and massive prediction for big‑data offline tasks, offering configuration‑driven pipelines, task scheduling, and easy deployment that dramatically speeds model development.

Deep Learning · Model Development · Spark
9 min read
Youzan Coder
Apr 1, 2020 · Big Data

Presto Implementation and Practice at YouZan: A Big Data Query Engine Journey

The article outlines Presto’s high‑performance, coordinator‑worker architecture and query flow, describes YouZan’s migration from mixed Hadoop deployment to dedicated low‑latency clusters, details challenges such as small‑file handling and regex backtracking with their fixes, and previews future enhancements like Alluxio integration, session property managers, and Ranger‑based multi‑tenant isolation.

Facebook · HDFS · Performance Optimization
14 min read
58 Tech
Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Risk Detection · Spark
8 min read
Architects' Tech Alliance
Feb 23, 2020 · Cloud Computing

Edge Computing and Its Relationship with 5G: Concepts, Value, Applications, and Future Outlook

This article explains edge computing, its distributed architecture, key advantages such as higher security, lower latency and reduced bandwidth costs, explores major application scenarios like smart manufacturing and autonomous driving, and analyzes how 5G both drives and benefits from edge computing development.

5G · Edge Computing · IoT
11 min read
DataFunTalk
Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · PySpark
13 min read
Big Data Technology & Architecture
Aug 3, 2019 · Big Data

Understanding SparkEnv Initialization: Components and Their Setup

This article walks through the SparkEnv initialization process in Apache Spark, detailing how the driver and executor environments are created, how key components such as SecurityManager, RpcEnv, SerializerManager, BroadcastManager, MapOutputTracker, ShuffleManager, MemoryManager, BlockManager, MetricsSystem, and OutputCommitCoordinator are instantiated, and how the final SparkEnv instance is assembled and stored.

Big Data · Scala · Spark
13 min read
Tencent Cloud Developer
Jul 18, 2019 · Big Data

Tencent iData Analysis Center: Why We Chose Spark as Our Computing Platform

Tencent’s iData analysis center selected Spark as its new computing platform because, unlike ElasticSearch, TiDB, and other MPP solutions, Spark offers iterative processing, shuffle support, robust SQL and DAG scheduling, and flexible SMP‑style data exchange, enabling efficient OLAP on billions of game‑user records.

Big Data · Data Platform · MPP
13 min read
Sohu Tech Products
Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Shuffle · Shuffle Writer
13 min read
Alibaba Cloud Developer
Jan 18, 2019 · Artificial Intelligence

How Alibaba’s Open‑Source Euler Framework Powers Large‑Scale Graph Deep Learning

Euler, Alibaba's newly open‑sourced graph deep‑learning framework, combines distributed graph processing with neural network training to handle billions of nodes and edges, supports heterogeneous graphs, offers built‑in algorithms, and has already boosted advertising, fraud detection, and other industry applications.

AI Infrastructure · Euler framework · distributed computing
11 min read
JD Tech
Jan 11, 2019 · Big Data

Spark Memory Management and Tuning Practices for Large-Scale Billing Systems

This article explains how Spark's memory management models and configuration parameters can be tuned to handle massive billing data efficiently, covering StaticMemoryManager vs UnifiedMemoryManager, storage and shuffle memory fractions, common OOM and file‑not‑found issues, and practical performance‑optimisation tips.

Memory Management · Spark · distributed computing
9 min read
Architects Research Society
Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache · Big Data · Flink
22 min read
Alibaba Cloud Developer
Dec 18, 2018 · Databases

Inside Alibaba AnalyticDB: Architecture, Core Technologies, and Real‑Time Data Warehouse Innovations

This article provides an in‑depth technical overview of Alibaba's AnalyticDB, covering the challenges of massive real‑time analytics, the cloud‑native multi‑tenant architecture, data model, import/export capabilities, high‑performance SQL parser, the Xuanwu storage engine, Xihe compute engine, optimizer, GPU acceleration, and elastic scaling features.

AnalyticDB · GPU Acceleration · SQL Parser
38 min read
Tencent Cloud Developer
Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Data Lake · Data Warehouse
30 min read
Java Backend Technology
Oct 13, 2018 · Big Data

Check a New Integer Among 4 Billion Records in Seconds Using Bitmap & Distributed Methods

An interviewee faces the challenge of determining whether a newly given integer exists within a set of 4 billion numbers, and the article explores efficient solutions—from naive disk‑I/O approaches to distributed processing and the memory‑saving bitmap technique—highlighting their performance trade‑offs and implementation details.

Big Data · Bitmap · algorithm
6 min read
Architects' Tech Alliance
Oct 9, 2018 · Fundamentals

Parallel Computing vs Distributed Computing: Concepts, Principles, and Differences

This article explains the concepts, principles, and key distinctions between parallel computing and distributed computing, describing their objectives, basic conditions, advantages, and typical use cases within high‑performance computing, and highlights how they differ from grid and cloud computing.

HPC · computing fundamentals · distributed computing
6 min read
DataFunTalk
Aug 14, 2018 · Artificial Intelligence

Machine Learning and Deep Learning Engineering Practices at Ping An Life

The article summarizes senior AI expert Wu Jianjun’s presentation on machine‑learning and deep‑learning engineering at Ping An Life, detailing the company’s big‑data platform, data processing pipelines, model training frameworks, distributed computing strategies, and production model‑serving architecture for financial applications.

Deep Learning · Model Serving · distributed computing
15 min read
Alibaba Cloud Developer
Jul 23, 2018 · Big Data

How Alibaba’s MaxCompute Became the Backbone of 99% Data Processing

This article reviews Alibaba's MaxCompute evolution from ODPS to a unified, multi‑cluster big‑data platform, detailing its architecture, development tools, large‑scale deployments, performance optimizations, typical workload scenarios, and why it is the preferred choice for enterprise data processing.

Alibaba Cloud · Big Data · Data Platform
22 min read
High Availability Architecture
May 21, 2018 · Big Data

Interview with Baidu’s Chief Big Data Architect Ma Ruyue on OLAP, HTAP, and Emerging Big Data Technologies

In this interview, Baidu’s senior big‑data architect Ma Ruyue discusses his career transition from Hadoop to online databases, the design philosophy behind Baidu’s Palo ROLAP system, the future of HTAP, and his views on the evolving big‑data ecosystem including Spark, AI, and containerization.

Data Architecture · HTAP · OLAP
11 min read
21CTO
May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Hadoop · Shuffle · YARN
12 min read
Meituan Technology Team
Mar 22, 2018 · Big Data

High-Performance User Behavior Analysis Solution for Massive Data

The paper describes a high‑performance user‑behavior analysis system that processes hundreds of billions of daily logs for Meituan‑Dianping, using an inverted‑index structure with bitmap UUID sets and timestamp sequences, combined with Spark, Spring and Alluxio optimizations to cut query times from hours to under five seconds.

Big Data · OLAP analysis · distributed computing
14 min read
Architecture Digest
Feb 28, 2018 · Blockchain

Blockchain Infrastructure Landscape: A First‑Principles Framework

This article presents a first‑principles framework that categorizes blockchain infrastructure components—storage, computation, and communication—by mapping them to concrete projects such as Ethereum, IPFS, BigchainDB, and others, illustrating how these modules interoperate to build efficient decentralized applications.

Blockchain · Infrastructure · decentralized storage
21 min read
vivo Internet Technology
Jan 31, 2018 · Big Data

Predicate Pushdown Rules in SparkSql Inner Join Queries

SparkSql optimizes inner‑join queries by pushing predicates to the scan phase, allowing filters connected with AND to be applied before the join without changing results, while OR‑connected filters can be unsafe except when they involve the join key or partitioned tables which use partition pruning.

JOIN optimization · Predicate Pushdown · SQL Optimization
10 min read
dbaplus Community
Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Hadoop · Java
28 min read
Java High-Performance Architecture
Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Hadoop · MapReduce · Tutorial
3 min read
ITPUB
Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · MapReduce · RDD
25 min read
Huawei Cloud Developer Alliance
Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · HDFS · Hadoop
18 min read
Architecture Digest
Jan 24, 2017 · Artificial Intelligence

TensorFlow: Large‑Scale Machine Learning on Heterogeneous Distributed Systems – Overview and Implementation

TensorFlow is a dataflow‑based programming model for large‑scale machine learning that uses directed acyclic graphs to represent computations, supports single‑device, multi‑device, and distributed execution with sophisticated node placement, communication, fault‑tolerance, and optimization techniques, and provides tools such as TensorBoard for visualization.

Dataflow Graph · Parallelism · TensorFlow
13 min read
Architecture Digest
Nov 16, 2016 · Big Data

A Decade of Hadoop: History, Architecture, Ecosystem, and Future Outlook

This article chronicles Hadoop’s ten‑year evolution from its early HDFS and MapReduce roots to a mature big‑data platform, detailing its historical milestones, architectural layers, ecosystem components, industry adoption, and future trends in storage, processing, security, and cloud integration.

Ecosystem · Hadoop · distributed computing
36 min read
StarRing Big Data Open Lab
Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Hadoop · Real-time analytics
21 min read
Meituan Technology Team
Aug 5, 2016 · Big Data

Takeaway Big Data Optimization Strategies

Meituan’s delivery team leverages massive real‑time data mining, distributed parallel optimization and a simulation platform to intelligently assign orders, boost rider efficiency, cut costs, and continuously refine algorithms, while integrating offline‑online learning and upstream‑downstream coordination to enhance overall logistics performance.

AI in delivery · Logistics Optimization · distributed computing
9 min read
Architecture Digest
Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins from Google’s pioneering papers, explains its architecture and ecosystem, evaluates its strengths such as scalability and benchmarks, discusses current limitations like single‑point failures and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Future · HDFS
14 min read
21CTO
Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL
17 min read
Architecture Digest
Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · ETL
19 min read
ITPUB
Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing and its future outlook.

Big Data · Doug Cutting · Hadoop
15 min read
ITPUB
Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Hive · SparkSQL
13 min read
21CTO
Nov 26, 2015 · Big Data

Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

Data Architecture · Hadoop · MapReduce
7 min read

Leveraging Bitmap for High‑Performance Multi‑Dimensional Analytics in Big Data

This article explains how bitmap data structures, combined with compression and in‑memory techniques, enable fast, flexible, and scalable multi‑dimensional analytics for large‑scale data platforms, addressing historical marketing inefficiencies and outlining future directions such as memory‑mapped files and distributed bitmap computation.

Bitmap · Multi-dimensional Analytics · compression
19 min read
21CTO
Aug 11, 2015 · Big Data

Understanding MapReduce Through a Pizza Sauce Analogy

The author recounts delivering a MapReduce talk, then uses a vivid pizza sauce preparation story to illustrate how mapping chops ingredients and reducing blends them, effectively explaining distributed data processing concepts to a non‑technical audience.

Analogy · MapReduce · data-processing
7 min read
Efficient Ops
Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Hadoop
21 min read
ITPUB
Oct 30, 2014 · Big Data

Inside Fourinone: A Lightweight Distributed Framework Challenging Hadoop

The interview with Fourinone founder Peng Yuan explores the framework's evolution from a parallel computing project to a 220 KB distributed system with its own NoSQL database engine CoolHash, compares it to Hadoop, and discusses its open‑source release, technical design choices, and real‑world deployments in finance and enterprise environments.

Big Data · CoolHash · Fourinone
31 min read