Tagged articles

178 articles

Mar 23, 2021 · Big Data

From MapReduce to Ray: The Evolution of Big Data Computing Engines and Career Opportunities

This article traces the history of big‑data computing engines—from early MapReduce and Hadoop through Spark, Storm, Flink, and the newer Ray—explaining their technical advances, real‑world applications in AI and finance, and why graduates should consider a career in this rapidly evolving field.

AI · Big Data · Ray

0 likes · 16 min read

Python Programming Learning Circle

Feb 25, 2021 · Big Data

Parallel Computing and Python Multiprocessing: Concepts, Models, and Practical Examples

This article explains the fundamentals of parallel computing in the big‑data era, compares parallelism and concurrency, outlines GPU and distributed‑computing solutions, and provides a detailed guide to Python’s multiprocessing module with code examples, performance tests, and practical tips.

Big Data · GPU · Python

0 likes · 18 min read

Full-Stack Internet Architecture

Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · HDFS · Hadoop

0 likes · 33 min read

JD Cloud Developers

Dec 23, 2020 · Artificial Intelligence

How JD Cloud’s Federated Learning Platform Breaks Data Silos for AI

The article explains how JD Cloud’s federated learning platform enables secure, privacy‑preserving collaborative AI across isolated data sources by using encrypted distributed training, flexible model architectures, and a range of algorithms, while highlighting its architecture, security mechanisms, deployment speed, and real‑world industry successes.

AI · JD Cloud · data privacy

0 likes · 10 min read

Big Data Technology & Architecture

Dec 19, 2020 · Big Data

Apache Kylin Principles, Architecture, and Real-World Applications in Baidu Maps, Lianjia, and Didi

This article explains Apache Kylin’s core principles and technical architecture, then details how major Chinese companies such as Baidu Maps, Lianjia, and Didi have deployed Kylin for large‑scale OLAP, describing their system designs, performance results, and the challenges they encountered.

Apache Kylin · Cube · Data Warehouse

0 likes · 16 min read

Architect

Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Hadoop · Spark · distributed computing

0 likes · 13 min read

Big Data Technology & Architecture

Dec 6, 2020 · Big Data

Integrating Spark with MongoDB: Architecture, Use Cases, and Code Samples

This article explains how Spark can be combined with MongoDB for large‑scale data processing, covering Spark fundamentals, comparisons with HDFS, practical integration patterns, performance benefits, real‑world case studies, and detailed code examples for deployment and analytics.

Data Integration · MongoDB · Performance Optimization

0 likes · 18 min read

Meituan Technology Team

Nov 26, 2020 · Artificial Intelligence

Meituan Autonomous Vehicle Engine: Architecture, Challenges, and Resource Optimization

Meituan’s autonomous vehicle engine abstracts communication, scheduling, data handling, and tooling into three layers to ensure deterministic behavior across on‑vehicle and simulation environments, tackling consistency, scheduling, and resource‑utilization challenges by using unified computation models, distributed graph deployment, caching, and remote model serving, thereby accelerating autonomous delivery vehicle development.

AI · Resource Optimization · Scheduling

0 likes · 14 min read

Tencent Cloud Developer

Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG scheduler

0 likes · 17 min read

JD Tech Talk

Oct 30, 2020 · Cloud Computing

Federated Learning, Edge Computing, and Cloud Computing: Concepts, Applications, and Comparative Analysis

This article introduces federated learning, edge computing, and cloud computing, explains each technology's principles and use cases, and then compares their similarities and differences, highlighting privacy‑preserving collaborative modeling, near‑source processing, and centralized resource provisioning.

Comparison · Edge Computing · Federated Learning

0 likes · 8 min read

Tencent Cloud Developer

Oct 19, 2020 · Big Data

Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR

By parallelizing the driver‑side commit, trash, and move phases of a Spark job, which previously ran single‑threaded and incurred costly copy‑on‑rename when writing massive files to object storage, the Tencent Cloud EMR case achieved a more than tenfold (1,100%) speedup, making object storage a viable alternative to HDFS.

Big Data · EMR · Performance Optimization

0 likes · 8 min read

Big Data Technology Architecture

Aug 5, 2020 · Big Data

Understanding Join Execution in Spark SQL

This article explains how Spark SQL processes joins—including inner, outer, semi, and anti joins—by describing the overall query planning flow, the three physical join strategies (sort‑merge, broadcast, and hash), and the specific implementation details for each join type.

DataFrames · JOIN · SQL Optimization

0 likes · 10 min read

Big Data Technology & Architecture

Aug 3, 2020 · Big Data

Understanding Join Implementations in Spark SQL

This article explains the various join types supported by Spark SQL, describes the overall Spark SQL execution flow, and details the physical implementation processes of inner, outer, semi, anti, broadcast, sort‑merge, and hash joins, helping developers grasp how joins are executed in a distributed environment.

JOIN · dataframe · distributed computing

0 likes · 12 min read

Didi Tech

Jul 24, 2020 · Artificial Intelligence

DLFlow: An End-to-End Deep Learning Solution for Big Data Offline Tasks

DLFlow, an end‑to‑end framework from Didi’s user‑profile team, merges Spark and TensorFlow to automate feature preprocessing, large‑scale distributed training, and massive prediction for big‑data offline tasks, offering configuration‑driven pipelines, task scheduling, and easy deployment that dramatically speeds model development.

Deep Learning · Model Development · Spark

0 likes · 9 min read

Youzan Coder

Apr 1, 2020 · Big Data

Presto Implementation and Practice at YouZan: A Big Data Query Engine Journey

The article outlines Presto’s high‑performance, coordinator‑worker architecture and query flow, describes YouZan’s migration from mixed Hadoop deployment to dedicated low‑latency clusters, details challenges such as small‑file handling and regex backtracking with their fixes, and previews future enhancements like Alluxio integration, session property managers, and Ranger‑based multi‑tenant isolation.

Facebook · HDFS · Performance Optimization

0 likes · 14 min read

58 Tech

Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Risk Detection · Spark

0 likes · 8 min read

Architecture Digest

Mar 11, 2020 · Big Data

Apache Flink: Unified Stream and Batch Processing Architecture and Core Concepts

This article provides a comprehensive overview of Apache Flink, explaining how it unifies stream and batch processing on a single runtime, detailing its key features, APIs, libraries, architectural components, fault‑tolerance mechanisms, scheduling, iterative processing, and back‑pressure monitoring.

Apache Flink · Batch Processing · backpressure

0 likes · 20 min read

Architects' Tech Alliance

Feb 23, 2020 · Cloud Computing

Edge Computing and Its Relationship with 5G: Concepts, Value, Applications, and Future Outlook

This article explains edge computing, its distributed architecture, key advantages such as higher security, lower latency and reduced bandwidth costs, explores major application scenarios like smart manufacturing and autonomous driving, and analyzes how 5G both drives and benefits from edge computing development.

5G · Edge Computing · IoT

0 likes · 11 min read

Alibaba Cloud Developer

Feb 1, 2020 · Artificial Intelligence

How Alibaba’s AI‑Driven Genome Platform Slashes COVID‑19 Test Time to 30 Minutes

Alibaba’s DAMO Academy and Zhejiang CDC have built an AI‑powered, fully automated whole‑genome sequencing platform that reduces COVID‑19 sample analysis from several hours to about half an hour, delivering rapid, accurate detection of viral mutations and supporting vaccine and drug research.

AI · COVID-19 · bioinformatics

0 likes · 6 min read

DataFunTalk

Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · PySpark

0 likes · 13 min read

AntTech

Dec 4, 2019 · Artificial Intelligence

Ant Financial’s Online Learning System Built on Ray: Architecture, Challenges, and Future Plans

The interview details how Ant Financial transitioned from offline to online machine learning by adopting the Ray distributed engine, describing their open architecture, fusion computing approach, technical advantages, encountered pitfalls, and plans to open‑source the system for broader AI and big‑data use.

AI · Ant Financial · Big Data

0 likes · 15 min read

Architects' Tech Alliance

Sep 26, 2019 · Fundamentals

Parallel Computing vs Distributed Computing: Concepts, Principles, and Differences

This article explains the fundamentals of parallel and distributed computing, their definitions, core principles, advantages, required conditions, and key differences, highlighting how each approach tackles large‑scale tasks within high‑performance computing environments.

HPC · High‑performance computing · computing fundamentals

0 likes · 6 min read

Architects' Tech Alliance

Sep 10, 2019 · Cloud Computing

Understanding Edge Computing: Concepts, Benefits, and Deployment Scenarios

Edge computing moves data collection and analysis closer to the source, reducing latency, bandwidth costs, and privacy risks, and is increasingly essential for IoT, industrial automation, autonomous vehicles, smart cities, and other latency‑sensitive applications.

IoT · Low latency · Real-time analytics

0 likes · 15 min read

Big Data Technology & Architecture

Aug 3, 2019 · Big Data

Understanding SparkEnv Initialization: Components and Their Setup

This article walks through the SparkEnv initialization process in Apache Spark, detailing how the driver and executor environments are created, how key components such as SecurityManager, RpcEnv, SerializerManager, BroadcastManager, MapOutputTracker, ShuffleManager, MemoryManager, BlockManager, MetricsSystem, and OutputCommitCoordinator are instantiated, and how the final SparkEnv instance is assembled and stored.

Big Data · Scala · Spark

0 likes · 13 min read

Tencent Cloud Developer

Jul 18, 2019 · Big Data

Tencent iData Analysis Center: Why We Chose Spark as Our Computing Platform

Tencent’s iData analysis center selected Spark as its new computing platform because, unlike ElasticSearch, TiDB, and other MPP solutions, Spark offers iterative processing, shuffle support, robust SQL and DAG scheduling, and flexible SMP‑style data exchange, enabling efficient OLAP on billions of game‑user records.

Big Data · Data Platform · MPP

0 likes · 13 min read

Big Data Technology & Architecture

Apr 7, 2019 · Big Data

Understanding YARN: Background, Architecture, and Execution Process

This article explains why YARN was created to overcome the limitations of MapReduce 1.x, describes its architecture—including ResourceManager, NodeManager, ApplicationMaster, Container, and Client—and outlines the step‑by‑step execution flow that enables multiple computation frameworks to run on Hadoop.

Big Data · Hadoop · YARN

0 likes · 11 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Hadoop · MapReduce

0 likes · 10 min read

Big Data Technology & Architecture

Mar 2, 2019 · Big Data

Understanding and Using Broadcast Variables in Apache Flink

This article explains the concept, usage, precautions, and a practical example of broadcast variables in Apache Flink, illustrating how to initialize, broadcast, retrieve, and apply shared data across parallel operators with Java code snippets.

Big Data · Broadcast Variable · Flink

0 likes · 4 min read

Sohu Tech Products

Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Shuffle · Shuffle Writer

0 likes · 13 min read

Alibaba Cloud Developer

Jan 18, 2019 · Artificial Intelligence

How Alibaba’s Open‑Source Euler Framework Powers Large‑Scale Graph Deep Learning

Euler, Alibaba's newly open‑sourced graph deep‑learning framework, combines distributed graph processing with neural network training to handle billions of nodes and edges, supports heterogeneous graphs, offers built‑in algorithms, and has already boosted advertising, fraud detection, and other industry applications.

AI Infrastructure · Euler framework · distributed computing

0 likes · 11 min read

Alibaba Cloud Developer

Jan 17, 2019 · Artificial Intelligence

How Alibaba’s Mars Engine Brings Tensor‑Based Scientific Computing to Distributed Scale

Alibaba’s open‑source Mars engine extends NumPy‑style tensor operations to distributed environments, leveraging GPU acceleration, sparse matrices, and flexible scheduling to dramatically boost scientific and AI workloads beyond single‑machine limits.

GPU Acceleration · Mars · Tensor

0 likes · 10 min read

JD Tech

Jan 11, 2019 · Big Data

Spark Memory Management and Tuning Practices for Large-Scale Billing Systems

This article explains how Spark's memory management models and configuration parameters can be tuned to handle massive billing data efficiently, covering StaticMemoryManager vs UnifiedMemoryManager, storage and shuffle memory fractions, common OOM and file‑not‑found issues, and practical performance‑optimisation tips.

Memory Management · Spark · distributed computing

0 likes · 9 min read

Architects Research Society

Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache · Big Data · Flink

0 likes · 22 min read

Alibaba Cloud Developer

Dec 18, 2018 · Databases

Inside Alibaba AnalyticDB: Architecture, Core Technologies, and Real‑Time Data Warehouse Innovations

This article provides an in‑depth technical overview of Alibaba's AnalyticDB, covering the challenges of massive real‑time analytics, the cloud‑native multi‑tenant architecture, data model, import/export capabilities, high‑performance SQL parser, the Xuanwu storage engine, Xihe compute engine, optimizer, GPU acceleration, and elastic scaling features.

AnalyticDB · GPU Acceleration · SQL Parser

0 likes · 38 min read

Tencent Cloud Developer

Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Data Lake · Data Warehouse

0 likes · 30 min read

Java Backend Technology

Oct 13, 2018 · Big Data

Check a New Integer Among 4 Billion Records in Seconds Using Bitmap & Distributed Methods

An interviewee faces the challenge of determining whether a newly given integer exists within a set of 4 billion numbers, and the article explores efficient solutions—from naive disk‑I/O approaches to distributed processing and the memory‑saving bitmap technique—highlighting their performance trade‑offs and implementation details.

Big Data · Bitmap · algorithm

0 likes · 6 min read

Architects' Tech Alliance

Oct 9, 2018 · Fundamentals

Parallel Computing vs Distributed Computing: Concepts, Principles, and Differences

This article explains the concepts, principles, and key distinctions between parallel computing and distributed computing, describing their objectives, basic conditions, advantages, and typical use cases within high‑performance computing, and highlights how they differ from grid and cloud computing.

HPC · computing fundamentals · distributed computing

0 likes · 6 min read

Senior Brother's Insights

Aug 31, 2018 · Big Data

How to Test Membership in 4 Billion Integers with Bitmap and Distributed Techniques

An interview question about checking whether a new integer belongs to a set of 4 billion numbers leads to a discussion of distributed loading across eight machines, bitmap representation using 500 MB of memory, and interval‑based external sorting, illustrating practical big‑data algorithm design.

Big Data · Bitmap · Data Structures

0 likes · 7 min read

DataFunTalk

Aug 14, 2018 · Artificial Intelligence

Machine Learning and Deep Learning Engineering Practices at Ping An Life

The article summarizes senior AI expert Wu Jianjun’s presentation on machine‑learning and deep‑learning engineering at Ping An Life, detailing the company’s big‑data platform, data processing pipelines, model training frameworks, distributed computing strategies, and production model‑serving architecture for financial applications.

Deep Learning · Model Serving · distributed computing

0 likes · 15 min read

dbaplus Community

Aug 6, 2018 · Big Data

Understanding RAID, HDFS, and MapReduce: From Storage to Distributed Computing

This article explains the storage challenges of big data, introduces RAID levels and their trade‑offs, describes the HDFS architecture with NameNode and DataNode replication, details the MapReduce programming model and execution flow, and shows how Hive translates SQL queries into MapReduce jobs.

Big Data · HDFS · Hive

0 likes · 23 min read

Big Data and Microservices

Jul 24, 2018 · Big Data

Why Hadoop Still Leads Big Data Processing: Core Advantages Explained

This article introduces Hadoop’s open‑source big‑data framework, explains its core components HDFS and MapReduce, and outlines four key advantages—ease of deployment, robustness, scalability, and simplicity—while also covering HBase as the Hadoop‑based column‑oriented database.

Big Data · HBase · HDFS

0 likes · 4 min read

Alibaba Cloud Developer

Jul 23, 2018 · Big Data

How Alibaba’s MaxCompute Became the Backbone of 99% Data Processing

This article reviews Alibaba's MaxCompute evolution from ODPS to a unified, multi‑cluster big‑data platform, detailing its architecture, development tools, large‑scale deployments, performance optimizations, typical workload scenarios, and why it is the preferred choice for enterprise data processing.

Alibaba Cloud · Big Data · Data Platform

0 likes · 22 min read

dbaplus Community

May 23, 2018 · Big Data

Understanding MapReduce: A Simple Analogy to Master Big Data Distributed Computing

This article uses a human‑computer analogy and a playing‑card counting example to explain the fundamentals of distributed computing, why single machines cannot handle massive data, and how the MapReduce model’s four steps—split, transform, shuffle, and merge—solve big‑data problems.

Big Data · MapReduce · data-processing

0 likes · 15 min read

High Availability Architecture

May 21, 2018 · Big Data

Interview with Baidu’s Chief Big Data Architect Ma Ruyue on OLAP, HTAP, and Emerging Big Data Technologies

In this interview, Baidu’s senior big‑data architect Ma Ruyue discusses his career transition from Hadoop to online databases, the design philosophy behind Baidu’s Palo ROLAP system, the future of HTAP, and his views on the evolving big‑data ecosystem including Spark, AI, and containerization.

Data Architecture · HTAP · OLAP

0 likes · 11 min read

21CTO

May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Hadoop · Shuffle · YARN

0 likes · 12 min read

Architects Research Society

Apr 6, 2018 · Blockchain

Understanding Ethereum: Decentralizing the Internet and Building a Global Computer

The article explains how Ethereum aims to replace centralized cloud services with a decentralized network of nodes, giving users control over their data and creating a global, democratic computing platform, while also noting the challenges of security, usefulness, and scalability.

Blockchain · Ethereum · Web3

0 likes · 6 min read

Meituan Technology Team

Mar 22, 2018 · Big Data

High-Performance User Behavior Analysis Solution for Massive Data

The paper describes a high‑performance user‑behavior analysis system that processes hundreds of billions of daily logs for Meituan‑Dianping, using an inverted‑index structure with bitmap UUID sets and timestamp sequences, combined with Spark, Spring and Alluxio optimizations to cut query times from hours to under five seconds.

Big Data · OLAP analysis · distributed computing

0 likes · 14 min read

Architecture Digest

Feb 28, 2018 · Blockchain

Blockchain Infrastructure Landscape: A First‑Principles Framework

This article presents a first‑principles framework that categorizes blockchain infrastructure components—storage, computation, and communication—by mapping them to concrete projects such as Ethereum, IPFS, BigchainDB, and others, illustrating how these modules interoperate to build efficient decentralized applications.

Blockchain · Infrastructure · decentralized storage

0 likes · 21 min read

vivo Internet Technology

Jan 31, 2018 · Big Data

Predicate Pushdown Rules in SparkSql Inner Join Queries

Spark SQL optimizes inner‑join queries by pushing predicates down to the scan phase: filters connected with AND can be applied before the join without changing the result, while OR‑connected filters can be unsafe to push down unless they involve the join key or a partitioned table, where partition pruning applies.

JOIN optimization · Predicate Pushdown · SQL Optimization

0 likes · 10 min read

vivo Internet Technology

Nov 3, 2017 · Artificial Intelligence

Integrating Distributed TensorFlow with Kubernetes: Architecture and Deployment

The article explains how to combine Distributed TensorFlow with Kubernetes—using GlusterFS storage, Deployments for parameter servers, Jobs for workers, service discovery, monitoring, and a Jinja2‑generated YAML template—to create isolated, scalable training clusters with Jupyter and TensorBoard access.

DevOps · GlusterFS · Kubernetes

0 likes · 12 min read

dbaplus Community

Sep 20, 2017 · Big Data

Scaling TB‑Level Price Computations with Apache Spark: Suning’s Architecture and Optimizations

This article details how Suning built a Hadoop‑based big data platform and leveraged Apache Spark to process terabytes of product price and inventory data, describing the system architecture, four key technical practices, performance results, and future data‑lake directions.

Apache Spark · DataFrames · ETL

0 likes · 12 min read

Java Backend Technology

Aug 24, 2017 · Big Data

Step-by-Step Guide to Building a 3-Node Apache Storm Cluster on CentOS

This tutorial walks you through setting up a three‑node Apache Storm cluster on CentOS 6.9, covering hostname configuration, firewall disabling, Zookeeper preparation, Storm installation, startup of Nimbus, UI and supervisors, and finally submitting a sample topology.

Apache Storm · CentOS · Cluster Setup

0 likes · 8 min read

37 Interactive Technology Team

Jun 13, 2017 · Big Data

MapReduce Principles and Hadoop Execution Process with WordCount Example

The article explains MapReduce’s divide‑and‑conquer model and Hadoop’s execution pipeline—including map, partition, spill, merge, shuffle, and reduce phases—illustrated with a WordCount example that shows how mappers emit word‑1 pairs and reducers aggregate counts to produce final frequencies on HDFS.

Hadoop · MapReduce · Shuffle

0 likes · 7 min read

dbaplus Community

Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Hadoop · Java

0 likes · 28 min read

360 Quality & Efficiency

Apr 24, 2017 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and Common Commands

This article introduces Hadoop as a widely used big‑data framework, explains its core components HDFS and MapReduce, describes the cluster node roles, presents typical command‑line usage and a sample MapReduce workflow, and offers guidance for further learning.

HDFS · Hadoop · MapReduce

0 likes · 5 min read

Java High-Performance Architecture

Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Hadoop · MapReduce · Tutorial

0 likes · 3 min read

ITPUB

Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · MapReduce · RDD

0 likes · 25 min read

Huawei Cloud Developer Alliance

Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · HDFS · Hadoop

0 likes · 18 min read

Architecture Digest

Jan 24, 2017 · Artificial Intelligence

TensorFlow: Large‑Scale Machine Learning on Heterogeneous Distributed Systems – Overview and Implementation

TensorFlow is a dataflow‑based programming model for large‑scale machine learning that uses directed acyclic graphs to represent computations, supports single‑device, multi‑device, and distributed execution with sophisticated node placement, communication, fault‑tolerance, and optimization techniques, and provides tools such as TensorBoard for visualization.

Dataflow Graph · Parallelism · TensorFlow

0 likes · 13 min read

Art of Distributed System Architecture Design

Dec 31, 2016 · Big Data

Understanding Hadoop: Architecture, HDFS, and MapReduce

This article explains Hadoop as an Apache‑managed open‑source platform for storing massive data on distributed clusters and running robust, efficient analytics via its two core components—HDFS for storage and the Java‑based MapReduce framework for processing—highlighting modularity, high availability, and common tooling.

HDFS · Hadoop · MapReduce

0 likes · 6 min read

Java High-Performance Architecture

Dec 13, 2016 · Big Data

What Is Apache Beam and How Does It Simplify Distributed Data Processing?

Apache Beam is an open‑source, unified programming model for distributed data processing that lets developers write pipelines once and run them on multiple execution engines such as Spark, Flink, or Dataflow, simplifying code reuse and easing migration between frameworks.

Apache Beam · Java · Spark

0 likes · 5 min read

Architecture Digest

Nov 16, 2016 · Big Data

A Decade of Hadoop: History, Architecture, Ecosystem, and Future Outlook

This article chronicles Hadoop’s ten‑year evolution from its early HDFS and MapReduce roots to a mature big‑data platform, detailing its historical milestones, architectural layers, ecosystem components, industry adoption, and future trends in storage, processing, security, and cloud integration.

Ecosystem · Hadoop · distributed computing

0 likes · 36 min read

StarRing Big Data Open Lab

Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Hadoop · Real-time analytics

0 likes · 21 min read

Meituan Technology Team

Aug 5, 2016 · Big Data

Takeaway Big Data Optimization Strategies

Meituan’s delivery team leverages massive real‑time data mining, distributed parallel optimization and a simulation platform to intelligently assign orders, boost rider efficiency, cut costs, and continuously refine algorithms, while integrating offline‑online learning and upstream‑downstream coordination to enhance overall logistics performance.

AI in delivery · Logistics Optimization · distributed computing

0 likes · 9 min read

Architecture Digest

Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins in Google’s pioneering papers, explains its architecture and ecosystem, evaluates strengths such as scalability and benchmark results, discusses current limitations such as single points of failure and a complex programming model, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Future · HDFS

0 likes · 14 min read

Architecture Digest

May 4, 2016 · Big Data

Upgrading Spark from 1.4.1 to 1.6.1: Memory, Storage, and Operational Challenges

The article details the author’s experience upgrading a production Spark cluster from version 1.4.1 to 1.6.1, exposing problems with memory spills, unified memory management, BlockManager deadlocks, YARN kills, UI quirks, and Spark SQL compatibility, and proposes concrete code‑level fixes for each problem.

Big Data · Memory Management · Shuffle

0 likes · 14 min read

21CTO

Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL

0 likes · 17 min read

Architecture Digest

Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · RDD · Spark

0 likes · 14 min read

Architecture Digest

Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · ETL

0 likes · 19 min read

ITPUB

Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing and its future outlook.

Big Data · Doug Cutting · Hadoop

0 likes · 15 min read

ITPUB

Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Hive · SparkSQL

0 likes · 13 min read

Qunar Tech Salon

Dec 13, 2015 · Big Data

Introduction to Distributed Computing: Sharding, Message Queues, Hadoop and MapReduce

This article explains the fundamentals of distributed computing, covering sharding algorithms, message‑queue based task distribution, an overview of Hadoop and its MapReduce model, and the characteristics of offline batch processing for large‑scale data workloads.

Hadoop · MapReduce · Message Queue

0 likes · 11 min read

21CTO

Nov 26, 2015 · Big Data

Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

Data Architecture · Hadoop · MapReduce

0 likes · 7 min read

Art of Distributed System Architecture Design

Aug 22, 2015 · Big Data

Leveraging Bitmap for High‑Performance Multi‑Dimensional Analytics in Big Data

This article explains how bitmap data structures, combined with compression and in‑memory techniques, enable fast, flexible, and scalable multi‑dimensional analytics for large‑scale data platforms, addressing historical marketing inefficiencies and outlining future directions such as memory‑mapped files and distributed bitmap computation.
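To make the bitmap idea in this summary concrete, here is a minimal sketch that uses Python integers as uncompressed bitmaps: one bitmap per dimension value, with bit `i` set when row `i` has that value. A multi‑dimensional filter then becomes a bitwise AND. The helper names (`build_bitmaps`, `count_bits`) and the sample rows are hypothetical, and the compression and memory‑mapping techniques the summary mentions are not shown.

```python
def build_bitmaps(rows, dimension):
    """Return {value: bitmap}, where bit i is set if rows[i][dimension] == value."""
    bitmaps = {}
    for row_id, row in enumerate(rows):
        value = row[dimension]
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row_id)
    return bitmaps


def count_bits(bitmap):
    """Count the set bits, i.e. the number of matching rows."""
    return bin(bitmap).count("1")


# Hypothetical sample data: one dict per user record.
rows = [
    {"city": "Beijing", "device": "iOS"},
    {"city": "Shanghai", "device": "Android"},
    {"city": "Beijing", "device": "Android"},
    {"city": "Beijing", "device": "iOS"},
]

city = build_bitmaps(rows, "city")
device = build_bitmaps(rows, "device")

# Users in Beijing AND on iOS: intersect the two bitmaps.
beijing_ios = city["Beijing"] & device["iOS"]

if __name__ == "__main__":
    print(count_bits(beijing_ios))
```

Because each filter is a single bitwise operation over the whole row set, adding another dimension only costs one more AND, which is the property that makes the approach attractive for ad hoc multi‑dimensional queries.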

Bitmap · Multi-dimensional Analytics · compression

0 likes · 19 min read

21CTO

Aug 11, 2015 · Big Data

Understanding MapReduce Through a Pizza Sauce Analogy

The author recounts delivering a MapReduce talk, then uses a vivid pizza sauce preparation story to illustrate how mapping chops ingredients and reducing blends them, effectively explaining distributed data processing concepts to a non‑technical audience.

Analogy · MapReduce · data-processing

0 likes · 7 min read

Efficient Ops

Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Hadoop

0 likes · 21 min read

Qunar Tech Salon

Dec 4, 2014 · Big Data

Understanding Apache Spark: Architecture, Comparison with Hadoop, Features, and Use Cases

The article explains Apache Spark’s memory‑based distributed computing model, its advantages over Hadoop’s MapReduce, key features, fault tolerance, deployment modes, ecosystem components, and the scenarios where Spark is most effective for large‑scale data analytics.

Hadoop · Spark · data-processing

0 likes · 7 min read

ITPUB

Oct 30, 2014 · Big Data

Inside Fourinone: A Lightweight Distributed Framework Challenging Hadoop

The interview with Fourinone founder Peng Yuan explores the framework's evolution from a parallel computing project to a 220 KB distributed system with its own NoSQL database engine CoolHash, compares it to Hadoop, and discusses its open‑source release, technical design choices, and real‑world deployments in finance and enterprise environments.

Big Data · CoolHash · Fourinone

0 likes · 31 min read
