Tagged articles
607 articles
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays and replaces manual rate limits. It also details the mechanism's architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data · Rate Control · Spark
Big Data Technology & Architecture
Jan 1, 2019 · Big Data

Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview

The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.

Big Data · Spark · Structured Streaming
Architects Research Society
Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache · Big Data · Flink
21CTO
Nov 20, 2018 · Big Data

What Languages and Tools Do Big Data Experts Use? Insights from 31 IT Leaders

Based on interviews with 31 IT leaders from 28 organizations, this article reveals the most popular programming languages, frameworks, and platforms—such as Python, Scala, Spark, Kafka, TensorFlow, and Tableau—currently driving big‑data extraction, analysis, and reporting, and highlights emerging trends and tool preferences.

Big Data · Kafka · Python
dbaplus Community
Nov 4, 2018 · Databases

How Spark Turns Traditional Databases into Powerful OLAP Engines

This article examines why traditional relational databases like MySQL struggle with analytical workloads, compares ROLAP and MOLAP approaches, explains Spark’s architecture and its advantages for OLAP, and details how Alibaba Cloud’s DRDS HTAP leverages a Spark‑based engine to deliver real‑time distributed query processing.

Data Warehouse · Distributed Systems · HTAP
Tencent Cloud Developer
Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Data Lake · Data Warehouse
dbaplus Community
Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast Variables · Kryo Serialization · RDD
Alibaba Cloud Developer
Aug 13, 2018 · Big Data

How Ele.me Evolved Its Real‑Time Engine: From Storm to Flink

This article examines Ele.me’s big‑data platform evolution, comparing Storm, Spark Streaming, Structured Streaming, and Flink, detailing their architectures, consistency semantics, performance trade‑offs, and why Flink became the preferred real‑time computation engine for the company.

Big Data · Flink · Spark
ITPUB
Jun 14, 2018 · Big Data

Why Suning.com Sticks with Hadoop: Insights into China’s Big Data Platform Choices

Amid declining Hadoop usage reports, Suning.com’s 2018‑2020 big‑data platform case study reveals why the retailer still relies on Hadoop’s mature ecosystem, how it integrates HDFS, HBase, YARN, Hive, Spark, Flink and emerging tools, and what future resource‑management plans it envisions.

Data Platform · Flink · Hadoop
ITPUB
Jun 10, 2018 · Big Data

13 Must‑Know Open‑Source Tools in the Hadoop Ecosystem

This article introduces Hadoop’s origins and core challenges, then presents thirteen essential open‑source tools spanning resource scheduling, real‑time query engines, and additional processing frameworks, detailing each project's purpose, key features, and repository locations to help practitioners choose the right component for big‑data workloads.

Hadoop · Impala · Spark
ITPUB
Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big-data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Ecosystem · Gartner
ITPUB
Jun 3, 2018 · Big Data

Spark vs Hadoop: Which Distributed System Fits Your Data Needs?

An in‑depth comparison of Hadoop and Spark examines their architectures, performance, cost, security, and machine‑learning capabilities, helping readers decide which open‑source distributed processing platform best matches their batch, streaming, and analytical workloads.

Big Data · Cost · Hadoop
ITPUB
Jun 2, 2018 · Big Data

Mastering Spark: Core Concepts, Architecture, Streaming & Performance Tuning

This comprehensive guide explains Spark's ecosystem, execution principles, key features, deployment architectures, core concepts like RDD, Transformations, Actions, Jobs, Stages, Shuffle and Cache, as well as Spark Streaming mechanics and practical resource‑tuning tips for optimal big‑data processing.

Big Data · Cluster · RDD
ITPUB
May 31, 2018 · Big Data

Mastering Spark on DataMagic: Fast‑Track Your Big Data Skills

This article explains Spark's role in the DataMagic platform, outlines four practical steps to quickly master Spark, details key configuration and parallelism settings, shows how to modify Spark code, and provides operational tips for cluster management and job troubleshooting.

Big Data · Cluster Management · Configuration
dbaplus Community
May 30, 2018 · Big Data

Understanding Spark Executor Memory Management: On‑Heap, Off‑Heap, and Unified Strategies

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap allocation, static versus unified memory managers, storage and execution memory handling, RDD persistence levels, eviction policies, and shuffle memory usage, providing practical formulas and configuration tips for optimal performance.

Big Data · Executor · Memory Management
Tencent Cloud Developer
Apr 12, 2018 · Big Data

Spark Usage in DataMagic Platform: A Practical Guide

This guide explains how DataMagic leverages Spark on YARN for fast, scalable offline analytics—covering Spark’s core role, four steps to master its terminology, configurations, parallelism, and code modification, plus practical deployment scripts, dynamic resource tuning, MongoDB export, job troubleshooting, and cluster upkeep for trillion‑record workloads.

DataMagic · Spark · Spark optimization
Qunar Tech Salon
Apr 9, 2018 · Big Data

Analysis of Apache Spark 2.2.1 Memory Management Model

This article examines Spark's unified memory manager in version 2.2.1, detailing on‑heap and off‑heap memory regions, the four on‑heap memory pools, dynamic execution‑storage memory sharing, task memory accounting, and provides concrete calculation examples to explain UI discrepancies and runtime memory limits.

Big Data · Executor · Memory Management
MaGe Linux Operations
Jan 14, 2018 · Artificial Intelligence

7 Essential Python Tools Every Data Scientist Must Master

This article introduces seven must‑know Python tools—including IPython, GraphLab Create, Pandas, PuLP, Matplotlib, Scikit‑Learn, and Spark—explaining their key features and how they empower data scientists to work efficiently in production environments.

Data Science · GraphLab · IPython
Meituan Technology Team
Oct 12, 2017 · Big Data

Spark Aggregation Operations Deep Dive

The seminar provides professionals with a comprehensive deep dive into Spark’s aggregation mechanisms, examining memory overhead, performance bottlenecks, and practical optimization techniques for large‑scale distributed data processing, enabling attendees to tackle real‑world big‑data challenges more efficiently.

Spark · Technical Seminar · data aggregation
dbaplus Community
Sep 26, 2017 · Big Data

How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

Accumulator · Memory Management · Spark
21CTO
Sep 26, 2017 · Big Data

How NTE Algorithm Accelerates New Common‑Friend Discovery in Billion‑Scale Graphs

This article introduces the NTE (New Triangle Enumeration) algorithm, a divide-and-conquer approach that transforms the computation of newly added common friends in massive social graphs into efficient triangle enumeration tasks, and details implementations using GraphX-based GTE, join-based JTE, and sort-based STE methods.

GraphX · Social Network Analysis · Spark
Qunar Tech Salon
Sep 25, 2017 · Big Data

Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Big Data · Data Warehouse · Hive
Hujiang Technology
Sep 15, 2017 · Fundamentals

Technical Salon – Evolving Architecture Practices (Shanghai, Sep 24)

The event showcases three technical talks covering the evolution of Hujiang's storage architecture, the transformation of the Dianrong payment system into a public service, and an introduction to the TiSpark project that integrates TiDB with Spark, highlighting design choices, trade‑offs, and future directions.

Ceph · Spark · TiDB
dbaplus Community
Aug 21, 2017 · Big Data

How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples

This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.

Data Skew · Map-side Join · Partitioner
21CTO
Aug 13, 2017 · Artificial Intelligence

How Distributed Machine Learning Platforms Compare: Spark, PMLS, TensorFlow

This article surveys distributed machine‑learning platforms, classifies them into basic data‑flow, parameter‑server, and advanced data‑flow models, examines Spark, PMLS (Petuum), TensorFlow and MXNet, presents performance comparisons on EC2 instances, and discusses bottlenecks, fault tolerance, and future research directions.

Parameter Server · Performance Evaluation · Spark
High Availability Architecture
Aug 2, 2017 · Artificial Intelligence

A Comparative Study of Distributed Machine Learning Platforms: Design Methods and Evaluation

This article surveys design approaches for distributed machine learning platforms, classifies them into basic dataflow, parameter‑server, and advanced dataflow models, examines examples such as Spark, PMLS, TensorFlow and MXNet, and presents performance evaluations and future research directions.

Parameter Server · Performance Evaluation · Spark
High Availability Architecture
Jul 19, 2017 · Artificial Intelligence

Weiflow: A Scalable Machine Learning Workflow Framework for Sina Weibo

The article introduces Weiflow, a dual‑layer DAG‑based machine‑learning workflow framework designed for Sina Weibo, and explains how its modular XML configuration, Scala implementation, and integration with Spark, TensorFlow, Hive, Storm, and Flink improve development efficiency, scalability, and execution performance across the entire ML pipeline.

Big Data · DAG · Scala
High Availability Architecture
Jul 12, 2017 · Artificial Intelligence

Machine Learning Platform and Risk‑Control Applications at DianRong Net

The article presents a comprehensive overview of DianRong Net's in‑house machine‑learning platform built on Spark, its workflow, pain points it addresses, risk‑control case studies using graph mining, and practical tips for improving model performance through data, algorithms, hyper‑parameter tuning and ensemble methods.

Big Data · Model Optimization · Spark
21CTO
Jun 9, 2017 · Big Data

From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

Big Data · Hive · Spark
Architecture Digest
Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Hive · Kafka
MaGe Linux Operations
May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · MapReduce
Suning Technology
May 18, 2017 · Big Data

Why Apache Flink Beats Spark and Storm in Stream Processing

This article examines Apache Flink's stream‑processing architecture, compares its native streaming model, fault‑tolerance, performance and SQL capabilities with Spark and Storm, and concludes that Flink offers a more powerful and efficient solution despite some maturity gaps.

Apache Flink · Spark · Storm
ITPUB
May 8, 2017 · Big Data

Master Spark Performance: Practical Tuning Tips and Real‑World Examples

This article explains essential Spark concepts, illustrates common performance bottlenecks, and provides concrete tuning strategies for memory, CPU, serialization, data locality, file I/O, and shuffle reduction, backed by real‑world examples and visual metrics.

Big Data · CPU optimization · Configuration
MaGe Linux Operations
May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive
Architecture Digest
Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop
Qunar Tech Salon
Apr 11, 2017 · Big Data

Implementing Dynamic Scaling for Spark on Mesos Using Marathon and Docker

This article describes how a team migrated Spark 1.6.x running on Mesos to a Marathon‑Docker based architecture that provides dynamic executor scaling, resolves configuration and resource‑allocation issues, and improves monitoring, fault‑tolerance, and upgrade processes for large‑scale streaming workloads.

Docker · Dynamic Scaling · Marathon
ITPUB
Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · MapReduce · RDD
Architecture Digest
Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL
Qunar Tech Salon
Jan 24, 2017 · Artificial Intelligence

Practical Approaches to Deploying Machine Learning Models: Real‑time SOA, PMML, Rserve, and Spark

This article shares practical engineering experiences for deploying machine learning models in various scenarios—real‑time low‑volume predictions via Rserve or Python‑httpserve, high‑throughput real‑time serving with PMML‑wrapped Java classes, and offline batch predictions using simple shell scripts—detailing tools, performance considerations, and implementation steps.

Model Deployment · PMML · Python
Ctrip Technology
Jan 5, 2017 · Artificial Intelligence

Practical Approaches to Deploying Machine Learning Models: PMML, Rserve, and Spark in Production

This article shares practical engineering experiences for deploying machine learning models in production, covering three typical scenarios—real‑time small data, real‑time large data, and offline predictions—and detailing how to use PMML, Rserve, Spark, shell scripts, and related tools to meet performance and operational requirements.

Model Deployment · PMML · Rserve
Hulu Beijing
Nov 29, 2016 · Big Data

How Hulu’s Segmentation System Powers Big Data Marketing at Scale

At the 2016 WOT Big Data Technology Summit, Hulu’s senior R&D manager Zhao Kunliang presented the company’s Segmentation system, detailing its Hadoop‑based architecture, Spark and Spark Streaming processing, the custom Nesto query engine, and the challenges and innovations involved in supporting large‑scale marketing and advertising analytics.

Hadoop · Nesto · Segmentation system
StarRing Big Data Open Lab
Nov 11, 2016 · Big Data

Why SQL Still Rules Big Data—and How NoSQL & NewSQL Fit In

The article explores the evolution of data processing from Hadoop and Spark to modern SQL, NoSQL, and NewSQL solutions, comparing their architectures, performance trade‑offs, and use‑cases, while illustrating concepts with examples like MapReduce, Hive, Impala, and streaming platforms such as Storm.

Big Data · Hadoop · NewSQL
StarRing Big Data Open Lab
Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Hadoop · Real-time analytics
Ctrip Technology
Aug 19, 2016 · Big Data

Ctrip's Big Data Architecture and Personalized Recommendation System

This article describes how Ctrip transformed its traditional application architecture into a high‑concurrency, big‑data‑driven platform, detailing storage, compute, and business‑layer redesigns that enable massive data ingestion, real‑time user‑intent services, and a scalable personalized recommendation system.

Big Data · Ctrip · Hadoop
ITPUB
Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · Data Warehouse
Efficient Ops
Jun 30, 2016 · Big Data

How Spark Enables Real‑Time Microservice Performance Profiling

This article explains how IBM Research and Cloudinsight use Apache Spark to capture, analyze, and visualize microservice communication in real time, addressing challenges of observability, bottleneck detection, and latency attribution in large‑scale cloud environments.

Operational Monitoring · Real-time analytics · Spark

How BitMap Accelerates Active-Day Distribution Calculations in Big Data

BitMap, a space‑saving bit‑array structure, can replace costly I/O‑heavy Spark jobs for computing user active‑day distributions by converting joins and distinct operations into fast bitwise logic, enabling efficient 30‑day rolling metrics with minimal memory and superior performance, as demonstrated by real‑world benchmarks.

Active Days · Big Data · Spark
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big Data · Data Skew · Shuffle Optimization
High Availability Architecture
May 19, 2016 · Big Data

Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

DataFrames · Performance Optimization · RDD
Meituan Technology Team
May 13, 2016 · Big Data

Spark Performance Optimization Guide: Data Skew and Shuffle Tuning

This advanced Spark performance guide explains how data skew arises during shuffles and presents eight practical solutions—including Hive preprocessing, key filtering, increased shuffle parallelism, two‑stage aggregation, map joins, sampling, random prefixes, and combined strategies—while also detailing key shuffle‑tuning parameters such as spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, and spark.shuffle.manager to improve memory usage and execution speed.

Big Data · Data Skew · Performance Optimization
Meituan Technology Team
Apr 29, 2016 · Big Data

Introduction to Spark in Big Data

Apache Spark, a versatile big‑data platform supporting batch processing, SQL queries, real‑time streaming, and machine‑learning workloads, dramatically accelerates data‑intensive jobs, as demonstrated by Meituan‑Dianping, where its high‑performance engine reduces execution times and enhances scalability across diverse analytical and operational pipelines.

Batch Processing · Big Data · Spark
Architecture Digest
Apr 25, 2016 · Big Data

Curated Learning Resources for Spark and Scala Beginners

This article compiles a comprehensive list of tutorials, books, online courses, and tools to help beginners get started with Apache Spark and the Scala programming language, including setup instructions, code snippets, and links to free and paid learning materials.

Big Data · Learning Resources · Scala
21CTO
Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL
Java High-Performance Architecture
Apr 18, 2016 · Big Data

Why Spark Is Outpacing Hadoop: Speed, Real‑Time Processing, and ML Advantages

The article explains how Spark has become the leading open‑source big‑data platform, highlighting its superior speed, in‑memory processing, real‑time streaming, and built‑in machine‑learning library compared with Hadoop’s slower, disk‑based MapReduce approach and reliance on external storage and ML tools.

Big Data · Hadoop · Real-time Processing
21CTO
Apr 12, 2016 · Artificial Intelligence

Designing System and Personalized Recommendation Engines with Mahout and Spark

This article explains the architecture of both system-wide and personalized recommendation modules, compares three recommendation strategies, details the use of Apache Mahout for collaborative filtering with Java code examples, and discusses cold‑start solutions within a Spark‑Hadoop stack.

Mahout · Spark · cold start
Architecture Digest
Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · ETL
0 likes · 19 min read
Architecture Digest
Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · Kafka
0 likes · 11 min read
Architect
Mar 6, 2016 · Big Data

Clustering Geolocated User Events with DBSCAN and Spark

This article explains how to apply the DBSCAN clustering algorithm to geolocated user event data and leverage Apache Spark’s distributed processing with PairRDDs to efficiently identify frequent user regions, detect outliers, and build location‑based services such as personalized recommendations and security alerts.

Big Data · DBSCAN · Spark
0 likes · 8 min read
Architect
Feb 29, 2016 · Big Data

Design Principles of Real-Time Distributed Streaming Systems: A Comparison of Spark and Storm

This article examines the design considerations of real-time distributed streaming systems, outlines their background and characteristics, compares the architectures of Spark Streaming and Storm, discusses primitives, message passing, high availability, storage models, and integration with production environments, providing practical insights for architects.

Distributed Systems · Real-time Processing · Spark
0 likes · 20 min read
ITPUB
Jan 20, 2016 · Big Data

How Meizu Built an Agile Big Data Platform for Millions of Users

The Meizu Tech Open Day showcased the company's rapid evolution into a data‑driven mobile internet firm, detailing its DW1.0 and DW2.0 data‑warehouse architectures, recommendation pipelines, Spark adoption, and ELK‑based log analytics, while sharing practical lessons and future challenges.

Big Data · Data Architecture · Data Warehouse
0 likes · 11 min read
Architect
Dec 31, 2015 · Big Data

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.

Big Data · Intelligent QA · New Word Discovery
0 likes · 15 min read
Architect
Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Data Warehouse
0 likes · 10 min read
dbaplus Community
Nov 27, 2015 · Big Data

Why Spark Is the Next Big Thing in Big Data: Core Concepts Explained

This article provides a comprehensive overview of Apache Spark, covering its origins, core concepts such as RDDs, transformations, actions, dependencies, execution modes, and key components like Spark SQL, Streaming, MLlib, and GraphX, while also offering practical code examples and visual illustrations.

DataFrames · GraphX · MLlib
0 likes · 18 min read
21CTO
Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop
0 likes · 17 min read

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, the pitfalls encountered, and future plans for further optimization.

Data Platform · Hadoop · Spark
0 likes · 16 min read
Architect
Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop
0 likes · 12 min read
Efficient Ops
Oct 14, 2015 · Big Data

Spark vs Hadoop, Flink, HBase/Cassandra, Kafka & Tachyon: Expert Q&A

During a lively “Sit and Discuss” session, experts compared Spark and Hadoop, evaluated Flink against Spark, contrasted HBase with Cassandra, explained why Kafka (and sometimes Flink) is preferred for distributed messaging, and shared insights on Tachyon’s role in modern big‑data ecosystems.

Flink · HBase · Hadoop
0 likes · 10 min read