Tagged articles
Alibaba Cloud Developer
Jun 16, 2020 · Databases

How Youku Scales Billions of Video Nodes with Real‑Time Graph Databases

Facing billions of video entities and edges, Youku’s engineering team replaced traditional relational stores with a graph‑based knowledge platform, leveraging Alibaba’s Blink streaming engine and Lindorm to enable real‑time, incremental updates, unified UDF logic, and scalable feature computation for search and recommendation.

Big Data · Graph Database · Knowledge graph
0 likes · 10 min read
Big Data Technology & Architecture
Jun 16, 2020 · Big Data

Hot‑Warm Architecture in Elasticsearch 5.x: Node Types, Index Allocation and Curator Automation

The article explains how to design a time‑based Elasticsearch cluster using a hot‑warm architecture with dedicated master, hot, and warm nodes, shows how to configure node attributes, allocate indices via settings or Curator, and discusses best‑practice compression and rollover strategies for large‑scale log data.

Big Data · Elasticsearch · Hot‑Warm Architecture
0 likes · 8 min read
Big Data Technology & Architecture
Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · HiveQL
0 likes · 19 min read
JD Retail Technology
Jun 15, 2020 · Industry Insights

How JD.com’s Smart Supply Chain Powered the 618 Mega‑Sale: Strategies & Algorithms

The article details JD.com’s Y Business Management Department’s data‑driven, algorithmic approaches to inventory forecasting, replenishment, allocation, and fulfillment during the 618 promotion, describing how big‑data predictions, dynamic programming, ADMM column generation, and cross‑department collaboration optimized costs, reduced stockouts, and enhanced customer experience amid pandemic challenges.

Algorithmic Forecasting · Big Data · E‑commerce Operations
0 likes · 21 min read
DataFunTalk
Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data · Cost Optimization · Spark
0 likes · 19 min read
DataFunTalk
Jun 14, 2020 · Big Data

Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI

This article describes how iQIYI evaluated various OLAP engines, selected Apache Druid for real‑time analytics, detailed its architecture, identified performance bottlenecks in Coordinator, Overlord and indexing, applied configuration and resource‑allocation optimizations, and built a user‑friendly RAP platform to democratize real‑time data analysis.

Apache Druid · Big Data · Data Platform
0 likes · 15 min read
Beike Product & Technology
Jun 12, 2020 · Big Data

Design and Implementation of SQL on Streaming (SQL 1.0 → SQL 2.0) in a Real‑Time Computing Platform

This article describes the evolution of a real‑time computing platform from SQL 1.0 built on Spark Structured Streaming to SQL 2.0 powered by Flink‑SQL, covering dynamic tables, continuous queries, dimension‑table joins, cache optimization, DDL extensions, platformization, operational challenges and future roadmap.

Big Data · Dimension Table · Flink
0 likes · 19 min read
Alibaba Cloud Developer
Jun 12, 2020 · Operations

How VTrace Automates Cloud‑Scale Packet‑Loss Diagnosis

VTrace is an automated diagnostic system that leverages big‑data processing to instantly reconstruct traffic paths and pinpoint the root causes of persistent packet loss in cloud‑scale overlay networks, dramatically simplifying network operations and cutting troubleshooting time from hours to minutes.

Big Data · Packet Loss · SIGCOMM
0 likes · 12 min read
Architect
Jun 11, 2020 · Big Data

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

This article explains Apache Flink's distributed system architecture—including JobManager, ResourceManager, TaskManager, and Dispatcher—covers session and job deployment modes, data transfer mechanisms, event‑time handling with watermarks, various state types and backends, scaling strategies, and the checkpoint/savepoint recovery process.

Apache Flink · Big Data · Event Time
0 likes · 15 min read
Big Data Technology Architecture
Jun 11, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction

This article analyzes workflow scenarios in data warehouse construction, proposes an optimization scheme that abstracts workflow nodes into task and instance layers, and demonstrates how task attributes and generation rules can improve configurability, dependency management, and collaborative development for large‑scale data warehouse projects.

Big Data · ETL · Workflow
0 likes · 19 min read
DataFunTalk
Jun 11, 2020 · Big Data

Real-time Multi-dimensional Analytics and SlimBase State Backend at Kuaishou: Flink Applications and Optimizations

This article presents Kuaishou's extensive use of Apache Flink for real-time multi-dimensional analytics, detailing the platform's architecture, cluster scale, data processing pipelines, the design of a shared state storage engine called SlimBase, and performance improvements achieved through replacing RocksDB with a customized HBase‑based solution.

Big Data · Flink · Kuaishou
0 likes · 15 min read
Alibaba Cloud Developer
Jun 11, 2020 · Artificial Intelligence

How to Maximize Video Views with a Multi‑Objective Exposure Optimization Model

This article presents a data‑driven approach for allocating limited video exposure resources by building a PV‑click‑CTR (P2C) sensitivity model and a multi‑objective optimization framework that balances overall view volume and fairness across scenes, validated through offline metrics and online bucket tests.

Big Data · algorithm · exposure optimization
0 likes · 9 min read
58 Tech
Jun 10, 2020 · Big Data

Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

Big Data · Data Quality · Flink
0 likes · 9 min read
Big Data Technology & Architecture
Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big Data · Dstream · Scala
0 likes · 17 min read
Big Data Technology & Architecture
Jun 7, 2020 · Big Data

A Unified View of SQL‑on‑Hadoop Systems: Architecture, Execution Plans, Optimizations, and Storage Formats

The article provides a comprehensive overview of SQL‑on‑Hadoop query engines such as Hive, Impala, Presto and Spark SQL, comparing their runtime frameworks, core components, compilation steps, optimizer strategies, CPU/IO efficiency techniques, storage formats like ORC and Parquet, and resource management in a unified perspective.

Big Data · Query Engine · SQL on Hadoop
0 likes · 24 min read
Big Data Technology & Architecture
Jun 4, 2020 · Big Data

Kafka for Data Ingestion and Event Distribution: Production‑Consumer and Publish‑Subscribe Patterns

This article explains how Kafka can be used for data ingestion and event distribution by illustrating producer‑consumer and publish‑subscribe models, describing core concepts such as topics, partitions and consumer groups, and offering practical design options for handling different event scenarios.

Big Data · Event Distribution · Kafka
0 likes · 9 min read
Big Data Technology Architecture
Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop
0 likes · 23 min read
Tencent Cloud Developer
Jun 4, 2020 · Industry Insights

How Enterprise Architecture Drives Banking Digital Transformation

The article reviews a Tencent Cloud closed‑door forum where industry experts dissect the banking sector’s digital transformation, outlining enterprise‑architecture‑driven strategies, key enabling technologies such as cloud, AI, big data and blockchain, and practical insights from Q&A sessions and ThoughtWorks’ models.

Artificial Intelligence · Banking · Big Data
0 likes · 19 min read
Top Architect
Jun 4, 2020 · Big Data

Elasticsearch Deployment and Use Cases in Major Chinese Companies

This article reviews how leading Chinese internet companies such as JD.com, Ctrip, Qunar, 58.com, and Didi have adopted Elasticsearch for large‑scale order search, log analysis, real‑time monitoring, and security, describing the evolution of cluster architectures, shard strategies, multi‑cluster pipelines, and performance optimizations.

Big Data · Elasticsearch · Scalability
0 likes · 12 min read
dbaplus Community
Jun 2, 2020 · Big Data

How Cainiao Built a Scalable Real‑Time Data Warehouse with Flink

Facing growing order volumes and strict timeliness demands, Cainiao’s tech team overhauled its real‑time data warehouse by redesigning data models, adopting Flink for streaming computation, upgrading data services, and exploring innovative tools, sharing practical lessons and future directions for large‑scale logistics analytics.

Big Data · Flink · Logistics
0 likes · 18 min read
Tencent Cloud Developer
Jun 2, 2020 · Big Data

Real‑time OLAP Analytics for QQ Music Using ClickHouse and Tencent Cloud EMR

QQ Music’s new real‑time OLAP platform, built on ClickHouse, Superset and Tencent Cloud EMR, ingests petabyte‑scale streaming and batch data with SSD‑backed ZooKeeper, load‑balanced writes, optimized partitions and read/write separation, delivering second‑level query responses that are several times faster than Hive, Presto or SparkSQL and enabling self‑service BI for thousands of users.

Big Data · ClickHouse · OLAP
0 likes · 12 min read
Big Data Technology & Architecture
May 31, 2020 · Big Data

Zookeeper Architecture, Roles, and Core Mechanisms

This article provides a comprehensive overview of Apache Zookeeper, detailing its purpose as a distributed coordination service, its key uses such as cluster management, configuration management, naming, distributed locking, and queue management, as well as its architecture, message types, Znode structures, read/write processes, Zab and Fast Paxos protocols, server states, and watcher mechanism.

Big Data · Configuration Management · Distributed Coordination
0 likes · 14 min read
Architect
May 30, 2020 · Big Data

Understanding Flink’s Unified Programming API for Batch and Streaming Jobs

This article examines Apache Flink’s programming model, comparing its batch DataSet API with the streaming DataStream API, detailing class hierarchies, key code examples such as groupBy and job submission, and explaining how both paradigms are unified into a common JobGraph representation.

Batch Processing · Big Data · Flink
0 likes · 9 min read
MaGe Linux Operations
May 30, 2020 · Big Data

What Is Kafka? A Deep Dive into Distributed Streaming and Messaging

Kafka is an Apache‑hosted distributed streaming platform that provides high‑throughput, durable, publish‑subscribe messaging, originally developed by LinkedIn; this article explains its core concepts, message system classifications, architecture components, APIs, replication, consumer groups, and guarantees, comparing it with other messaging solutions.

Big Data · Distributed Streaming · Kafka
0 likes · 17 min read
iQIYI Technical Product Team
May 29, 2020 · Big Data

iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation

iQiyi’s full‑link automated monitoring platform unifies tracing, metric and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts and rapid root‑cause diagnosis that cut error‑lookup time by over 50% and now serves as a core component of the company’s microservice reference architecture.

Architecture · Big Data · Metrics
0 likes · 12 min read
Big Data Technology & Architecture
May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce
0 likes · 11 min read
Big Data Technology & Architecture
May 26, 2020 · Information Security

Step-by-Step Guide to Integrating Kerberos Authentication with the Cloudera Platform

This article provides a comprehensive tutorial on Kerberos fundamentals, its authentication workflow, and detailed procedures for installing, configuring, and enabling Kerberos security on a Cloudera (Hadoop) cluster running on CentOS, including code snippets, configuration files, and post‑deployment testing steps.

Authentication · Big Data · Cloudera
0 likes · 17 min read
StarRing Big Data Open Lab
May 26, 2020 · Cloud Computing

How TCOS 2.0 Empowers Big Data, AI, and Cloud Workloads with Enhanced Compatibility

TCOS 2.0, the container operating system from Transwarp, expands compatibility to Windows, ARM, MIPS, and domestic platforms, adds GPU heterogeneous scheduling, HPA autoscaling, enhanced local storage management, and improved monitoring, providing a robust foundation for big data, AI, and cloud-native applications.

Big Data · Container · GPU scheduling
0 likes · 11 min read
dbaplus Community
May 24, 2020 · Big Data

Why Cross-Index Queries Matter in Elasticsearch and How to Implement Them

This article explains why Elasticsearch cross-index queries are essential, outlines their technical principles, showcases classic use cases such as business analytics, big‑data pipelines and log management, and provides practical methods, code examples, and performance considerations for effective implementation.

Big Data · Cross-Index Query · Elasticsearch
0 likes · 10 min read
macrozheng
May 21, 2020 · Big Data

Mastering Kafka: Core Concepts, Architecture, and Reliability Guarantees

This comprehensive guide covers Kafka's definition, publish/subscribe model, key components, storage mechanisms, producer and consumer strategies, and reliability features such as ACK levels, ISR, and exactly‑once semantics, providing a solid foundation for real‑time big‑data processing.

Big Data · Distributed Systems · Kafka
0 likes · 16 min read
DataFunTalk
May 18, 2020 · Artificial Intelligence

Intelligent Investment Research and Financial Sentiment Monitoring with NLP and Big Data

This article describes how advanced natural‑language‑processing, big‑data, and deep‑learning techniques are integrated into an end‑to‑end platform for financial asset management, covering large‑scale bid‑tender text analysis, few‑shot sentiment monitoring, model architectures, data‑enhancement methods, and practical deployment results.

Big Data · Financial AI · NLP
0 likes · 28 min read
Selected Java Interview Questions
May 16, 2020 · Big Data

How Reddit Counts Page Views at Scale Using HyperLogLog and Kafka

The article explains Reddit's large‑scale page‑view counting system, detailing its real‑time requirements, the challenges of naive hash‑set storage, and how a hybrid approach using linear probabilistic counting and HyperLogLog together with Kafka, Redis, and Cassandra achieves accurate, low‑memory, near‑real‑time analytics.

Big Data · HyperLogLog · Kafka
0 likes · 7 min read
Big Data Technology & Architecture
May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a comprehensive step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, verification steps, and detailed solutions to common errors such as Zookeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop
0 likes · 26 min read
Didi Tech
May 15, 2020 · Artificial Intelligence

Key Factors for Effective Data Product Development and Algorithm Engineer Evaluation

Effective data product development hinges on deep business understanding, clear metric decomposition, rigorous model evaluation, and translating technical performance into business impact, while algorithm engineers are best assessed by publication quality, problem significance, algorithmic contribution, and practical interview questions on model tuning and improvement.

Big Data · Data Product · algorithm evaluation
0 likes · 10 min read
DataFunTalk
May 14, 2020 · Big Data

Building a Real-Time Data Warehouse at Cainiao: Architecture, Model Upgrades, Engine Enhancements, and Service Innovations

This article shares Cainiao's practical experience in constructing a real-time data warehouse, covering the shortcomings of the previous architecture, the evolution of data models, the migration to Flink with advanced features like retraction and timer services, and the modernization of data services and tooling to support high‑throughput logistics scenarios.

Big Data · Data Service · Flink
0 likes · 16 min read
Top Architect
May 14, 2020 · Big Data

Kafka Overview, Architecture, Installation, and Operational Guide

This article provides a comprehensive introduction to Kafka, covering its definition, message queue concepts, architecture components, installation steps, configuration details, startup procedures, operational commands, producer and consumer mechanisms, reliability guarantees, partition strategies, offset management, and performance optimizations.

Big Data · Consumer · Installation
0 likes · 22 min read
Big Data Technology & Architecture
May 13, 2020 · Big Data

Analysis of Hadoop HDFS Data Read and Write Process

This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with NameNode and DataNodes, the role of FsDataInputStream and FsDataOutputStream, block location retrieval, pipeline replication, and file closure steps.

Big Data · Data Read · Data Write
0 likes · 8 min read
dbaplus Community
May 12, 2020 · Cloud Native

Migrating Massive Big‑Data Services to Kubernetes: Lessons from Tongcheng‑eLong

This article details how Tongcheng‑eLong transitioned from Docker‑Host deployments to a Kubernetes‑based platform for hundreds of storage and compute services, covering network integration, IP management, service synchronization, storage strategies, operator development, monitoring, logging, and the challenges and future plans they encountered.

Big Data · Cloud Native · Docker
0 likes · 17 min read
Architect
May 12, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.

Apache Hudi · Big Data · Data Lake
0 likes · 11 min read
Tencent Tech
May 11, 2020 · Big Data

How Tencent Scaled Elasticsearch to Thousands of Nodes: Core Kernel Optimizations Revealed

This article details Tencent's large‑scale Elasticsearch deployment, covering its massive usage scenarios, the availability, performance, cost and scalability challenges faced, and the comprehensive kernel‑level optimizations—including memory‑based throttling, storage‑model merging, off‑heap caching, rollup and metadata improvements—that enable PB‑level clusters with high reliability and low expense.

Big Data · Distributed Systems · Elasticsearch
0 likes · 27 min read
Big Data Technology & Architecture
May 10, 2020 · Big Data

Apache Beam Overview: Architecture, Programming Model, PCollection, Pipeline and Transform

This article provides a comprehensive introduction to Apache Beam, covering its unified batch‑and‑stream processing architecture, programming model, workflow patterns, Lambda and Kappa architectures, the characteristics of PCollection, pipeline construction, core transforms, I/O handling, and includes practical code examples.

Apache Beam · Big Data · Lambda architecture
0 likes · 14 min read
Java Captain
May 8, 2020 · Big Data

Elasticsearch Adoption and Architecture Cases in Major Chinese Companies

The article surveys how leading Chinese tech firms such as JD Daojia, Ctrip, Qunar, 58.com, and Didi have adopted Elasticsearch for large‑scale search, real‑time analytics, and security, detailing their evolving cluster architectures, shard strategies, data volumes, and supporting services.

Architecture · Big Data · Distributed Systems
0 likes · 11 min read
Architecture Digest
May 8, 2020 · Big Data

Elasticsearch Adoption Cases in Chinese Companies: JD.com, Ctrip, Qunar, 58.com, Didi and More

This article surveys how major Chinese internet companies such as JD.com, Ctrip, Qunar, 58.com and Didi have adopted Elasticsearch and the Elastic Stack for high‑volume order queries, log analysis, real‑time monitoring, security analytics, and large‑scale distributed search, describing their architecture evolution, shard strategies, and operational practices.

Big Data · Log Analytics · Search Architecture
0 likes · 16 min read
HomeTech
May 7, 2020 · Big Data

Construction and Evaluation of User Profiles: Identification, Tagging, Storage, and Quality Assessment

This article explains how to build user profiles by distinguishing persona from profile, describing the evolution of ID‑mapping techniques, designing a multi‑layer tag system, implementing statistical, interest, and model tags, storing the data in Hive, HBase, Codis and Elasticsearch, and finally evaluating profile timeliness, coverage and accuracy.

Big Data · data storage · data tagging
0 likes · 11 min read
Big Data Technology & Architecture
May 6, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring a Hadoop Cluster on Three Virtual Machines

This article provides a comprehensive, hands‑on tutorial for preparing three VMs, installing JDK and Hadoop, configuring core‑site.xml, hdfs‑site.xml, mapred‑site.xml, yarn‑site.xml, setting environment variables, distributing the package, starting HDFS and YARN, and verifying the cluster via web UI and jps commands.

Big Data · Cluster Setup · HDFS
0 likes · 14 min read
Architecture Digest
May 4, 2020 · Databases

HBase Overview, Architecture, Installation, and Basic Shell Operations

This article provides a comprehensive introduction to HBase, covering its origins, key characteristics, architecture components, installation steps, basic shell commands for table management, data structures, read/write processes, and high‑availability configuration within the Hadoop ecosystem.

Big Data · HBase · Hadoop
0 likes · 14 min read
Architecture Digest
May 3, 2020 · Big Data

Kafka Concept Overview

This article provides a comprehensive introduction to Kafka, covering its definition, message‑queue models, architecture components, installation steps, configuration details, producer and consumer mechanisms, reliability guarantees, partition assignment strategies, offset management, and high‑performance read/write techniques.

Big Data · Consumer · Kafka
0 likes · 20 min read
Open Source Linux
May 2, 2020 · Big Data

Mastering Apache Zookeeper: Core Concepts and Real-World Big Data Applications

Apache Zookeeper is an open‑source coordination service that provides reliable distributed synchronization, configuration management, and naming for big‑data components such as Hadoop, HBase, and Kafka, offering features like hierarchical znode structures, watches, master election, and distributed locks to maintain cluster health.

Apache Zookeeper · Big Data · Cluster Management
0 likes · 17 min read
Didi Tech
Apr 30, 2020 · Big Data

Didi’s Real‑Time Computing Practices with Apache Flink and StreamSQL

Didi has unified its real‑time computing on Apache Flink, creating an enhanced StreamSQL service with extended DDL, built‑in parsers and UDX, supporting thousands of nodes, millions of jobs, and trillions of daily records, while addressing state management, high availability, multi‑language UDFs, and pursuing real‑time ML and data‑warehouse integration.

Apache Flink · Big Data · Didi
0 likes · 13 min read
Zhengtong Technical Team
Apr 30, 2020 · Big Data

Design and Performance Optimization of an Intelligent Search System for City Operations Big Data Center

This article describes the background, requirement‑driven prototype design, Elasticsearch‑based query‑DSL selection, and extensive performance tuning—including hardware configuration, indexing parameters, JVM and garbage‑collector adjustments—that enabled real‑time ingestion of hundreds of thousands of records and sub‑second search responses for a city‑wide data platform.

Big Data · Cluster Tuning · Elasticsearch
0 likes · 12 min read
Alibaba Cloud Developer
Apr 28, 2020 · Big Data

How Alibaba Tests Big Data AI Applications: Six Challenges and Solutions

This article explains how Alibaba's search, recommendation, and advertising platforms handle the unique quality challenges of big‑data AI applications, detailing six major testing problems and the comprehensive strategies—including functional, real‑time, performance, and stability testing—used to ensure reliable online services.

AI testing · Big Data · DevOps
0 likes · 27 min read
Suning Technology
Apr 27, 2020 · Artificial Intelligence

How 20 Retail Tech Companies Navigate Pandemic Challenges and Spot New Opportunities

A comprehensive survey by Suning Retail Technology Institute and the Asia‑Pacific Smart Retail Industry Alliance reveals how twenty retail technology firms faced COVID‑19 disruptions, optimized operations, accelerated digital transformation, and identified emerging growth points in AI, big data, and omnichannel retail.

Artificial Intelligence · Big Data · COVID-19 Impact
0 likes · 17 min read
dbaplus Community
Apr 26, 2020 · Big Data

Evolving from Data Warehouses to Data Middle Platforms: Architecture & Practices

This talk reviews China's big‑data evolution from early enterprise data warehouses to modern data middle platforms, outlines core architectural components, technology selections, data development practices, lifecycle and quality management, and shares practical Q&A insights for building scalable, cost‑effective data infrastructures.

Big Data · Data Architecture · Data Governance
0 likes · 28 min read
Suning Technology
Apr 26, 2020 · Big Data

How Data Is Redefining Retail: From Online‑Offline Fusion to C2M Growth

Amid the rapid rise of the digital economy, China now treats data as a new production factor, driving innovations such as 5G, AI, and big-data-enabled retail, where companies like Suning integrate online and offline channels, leverage C2M models, and boost efficiency through data-driven operations.

Big Data · C2M · China
0 likes · 7 min read
360 Tech Engineering
Apr 24, 2020 · Operations

Design and Implementation of Instance Configuration Management in xManager

This article explains how xManager implements comprehensive instance configuration management for big‑data services, covering configuration groups, versioning, node‑level differentiation, database schema, installation workflow, group creation, configuration changes, version switching, and deployment scripts.

Big Data · Configuration Management · Deployment
0 likes · 11 min read
Qunar Tech Salon
Apr 24, 2020 · Databases

Applying Apache Doris in Meituan Food Delivery Data Warehouse: Dual Engine Architecture and Performance Optimizations

The article details Meituan's food‑delivery data warehouse transformation from a MOLAP‑centric design to a dual‑engine (MOLAP + ROLAP) architecture powered by Apache Doris, describing the challenges of massive, mutable data, the technical trade‑offs, and the performance gains achieved through MPP, predicate push‑down, multi‑instance concurrency, colocate joins, and bitmap aggregation.

Apache Doris · Big Data · MOLAP
0 likes · 16 min read
DataFunTalk
Apr 22, 2020 · Big Data

Didi's Real-Time Computing Practices with Apache Flink: Architecture, StreamSQL, and Operational Insights

Senior Didi technology expert Liang Li-yin shares how Didi leverages Apache Flink for large‑scale real‑time computing, covering service architecture, StreamSQL advantages, multi‑cluster management, task control, monitoring, meta‑store integration, challenges, and future plans such as high availability, real‑time ML, and unified batch‑stream processing.

Apache Flink · Big Data · Real‑Time Computing
0 likes · 14 min read
Suning Technology
Apr 22, 2020 · Big Data

How Suning Turns Data into a New Production Factor to Revolutionize Retail

The recent Chinese policy elevating data to a production factor is illustrated by Suning’s data‑driven retail model, which uses massive user‑tag databases to create precise customer profiles, lower acquisition costs, and boost marketing efficiency, showcasing the strategic importance of big‑data assets in modern commerce.

Big Data · Digital Transformation · customer profiling
0 likes · 5 min read
Big Data Technology & Architecture
Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · Kafka · RateController
0 likes · 12 min read
Python Programming Learning Circle
Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark
0 likes · 5 min read
HomeTech
Apr 16, 2020 · Big Data

Home (ZhiJia) Distributed Task Scheduling System Overview

The article presents a comprehensive overview of the Home (ZhiJia) distributed task scheduling system, detailing its background, advantages, technology stack, architecture, core concepts, module responsibilities, IDE integration, and future improvement plans for big‑data processing workflows.

Big Data · Distributed Scheduling · Master‑Slave
0 likes · 10 min read
dbaplus Community
Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS
0 likes · 19 min read
DataFunTalk
Apr 15, 2020 · Big Data

Apache Flink OLAP Engine: Architecture, Optimizations, and Use Cases

This article presents an in‑depth overview of Apache Flink's new OLAP engine, covering OLAP fundamentals, the three OLAP models, Flink's unified streaming‑batch‑OLAP architecture, performance optimizations, benchmark results, and future development directions.

Apache Flink · Big Data · OLAP
0 likes · 11 min read
Big Data Technology Architecture
Apr 15, 2020 · Big Data

Real-Time Data Warehouse Practices: Case Studies from Meituan, NetEase, Zhihu, and OPPO

This article reviews the evolution of data warehouses from traditional offline models to modern real‑time architectures, presenting detailed case studies of Meituan, NetEase, Zhihu, and OPPO, and discusses layer designs, technology choices such as Flink, Kafka, and storage options, and key lessons for building scalable real‑time warehouses.

Big Data · Flink · Kafka
0 likes · 13 min read
Programmer DD
Apr 12, 2020 · Big Data

Master Elasticsearch: From Basics to SpringBoot Integration and Advanced Queries

This comprehensive guide introduces Elasticsearch fundamentals, its features and use cases, then walks through integrating it with SpringBoot, configuring Maven dependencies, performing index and document operations, and demonstrates a variety of query types and aggregations using both RESTful APIs and Java code examples.

Big Data · Elasticsearch · Full‑Text Search
0 likes · 46 min read
Amap Tech
Apr 10, 2020 · Backend Development

Platformization of POI Deep Information Integration at Amap: Design and Implementation

Amap transformed its fragmented POI deep‑information pipelines into a unified platform that automates data acquisition, parsing, dimension alignment, specification mapping, and lifecycle management across billions of records, enabling product managers to integrate, debug, and scale diverse content‑provider feeds with real‑time, end‑to‑end control.

Backend · Big Data · Conversion Engine
0 likes · 13 min read
Meituan Technology Team
Apr 9, 2020 · Big Data

Dual-Engine MOLAP + ROLAP Architecture with Apache Doris for Meituan Takeaway Data Warehouse

Meituan Takeaway’s data warehouse combines Apache Kylin’s MOLAP cubes for stable dimensions with Apache Doris’s MPP‑driven ROLAP engine to handle changing dimensions, detail queries, and near‑real‑time analytics, achieving millisecond‑level responses, reduced storage/compute costs, and simplifying operations across diverse analytical workloads.

Apache Doris · Big Data · MOLAP
0 likes · 18 min read
Big Data Technology & Architecture
Apr 8, 2020 · Big Data

Common Apache Flink Exceptions and How to Resolve Them

This article enumerates typical Apache Flink deployment, job, and checkpoint errors—such as JDK version issues, resource shortages, task manager timeouts, and state migration problems—and provides practical troubleshooting steps and configuration tips to help engineers quickly diagnose and fix these failures.

Big Data · Checkpoint · Exception
0 likes · 8 min read
ITPUB
Apr 6, 2020 · Big Data

How to Build a Data Lake Quickly: Strategies, Tools, and Real‑World Cases

This article explains the origins and market growth of data lakes, compares them with traditional data warehouses, showcases major implementations like Amazon Galaxy and Club Factory, and provides practical guidance on choosing open‑source or commercial cloud solutions to construct a data lake efficiently while minimizing risk.

AWS · Big Data · Cloud Computing
0 likes · 10 min read
Big Data Technology & Architecture
Apr 2, 2020 · Big Data

Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets

This article demonstrates how to create Hive tables for student, course, teacher, and score data, generate CSV files, load them into Hive, and provides a comprehensive set of Hive SQL queries covering data retrieval, aggregation, ranking, and statistical analysis for educational datasets.

Big Data · Query Examples · data-warehouse
0 likes · 21 min read
Big Data Technology & Architecture
Apr 1, 2020 · Big Data

HBase Cluster Deployment Architecture, Configuration Optimization, and Application Layer Usage

This article details the evolution of HBase cluster deployment from mixed‑hardware/software setups to fully independent clusters, explains hardware and software considerations, presents memory and region planning, outlines key configuration parameters, and provides Spark integration examples for batch and real‑time queries and writes.

Big Data · Cluster Deployment · Configuration Optimization
0 likes · 24 min read
Xianyu Technology
Mar 31, 2020 · Backend Development

Hermes Push System: Architecture and Design Overview

The Hermes Push System at Xianyu separates push decisions into three coordinated services—Configuration Center for audience and material data, Task Center for timing and orchestration, and Matching Center for real‑time content ranking—leveraging MySQL, ODPS, Flink, SchedulerX, MetaQ and Alibaba’s TPP/IGraph to boost click‑through rates, double user coverage, and achieve record daily active users, while planning to add open‑page notifications and deeper AI personalization.

Alibaba · Backend · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Mar 30, 2020 · Databases

HBase Optimization: JVM Tuning, Region Split Policies, BlockCache, and Compaction Strategies

This guide explains how to optimize HBase performance by adjusting JVM memory settings, selecting appropriate garbage collectors, configuring MSLAB and in‑memory compaction, choosing region split policies, tuning BlockCache implementations, and applying suitable compaction policies for different workloads.

Big Data · BlockCache · HBase
0 likes · 18 min read
HBase Optimization: JVM Tuning, Region Split Policies, BlockCache, and Compaction Strategies