Tagged articles

3675 articles

Feb 1, 2021 · Big Data

The Origin of Elasticsearch: From a Cooking App Prototype to a Distributed Search Engine

This article recounts how Shay Banon's early cooking‑app project led to the creation of Compass, the evolution of Apache Lucene, and ultimately the development of Elasticsearch—a powerful, distributed search platform built with extensive testing infrastructure and inspired by futuristic data‑interaction concepts.

Apache Lucene · Big Data · Elasticsearch

0 likes · 9 min read

Architects' Tech Alliance

Jan 29, 2021 · Artificial Intelligence

Comprehensive Overview of Machine Learning: Types, Industry Chain, and Key Technologies

This article provides a detailed introduction to machine learning, covering its definition, learning modes such as supervised, unsupervised and reinforcement learning, shallow versus deep learning, the full industry chain from AI chips to cloud and big‑data services, and the major open‑source frameworks and platforms driving the field.

AI chips · Big Data · Unsupervised Learning

0 likes · 11 min read

Big Data Technology & Architecture

Jan 28, 2021 · Big Data

Understanding Data Lakes: Definitions, Benefits, Architectures, and Technology Choices

Data lakes, emerging since 2020, are centralized repositories that store structured and unstructured data at any scale and offer flexible analytics, but they require robust management to avoid becoming data swamps. This article explains definitions, advantages, and typical architectures, and compares cloud and open‑source solutions such as AWS Lake Formation, Alibaba Cloud, Delta, Iceberg, and Hudi.

Analytics · Big Data · Cloud Storage

0 likes · 13 min read

JD Cloud Developers

Jan 28, 2021 · Big Data

How JD’s Energy Management Platform Leverages ClickHouse for Real‑Time OLAP at Scale

This article explains how JD’s Energy Management Platform uses ClickHouse as an MPP‑based OLAP engine to ingest, store, and provide multi‑dimensional real‑time analytics on energy consumption data, covering architecture decisions, data pipelines, replication, sharding, and a generic query interface.

Big Data · ClickHouse · OLAP

0 likes · 12 min read

Practical DevOps Architecture

Jan 28, 2021 · Operations

Step-by-Step Guide to Installing Zookeeper and Kafka on a Kubernetes Cluster

This tutorial walks through preparing three Kubernetes nodes, extracting and distributing Zookeeper, configuring its zoo.cfg and myid files, starting and verifying the Zookeeper ensemble, then installing Kafka, adjusting its server.properties, and finally launching Kafka across the cluster.

Big Data · Installation · Kafka

0 likes · 6 min read

dbaplus Community

Jan 27, 2021 · Big Data

How We Upgraded a 1500-Node Flink Cluster to 1.10: Challenges and Solutions

Facing a massive 1500‑node Flink 1.4.2 cluster handling over 12,000 tasks and 30 trillion daily events, we migrated to Flink 1.10. This article details the new DDL/Catalog support, SQL enhancements, memory tuning, compatibility patches, extensive testing, and engine optimizations such as task‑load metrics and balanced sub‑task scheduling.

Big Data · Flink · Performance Optimization

0 likes · 13 min read

Full-Stack Internet Architecture

Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · HDFS · Hadoop

0 likes · 33 min read

Alibaba Cloud Developer

Jan 25, 2021 · Big Data

Why 2020 Was the Breakthrough Year for Apache Flink’s Ecosystem

In 2020, Apache Flink surged to become the most active Apache project, releasing three major versions that advanced its unified stream‑batch engine, introduced cloud‑native K8s support, expanded AI capabilities with PyFlink, and fostered a thriving Chinese community, solidifying its role as the de‑facto standard for real‑time computing.

AI integration · Apache Flink · Big Data

0 likes · 21 min read

Architects' Tech Alliance

Jan 24, 2021 · Big Data

Outline of Distributed Storage Systems: HDFS, GlusterFS, OpenStack Swift, and Ceph

This article outlines the fundamental concepts and key issues of distributed storage, provides an overview of four open‑source distributed file systems—HDFS, GlusterFS, OpenStack Swift, and Ceph—and compares their functionalities, accompanied by illustrative slide images.

Big Data · Ceph · GlusterFS

0 likes · 2 min read

Architect

Jan 22, 2021 · Big Data

Understanding Kafka Topic Partitions, Producer Partitioning Strategies, and Consumer Assignment

This article explains how Kafka producers decide which partition to send messages to, how topic partition counts are configured, and how consumer groups assign partitions to instances using default range and round‑robin strategies, with code examples for illustration.

Big Data · Consumer · Kafka

0 likes · 17 min read

Full-Stack Internet Architecture

Jan 22, 2021 · Databases

An Overview of HBase: Architecture, Design Principles, and Performance Characteristics

This article provides a comprehensive introduction to HBase, covering its origins, column‑oriented NoSQL design, storage on HDFS, logical and physical structures, read/write workflows, performance optimizations, and common interview questions for big‑data engineers.

Big Data · Columnar Database · HBase

0 likes · 24 min read

Didi Tech

Jan 22, 2021 · Big Data

Erasure Coding Practice in HDFS at Didi: Principles, Implementation, and Lessons Learned

Didi migrated HDFS to Hadoop 3.2 and implemented erasure coding—using XOR and Reed‑Solomon RS(6,3) striping—to replace three‑replica storage for cold data, building back‑ported clients, automated conversion tools, and cross‑datacenter backup pipelines, while addressing operational bugs and noting performance trade‑offs.

Big Data · Didi · HDFS

0 likes · 11 min read

DataFunTalk

Jan 22, 2021 · Big Data

Practical Experience of Apache Flink at ByteDance: Architecture, Optimizations, and Future Directions

This article presents ByteDance's real‑world use of Apache Flink, covering the platform's overall architecture, SQL extensions, custom connectors, UI‑driven SQL platform, performance optimizations such as window mini‑batch and custom windows, dimension‑table enhancements, checkpoint recovery improvements, stream‑batch integration, and upcoming roadmap items.

Apache Flink · Big Data · ByteDance

0 likes · 15 min read

Top Architect

Jan 18, 2021 · Big Data

Migrating Over 2 Billion MySQL Records to Google BigQuery Using Kafka

This article details a real‑world solution for migrating more than two billion MySQL records to Google BigQuery by streaming data through Kafka, employing partitioned tables, data filtering, and incremental migration to avoid downtime and reduce storage costs.

Big Data · BigQuery · Data Migration

0 likes · 7 min read

New Oriental Technology

Jan 18, 2021 · Information Security

Kafka Security Authentication and Authorization Configuration Guide (SASL/PLAIN and SASL/SCRAM)

This guide explains Kafka's authentication and authorization mechanisms, covering SASL/PLAIN and SASL/SCRAM setups, JAAS file creation, server property configuration, ACL management, and provides complete Java producer and consumer examples for secure communication.

ACL · Authentication · Authorization

0 likes · 19 min read

Efficient Ops

Jan 17, 2021 · Big Data

Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

This article introduces Kafka’s fundamental role as a messaging system, explains topics, partitions, producers, consumers, replicas, consumer groups, and the controller, and explores its cluster architecture and performance optimizations such as sequential writes and zero-copy, giving a comprehensive overview for building scalable data pipelines.

Big Data · Distributed Systems · Message Queue

0 likes · 11 min read

DataFunTalk

Jan 16, 2021 · Big Data

Practical Application of Flink + Kafka at NetEase Cloud Music: Architecture, Platform Design, and Lessons Learned

This article presents a detailed case study of NetEase Cloud Music’s real‑time analytics platform built on Kafka and Flink, covering background, architectural choices, platform‑level design, operational challenges, solutions such as the Magina framework, and a Q&A on reliability and monitoring.

Big Data · Flink · Kafka

0 likes · 11 min read

Programmer DD

Jan 16, 2021 · Artificial Intelligence

Can AI Really Predict Employee Work Status? Inside Baidu’s New Patent

The article examines Baidu’s newly filed patent for predicting employee work status, explaining its big‑data‑driven methodology, the company’s claim that it is a talent‑management tool, and the broader debate over workplace surveillance amid the ongoing 996 controversy.

AI prediction · Baidu patent · Big Data

0 likes · 4 min read

Big Data Technology & Architecture

Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data · Data Platform · Hadoop

0 likes · 14 min read

DataFunTalk

Jan 15, 2021 · Big Data

Optimizing Apache Kylin for Meituan's Sales OLAP: From MapReduce to Spark and Resource Tuning

This article presents a detailed case study of how Meituan's in‑store dining sales team identified severe efficiency issues in their Apache Kylin‑based OLAP system, dissected the construction process, and applied a step‑by‑step optimization roadmap—including engine migration, dimension pruning, resource configuration, and Spark‑based layered building—to boost query performance and achieve near‑perfect SLA.

Apache Kylin · Big Data · Meituan

0 likes · 16 min read

Didi Tech

Jan 14, 2021 · Cloud Computing

Design and Implementation of Didi's Logi‑KafkaManager Multi‑tenant Kafka Cloud Platform

Didi’s Logi‑KafkaManager is a multi‑tenant Kafka cloud platform that consolidates dozens of clusters into a secure, isolated, gateway‑driven service. It offers intuitive web‑based topic management, real‑time metrics visualization, automated diagnostics, quota governance, and safe scaling, and it has earned high internal satisfaction and been commercialized for enterprise use.

Big Data · Kafka · Monitoring

0 likes · 17 min read

Meituan Technology Team

Jan 14, 2021 · Big Data

Design and Implementation of an SSD‑Based Application‑Layer Cache Architecture for Kafka in Meituan Data Platform

Meituan built an SSD‑based application‑layer cache for Kafka that bypasses PageCache contention between real‑time and delayed jobs, classifies log segments across SSD and HDD, limits flush rates, and achieves up to 80% latency reduction while guaranteeing stable real‑time consumption.

Big Data · Kafka · LogSegment

0 likes · 19 min read

NetEase Smart Enterprise Tech+

Jan 14, 2021 · Big Data

How Yidun Achieves Real-Time, High-Performance Public-Opinion Data Cleaning with Groovy and JVM

Yidun’s public-opinion monitoring platform transforms massive raw web data into a unified format by separating dynamic Groovy-script-driven cleaning from static processing, achieving real-time source integration, high throughput, scalability, and high availability while addressing format diversity, team coordination, and performance-flexibility trade-offs.

Big Data · ETL · Groovy

0 likes · 5 min read

Architects Research Society

Jan 13, 2021 · Fundamentals

Master Data Management (MDM): Concepts, Business Value, Technical Challenges, and Architectural Considerations

The article explains master data management (MDM) as a framework for creating a single, reliable source of truth, outlines its growing business relevance, discusses key technical challenges such as data governance and scalability, and explores next‑generation architectures involving graph databases, big data, and machine learning.

Big Data · Data Governance · Graph Database

0 likes · 10 min read

vivo Internet Technology

Jan 13, 2021 · Big Data

Statistical Monitoring Using Normal Distribution and Boxplot: Theory, Implementation, and API Design

The article explains the origin of the normal distribution, the central limit theorem, and how boxplots identify anomalies, then describes a Java‑based API that partitions data into five median‑centered levels using same‑period and year‑over‑year ratios to automatically detect and classify abnormal trends in daily metrics.

Big Data · Boxplot · anomaly detection

0 likes · 11 min read

dbaplus Community

Jan 11, 2021 · Databases

Why eBay Switched Its Ad Analytics from Druid to ClickHouse – A Deep Dive

eBay’s ad data platform, originally built on a custom SQL engine and later migrated to Druid, was re‑engineered to use ClickHouse, highlighting challenges such as massive data volume, atomic offline replacements, schema design, compression, and operational simplifications, and demonstrating performance and scalability gains for advertisers.

Ad Analytics · Big Data · ClickHouse

0 likes · 18 min read

Big Data Technology & Architecture

Jan 11, 2021 · Big Data

Evolution of a Real‑Time Data Warehouse Architecture and Practical Lessons

This article recounts the author’s journey building a real‑time data warehouse using Flink, Kafka, Redis, and ClickHouse, describing the initial batch‑oriented setup, successive architectural evolutions, challenges with wide tables and dimension data, and the final OLAP‑centric solution with secondary caching.

Big Data · ClickHouse · Flink

0 likes · 9 min read

DataFunSummit

Jan 10, 2021 · Big Data

Business Model and Digital Transformation of Internet Consumer Finance: A Case Study of CMB’s Flash Loan

The article analyzes the business architecture, value proposition, channels, revenue model, core resources, and digital transformation of internet consumer finance using China Merchants Bank’s fast‑approval “Flash Loan” as a case study, highlighting the role of big data, AI, and cloud computing in modern retail lending.

Big Data · Business Model · Digital Transformation

0 likes · 13 min read

Architects Research Society

Jan 9, 2021 · Big Data

Understanding Transactions in Apache Kafka: Semantics, API, and Practical Guidance

This article explains the purpose, semantics, and design of Apache Kafka’s transaction API, detailing how it enables exactly‑once processing for stream‑processing applications, the role of transaction coordinators and logs, Java API usage, performance considerations, and best‑practice guidance.

Apache Kafka · Big Data · Java API

0 likes · 19 min read

Amap Tech

Jan 8, 2021 · Industry Insights

How AI‑Driven Data Mining Revives POI Freshness: A Deep Dive into Expired POI Detection

This article examines the technical evolution of POI expiration detection, covering attribute‑based, behavior‑based, and human‑place relationship mining methods, their machine‑learning models, and how they collectively improve map freshness and user experience at scale.

AI · Big Data · Map Freshness

0 likes · 17 min read

21CTO

Jan 7, 2021 · Big Data

How Kuaishou Built a Scalable Big Data Service Platform to Eliminate Redundant Development

This article explains Kuaishou's data service platform, detailing the background challenges of high development barriers and duplicated work, the platform's architecture and key technologies such as configuration‑driven development, multi‑mode APIs, data acceleration, and high‑availability mechanisms, and concludes with future directions.

Big Data · Data Acceleration · Data Platform

0 likes · 12 min read

360 Tech Engineering

Jan 7, 2021 · Big Data

Overview of the Qirin Big Data Platform Architecture and Core Modules

The article introduces the Qirin big data platform—a one‑stop solution covering resource management, metadata, data ingestion, task development, interactive querying, and self‑service analysis—detailing its modular architecture, typical processing workflow, and future development plans for enterprise‑wide data services.

Big Data · Data Platform · Resource Management

0 likes · 11 min read

vivo Internet Technology

Jan 6, 2021 · Big Data

How HyperLogLog Estimates Cardinality in Massive Data Sets

This article explains the cardinality‑counting problem behind DAU/MAU and unique visitor metrics, compares naïve solutions like Set, Bitmap and Bloom filter, introduces big‑data algorithms such as Linear Counting, LogLog and HyperLogLog, and shows how Redis implements HyperLogLog with dense and sparse storage optimizations.

Big Data · Cardinality · HyperLogLog

0 likes · 17 min read

DataFunTalk

Jan 6, 2021 · Big Data

Didi's Presto Engine: Architecture, Optimizations, and Operational Practices

This article presents Didi's three‑year experience with Presto, detailing its architecture, low‑latency design, large‑scale deployment, extensive Hive compatibility work, resource isolation, Druid connector integration, usability enhancements, stability engineering, performance tuning, and future directions for the ad‑hoc query engine.

Big Data · Distributed Systems · Druid Connector

0 likes · 17 min read

dbaplus Community

Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big Data · Camus · Hadoop

0 likes · 16 min read

58 Tech

Jan 4, 2021 · Big Data

Building a Real‑Time Data Warehouse with Flink: Architecture, Implementation and Lessons Learned

This article describes how a fast‑growing company built a layered real‑time data warehouse on Flink, detailing the evolution from a simple 1.0 pipeline to a 2.0 architecture with ODS, DWD and ADS layers, dimension joins, exactly‑once sinks, HDFS partitioning, monitoring, and future improvements.

Big Data · ETL · Flink

0 likes · 14 min read

Alibaba Cloud Developer

Jan 4, 2021 · Databases

Why Cloud‑Native Distributed Databases Are the Future of Enterprise Data

The article reviews the evolution of database systems driven by cloud computing, big‑data demands and distributed architectures, highlights Alibaba Cloud’s cloud‑native offerings such as PolarDB and AnalyticDB, and discusses trends, security, and best practices for modern enterprise data platforms.

Alibaba Cloud · Big Data · Database Security

0 likes · 14 min read

DataFunTalk

Jan 3, 2021 · Artificial Intelligence

iQIYI Machine Learning Platform: Development History, Features, and Practical Experience

This article details the evolution of iQIYI's machine learning platform—from its early Javis‑based deep‑learning system to three major versions that introduced visual workflow, distributed scheduling, auto‑tuning, large‑scale training support, model management, and online prediction—while sharing practical lessons and a real anti‑cheat use case.

Big Data · Model Management · hyperparameter tuning

0 likes · 13 min read

Java Backend Technology

Jan 2, 2021 · Information Security

Why Your Personal Data Is Worthless: The Dark Reality of Big Data Privacy Leaks

The article exposes how the promise of big‑data convenience masks rampant privacy violations—from celebrity photo leaks and app data sales to weak legal penalties—illustrating that ordinary users’ personal information has become a cheap commodity with little protection.

Big Data · China · Data Protection

0 likes · 6 min read

DataFunTalk

Dec 31, 2020 · Artificial Intelligence

Introduction to Graph Neural Networks and Their Applications in Recommendation Systems

This article introduces graph neural networks, explains their underlying sampling and aggregation mechanisms, and demonstrates how they are applied in large‑scale recommendation scenarios such as video and content feeds at Tencent, highlighting practical results and lessons learned.

Artificial Intelligence · Big Data · GraphSAGE

0 likes · 10 min read

Tencent Cloud Developer

Dec 30, 2020 · Big Data

How Alluxio Boosts Tencent Cloud EMR: Cutting Bandwidth by 50% and Accelerating IO‑Intensive Workloads

This article analyzes the challenges of traditional monolithic big‑data architectures, explains how Tencent Cloud EMR integrates Alluxio for compute‑storage separation, presents detailed performance benchmarks showing 20‑50% bandwidth reduction and 5‑40% query speedup, and outlines the specific tuning measures applied.

Alluxio · Big Data · Cloud Computing

0 likes · 10 min read

JD Tech Talk

Dec 30, 2020 · Databases

Architecture and Application Practice of JD Urban Spatio-Temporal Data Engine (JUST)

The presentation details the design, implementation, and real‑world applications of the JD Urban Spatio‑Temporal Data Engine (JUST), a distributed, scalable database that handles massive, complex spatio‑temporal data with novel storage, indexing, and query techniques, demonstrating high performance and ease of use across smart‑city scenarios.

Big Data · GIS · Urban Computing

0 likes · 26 min read

Alibaba Cloud Developer

Dec 29, 2020 · Fundamentals

What Are the 10 Tech Trends Shaping the Post-Pandemic Era?

Alibaba DAMO Academy outlines ten pivotal technology trends for 2021, ranging from third‑generation semiconductors and quantum computing to AI‑driven drug discovery, cloud‑native IT, data‑intelligent agriculture, and smart city operation centers, highlighting how these innovations will drive post‑pandemic growth.

Artificial Intelligence · Big Data · Quantum Computing

0 likes · 9 min read

Youzan Coder

Dec 28, 2020 · Big Data

How Youzan’s BI Platform Turns Massive Data into Interactive Visual Insights

This article explains the design, features, and technical implementation of Youzan’s BI platform, covering its target users, visualization workflow, supported chart types, filtering, permissions, drill‑down, calculated fields, SQL generation logic, and future development directions.

Analytics · BI · Big Data

0 likes · 20 min read

Alibaba Terminal Technology

Dec 28, 2020 · Big Data

Unlocking Massive-Scale User Behavior Analysis: From Funnels to Intelligent Links

This talk explores how to conduct user behavior analysis on massive data sets, compares existing analytics tools, and presents Alibaba Dataworks' end‑to‑end solution—including funnel and link visualizations, a big‑data processing architecture, and future intelligent link capabilities—to uncover and resolve user‑experience issues efficiently.

Alibaba Cloud · Big Data · Data visualization

0 likes · 16 min read

Big Data Technology & Architecture

Dec 28, 2020 · Big Data

Implementing Historical Slowly Changing Dimension (Chain) Tables with PL/pgSQL

This article explains the concept of historical chain (slowly changing dimension) tables in data warehousing, demonstrates how to create source and target tables, provides a PL/pgSQL stored procedure to handle inserts, updates, and deletions, and shows step‑by‑step testing with sample SQL scripts.

Big Data · PL/pgSQL · Slowly Changing Dimension

0 likes · 10 min read

dbaplus Community

Dec 27, 2020 · Big Data

How ClickHouse Powers a 700 B‑Row Real‑Time Data Platform at Ctrip

This article details how Ctrip's senior engineering manager leveraged ClickHouse to build a high‑availability, sub‑second response data platform handling nearly 700 billion rows, describing the motivations, architecture, data synchronization processes, performance gains, challenges, and practical recommendations for large‑scale analytics.

Big Data · ClickHouse · Data Architecture

0 likes · 28 min read

Architect

Dec 27, 2020 · Big Data

Optimizing Billion‑Scale Hive Queries: Partitioning, Indexing, Bucketing, Active‑User Segmentation, and Data Structure Refactoring

This article walks through the challenges of querying a 300‑billion‑row Hive table, analyzes why traditional partitioning, indexing, and bucketing fall short, and presents a practical solution that combines active‑user segmentation and a redesigned array‑based data model to cut query time from hours to minutes.

Big Data · Data Partitioning · data modeling

0 likes · 10 min read

DataFunTalk

Dec 27, 2020 · Information Security

Evolution of 58.com Risk Control Architecture: From Early Stages to Intelligent Auditing

This talk outlines 58.com’s risk control evolution, detailing the platform’s four development stages, the challenges of fraud, fake traffic, and content abuse, and how architecture, algorithms, and operational strategies have been refined to achieve high‑throughput, intelligent auditing.

AI · Big Data · Information Security

0 likes · 12 min read

Youzan Coder

Dec 25, 2020 · Big Data

Metadata Governance and Collection in a Data Asset Platform

The platform implements comprehensive metadata governance by extracting, standardizing, and ingesting basic, trend, resource, lineage, and task metadata from offline and real‑time systems via a Kafka‑based SDK, enabling unified storage, monitoring, alerts, and future automation to improve data asset visibility and quality.

Big Data · Data Governance · Monitoring

0 likes · 18 min read

Big Data Technology & Architecture

Dec 24, 2020 · Big Data

Big Data Interview Questions and Solutions for Massive Data Processing

This article presents ten big‑data interview problems, each built around a scenario such as finding the most frequent IP, answering top‑K queries, or counting word frequencies under memory limits, and shows how techniques like hashing, bitmaps, tries, heaps, and external sorting solve them efficiently.

Algorithms · Big Data · Hashing

0 likes · 11 min read

Big Data Technology & Architecture

Dec 24, 2020 · Big Data

Common Techniques for Processing Massive Data Sets

This article summarizes a range of practical methods, including Bloom filters, hashing, bitmaps, heaps, bucket partitioning, database indexes, inverted indexes, external sorting, tries, and MapReduce, that are commonly used to handle, deduplicate, and query extremely large data volumes in big‑data applications.

Big Data · Hashing · Heap

0 likes · 11 min read

Code Ape Tech Column

Dec 23, 2020 · Fundamentals

Technical Concepts Illustrated Through Relationship Analogies

The article humorously maps various relationship scenarios to core IT concepts such as backup strategies, high‑availability mechanisms, scaling methods, security measures, cloud services, and big‑data techniques, providing an engaging overview of fundamental system design principles.

Backup · Big Data · Cloud Computing

0 likes · 8 min read

dbaplus Community

Dec 22, 2020 · Big Data

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

Big Data · Distcp · HDFS

0 likes · 16 min read

Architect

Dec 22, 2020 · Big Data

Dimensional Modeling in Data Warehousing: Concepts, Theory, and Practical Example

This article explains data warehouse fundamentals, reviews classic warehouse models such as ER, dimensional, Data Vault and Anchor, then dives deep into dimensional modeling concepts, star and snowflake schemas, and demonstrates a practical e‑commerce scenario with SQL examples and trade‑offs.

Big Data · ETL · Star Schema

0 likes · 11 min read

21CTO

Dec 21, 2020 · Big Data

5 Emerging Big Data Trends Shaping Business, Health, and Climate in 2021

This article outlines five key big‑data trends for 2021—including the rise of augmented analytics, the convergence of big data with blockchain, the growing importance of knowledge graphs, data‑driven health innovations, and climate‑focused analytics—highlighting their impact on organizations and future technological landscapes.

Big Data · Blockchain · Knowledge graph

0 likes · 8 min read

Didi Tech

Dec 21, 2020 · Big Data

HBase Availability and Latency Optimizations: Replication‑Based Multi‑Read and ZGC Adoption

To overcome HBase’s weak availability and GC‑induced latency spikes, the DiDi team introduced a replication‑based client multi‑read (hedged‑read) mechanism and migrated to the Z Garbage Collector, which together dramatically cut maximum and 99.9th‑percentile latencies while keeping services online during region disruptions.

Big Data · HBase · Low latency

0 likes · 12 min read

Full-Stack Internet Architecture

Dec 20, 2020 · Big Data

Using Flinkx for Data Synchronization in Sharded MySQL Environments

This article explains how to use FlinkX and the Flink Stream API to create a unified data‑sync task that extracts data from sharded MySQL tables, splits the workload, and pushes it to an MQ cluster, while detailing the underlying InputFormat and Reader architecture.

Big Data · Flink · FlinkX

0 likes · 8 min read

Python Crawling & Data Mining

Dec 19, 2020 · Big Data

Scrape and Analyze Bilibili’s “马保国” Videos with Python – A Complete Guide

This tutorial shows how to use Python to fetch data from Bilibili’s “马保国” channel via its public API, extract video metadata, clean and visualize 14,000 records, and generate insights such as top‑viewed videos and a comment word cloud.

Big Data · Bilibili · Python

0 likes · 5 min read

Youzan Coder

Dec 18, 2020 · Big Data

Design and Implementation of a Configurable Real-Time Rule Engine for Live‑Streaming Product Audits

The paper presents a configurable real‑time rule engine for live‑streaming product audits that decouples data aggregation from rule execution, uses QLExpress for dynamic conditions, supports Dubbo and HTTP sources, and enables safe gray‑release updates, cutting the rule‑change cycle from weeks to near‑real‑time.

Big Data · QLExpress · configuration

0 likes · 8 min read

Laiye Technology Team

Dec 18, 2020 · Big Data

Comprehensive Overview of Laiye Technology's Business Intelligence Ecosystem

This article provides a detailed, end‑to‑end description of Laiye Technology's BI ecosystem, covering its background, development stages, data acquisition, transmission, transformation, loading, modeling, storage layers, statistical analysis, real‑time metrics, visualization, and future challenges, illustrating how the company builds a scalable, cloud‑native data‑driven platform.

Analytics · BI · Big Data

0 likes · 22 min read

Alibaba Cloud Developer

Dec 17, 2020 · Big Data

Why GraphScope is Revolutionizing Large-Scale Graph Computing for AI and Big Data

GraphScope, an open‑source one‑stop platform from Alibaba DAMO Academy, unifies interactive queries, graph analytics, and graph learning on massive, rapidly evolving graphs, offering high‑performance distributed memory management, Gremlin optimization, and seamless Python integration to tackle real‑world AI and big‑data challenges.

Big Data · Distributed Systems · Open-source

0 likes · 21 min read

MaGe Linux Operations

Dec 17, 2020 · Information Security

Mastering Apache Ranger: Secure Hadoop Data Access with Real‑World Examples

This guide explains Apache Ranger’s role as a centralized security framework for Hadoop, detailing its core features, architecture, workflow, policy creation, auditing, field‑level masking, row‑level filtering, and how to automate policy management via its REST API and Java code.

Apache Ranger · Big Data · Data access control

0 likes · 13 min read

Bitu Technology

Dec 16, 2020 · Big Data

Customizing Spark SQL with Macro‑Based Extensions for Column Exclusion and JSON Path Support

This article explains how Tubi customizes Spark SQL using lightweight macro‑based extensions to simplify column exclusion, JSON path queries, and other complex operations without modifying Spark's source code, detailing the two‑stage processing, example macros, and benefits for big‑data workloads.

Big Data · Custom SQL · Macros

0 likes · 9 min read

macrozheng

Dec 15, 2020 · Big Data

How Kafka Achieves Million‑TPS Through Sequential I/O, MMAP, and Zero‑Copy

Kafka can sustain millions of transactions per second by writing data sequentially to disk, leveraging memory‑mapped files, employing zero‑copy DMA transfers, and batching messages, each technique reducing I/O overhead and CPU involvement, which together enable its high‑throughput performance in big‑data pipelines.

Big Data · High Throughput · Kafka

0 likes · 11 min read

Youzan Coder

Dec 15, 2020 · Industry Insights

How Youzan Built a Full‑Scale Data Cost Billing System: From SDK to Multi‑Dimensional Analysis

This article details Youzan's end‑to‑end construction of a unified data‑center cost billing system, covering background goals, multi‑type cost support, SDK‑based information collection, cost quantification for offline, real‑time and platform tools, full‑business coverage, multi‑dimensional analysis models, operational rollout, and future plans.

Big Data · Data Platform · Industry Insights

0 likes · 19 min read

Programmer DD

Dec 10, 2020 · Artificial Intelligence

Discover Didi’s 40+ Open‑Source Projects in AI, Big Data & Cloud

DiDi’s open‑source portfolio, now exceeding 40 projects, spans AI runtimes, speech recognition, traffic analytics, middleware, big‑data loaders, monitoring tools, mobile frameworks, and frontend libraries, offering developers ready‑to‑use solutions for edge AI, intelligent transportation, data processing, and system reliability.

Artificial Intelligence · Big Data · Mobile Development

0 likes · 23 min read

Youzan Coder

Dec 9, 2020 · Big Data

Youzan Big Data Technology Salon: Practices in Data Cost Governance, Apache Iceberg, Flink, and Data-Driven Growth

The Youzan Big Data Technology Salon brought together Youzan, NetEase and Didi to share practical approaches for cutting data‑infrastructure costs, building an Apache Iceberg‑based data lake, scaling Flink real‑time workloads, and creating a data‑driven growth platform that leverages tracking, A/B testing and analytics.

Apache Iceberg · Big Data · Data Cost Governance

0 likes · 5 min read

DataFunTalk

Dec 8, 2020 · Artificial Intelligence

Financial Big Data Risk Control Models: Techniques, Applications, and COVID‑19 Challenges

This article presents a comprehensive overview of financial big‑data risk control models at Du Xiaoman, covering traditional scoring cards, AI‑driven time‑series and text processing, graph‑based networks, model interpretability, probability calibration, stability analysis, and the specific challenges introduced by the COVID‑19 pandemic.

Artificial Intelligence · Big Data · Credit Scoring

0 likes · 14 min read

Xianyu Technology

Dec 8, 2020 · Big Data

Supply-Demand Modeling and Category Optimization for the Idle Second-Hand Market

The article describes a supply‑demand modeling framework for the idle second‑hand market that extracts and structures product attributes, builds a decision‑tree‑based index from price, inventory, search‑hotspot and demand‑activation sub‑models, and uses the index to optimize category allocation, boost scarce supply, and drive overall growth.

Big Data · Market Analysis · Product Modeling

0 likes · 7 min read

Tencent Cloud Developer

Dec 7, 2020 · Big Data

Searchable Snapshots in Elasticsearch 7.10: Features, Usage, and Future Outlook

Elasticsearch 7.10 adds searchable snapshots, letting users query indices stored directly in remote repositories such as S3 or COS, which halves storage costs, decouples storage from compute, supports manual mounting and ILM cold‑phase policies, and promises future full storage‑compute separation without local caching.

Big Data · Data Tiering · Elasticsearch

0 likes · 12 min read

JavaEdge

Dec 5, 2020 · Big Data

How Kafka Chooses Its Partition Leaders: ZAB, Raft, and Controller Election Explained

This article explains the leader election mechanisms used in big‑data systems: ZAB in ZooKeeper and Raft’s role‑based election. It covers their drawbacks, such as split‑brain and ZooKeeper overload, and how Kafka’s controller‑based design addresses these issues with efficient partition leader selection.

Big Data · Kafka · Raft

0 likes · 7 min read

DataFunSummit

Dec 1, 2020 · Artificial Intelligence

Building an AI Ecosystem with Flink: AI Flow Architecture, Components, and Applications

This article explains how Flink enables end‑to‑end AI workflows through the AI Flow platform, covering the Lambda architecture background, AI task pipeline stages, the reasons for choosing Flink, AI Flow’s graph model, core services, integration with ML pipelines, and real‑world advertising recommendation use cases.

AI Flow · AI Pipeline · Big Data

0 likes · 12 min read

Huawei Cloud Developer Alliance

Dec 1, 2020 · Databases

Why Time Series Databases Are Crucial for IoT and Cloud Monitoring

This article explains the fundamentals, application scenarios, key requirements, and open‑source options for time series databases, highlighting how GaussDB(for Influx) addresses high‑performance writes, massive timelines, low storage cost, and elastic scaling for IoT and cloud monitoring workloads.

Big Data · GaussDB · InfluxDB

0 likes · 10 min read

DataFunTalk

Nov 30, 2020 · Fundamentals

DataFunTalk Annual Conference – Full Program and Speaker Details

The DataFunTalk year‑end conference will be held online on December 19‑20, featuring over 90 speakers across multiple forums covering recommendation algorithms, knowledge graphs, AI, big data, security, and product development, with detailed session schedules, speaker bios, and registration information.

AI · Big Data · Knowledge graph

0 likes · 76 min read

JD Tech Talk

Nov 30, 2020 · Big Data

Scalable Time Series Similarity Search in Big Data: Partitioning, Dimensionality Reduction, and LSH Approaches

This article examines the challenges of performing time‑series similarity queries on massive datasets and presents three scalable solutions—partition‑based indexing, dimensionality‑reduction using MinHash, and a combined approach with Locality Sensitive Hashing—to reduce computation while preserving similarity accuracy.

Big Data · LSH · Minhash

0 likes · 10 min read

ITFLY8 Architecture Home

Nov 28, 2020 · Fundamentals

What 19 Core Topics Every Software Architect Must Master

This article outlines a comprehensive knowledge framework for software architects, covering nineteen essential areas such as responsibilities, foundational concepts, internet system challenges, distributed caching, messaging, load balancing, performance testing, operating systems, algorithms, networking, database design, JVM internals, flash-sale systems, microservices, domain‑driven design, security, high‑availability, big data, and blockchain.

Big Data · Software Architecture · System Design

0 likes · 6 min read

dbaplus Community

Nov 28, 2020 · Operations

How a Chinese City Bank Integrated DevOps, AI, and Big Data to Transform Operations

This case study details how a city‑bank leveraged DevOps and ITIL integration, AI‑driven monitoring, and Spark‑based big‑data analytics to build a unified development‑testing‑operations platform, improve service availability, shorten deployment cycles, and achieve near‑99.99% system uptime across its core banking services.

AI · Big Data · DevOps

0 likes · 17 min read

Beike Product & Technology

Nov 27, 2020 · Artificial Intelligence

Mining User Housing Preference Schemes with Supply‑Filtered Tree‑Based Methods

The article proposes a supply‑filtered, tree‑based approach to discovering multi‑dimensional user housing preference schemes, contrasts it with fixed‑length preference mining methods, and details algorithmic modules such as split‑point search, similarity calculation, split suppression, and user clustering that improve interpretability and offline applicability.

AI · Big Data · housing recommendation

0 likes · 13 min read

Practical DevOps Architecture

Nov 27, 2020 · Big Data

Step-by-Step Guide to Install and Configure a Hadoop 2.8.2 Cluster

This tutorial provides a complete walkthrough for downloading Hadoop 2.8.2, setting up a three‑node master‑slave cluster, configuring core, HDFS, MapReduce and YARN settings, creating required directories, distributing the installation, starting the services, verifying the cluster status, and finally shutting it down.

Big Data · Cluster Setup · HDFS

0 likes · 5 min read

dbaplus Community

Nov 26, 2020 · Big Data

Silicon Valley's Data Middle Platform Secrets: EA, Twitter, Airbnb, Uber

This article examines how leading Silicon Valley companies such as EA, Twitter, Airbnb, and Uber design and operate data middle platforms—detailing their architectures, data collection pipelines, standardization efforts, real‑time and batch processing, and the business impact of shared data capabilities.

Big Data · Data Architecture · Data Platform

0 likes · 25 min read

DataFunTalk

Nov 26, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology

This article details the evolution of 58.com’s commercial data warehouse across three phases—1.0, 2.0, and 3.0—covering its scale, four‑layer architecture, migration from legacy Hadoop‑MapReduce pipelines to Flume/Kafka and Flink streaming, code optimizations, monitoring, and productization for real‑time business insights.

Big Data · ETL · Hadoop

0 likes · 9 min read

Big Data Technology Architecture

Nov 25, 2020 · Big Data

Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

This article explains the concept and benefits of data lakes, outlines the storage and acceleration challenges they pose, presents an ideal checklist for selecting a data lake solution, and evaluates Alibaba Cloud's JindoFS against that checklist, highlighting its capabilities for big‑data and AI workloads.

Alibaba Cloud · Big Data · Data Lake

0 likes · 9 min read

dbaplus Community

Nov 24, 2020 · Databases

How ClickHouse Enables Millisecond‑Scale User Profiling for Hundreds of Millions

This article explains how Suning built a high‑performance user‑tag platform on ClickHouse, replacing Elasticsearch with bitmap‑based storage and a new architecture that delivers sub‑second profiling queries for over 600 million users, detailing the design, implementation, and future enhancements.

Big Data · ClickHouse · OLAP

0 likes · 14 min read

DataFunTalk

Nov 24, 2020 · Artificial Intelligence

Building Next‑Generation Data Intelligence Infrastructure with Knowledge Graphs: From New Infrastructure to Cognitive AI Platforms

This presentation explains how knowledge graphs serve as the foundation for new‑infrastructure initiatives, detailing the evolution of AI from perception to cognition, the role of big‑data centers, DIKW modeling, intelligent data governance, and the construction of a cognitive AI middle‑platform for industry applications.

AI infrastructure · Artificial Intelligence · Big Data

0 likes · 18 min read

Big Data Technology Architecture

Nov 24, 2020 · Big Data

Using DeltaLake for Industrial Data Platforms: Distributed Stream Processing, Batch‑Stream Fusion, and Transactional Support

This article shares practical experiences of building an industrial data middle‑platform with DeltaLake, covering heterogeneous distributed stream handling, batch‑stream unified analytics, and transactional/algorithm support to improve data timeliness, reliability, and operational efficiency in manufacturing environments.

Batch-Stream Fusion · Big Data · DeltaLake

0 likes · 11 min read

Alibaba Cloud Developer

Nov 23, 2020 · Big Data

How Alibaba’s CCO Built a Cloud‑Native Real‑Time Data Warehouse with Hologres

Alibaba’s Customer Experience (CCO) team transformed its real‑time data platform by evolving from a Lambda‑style database architecture to a cloud‑native real‑time data warehouse powered by Hologres and Flink, achieving higher throughput, lower latency, reduced costs, and self‑service analytics for massive Double‑11 traffic.

Alibaba · Big Data · Flink

0 likes · 15 min read

Alibaba Cloud Developer

Nov 22, 2020 · Big Data

How Flink’s Stream‑Batch Integration Powered Alibaba’s Record‑Breaking Double‑11

Alibaba’s 2020 Double‑11 achieved unprecedented real‑time processing of 4 billion records per second and 7 TB of data per second using Flink, showcasing the stability, performance and efficiency of its stream‑batch unified architecture across diverse business scenarios.

Alibaba · Batch Processing · Big Data

0 likes · 15 min read

Big Data Technology & Architecture

Nov 21, 2020 · Big Data

Big Data Performance Testing: Objectives, Timing, Steps, Tools, and Optimization

This article outlines the purpose, timing, procedures, tools, and optimization techniques for big data performance testing, providing detailed guidance on test planning, execution, metric collection, and analysis to ensure reliable and efficient big data system deployments.

Big Data · Hadoop · Spark

0 likes · 7 min read

Alibaba Cloud Developer

Nov 19, 2020 · Databases

How AnalyticDB Powers Double 11: Cloud‑Native Data Warehouse Innovations

AnalyticDB, a cloud‑native MySQL‑compatible data warehouse, delivered extreme performance during Double 11 by handling billions of orders with ultra‑high write TPS, while introducing compute‑storage separation, hot‑cold tiering, resource groups, elastic scaling and intelligent optimization to meet demanding real‑time analytics workloads.

AnalyticDB · Big Data · Resource Groups

0 likes · 17 min read

Java Architect Essentials

Nov 19, 2020 · Artificial Intelligence

Overview of Didi’s Open‑Source Projects Across AI, Big Data, Operations, Mobile and Frontend

This article presents a comprehensive catalog of more than 40 open‑source projects released by Didi, covering AI runtimes, speech and NLP engines, big‑data loaders, middleware, mobile frameworks, frontend UI libraries and various operational tools, each with a brief description and a GitHub link.

AI · Big Data · Didi

0 likes · 18 min read

Meituan Technology Team

Nov 19, 2020 · Big Data

Optimizing Apache Kylin for High‑Performance OLAP in Meituan's Sales System

Meituan’s sales system “Qingtian” boosted OLAP performance by migrating Apache Kylin’s build engine from MapReduce to Spark, consolidating Hive files, refining dictionary creation, applying a By‑layer algorithm, and bulk‑loading cuboid files to HBase, cutting resource consumption and halving build time, ultimately reaching a 100% SLA.

Apache Kylin · Big Data · Meituan

0 likes · 15 min read

Tencent Tech

Nov 19, 2020 · Cloud Computing

How Tencent Built a Massive Cloud Storage System to Power QQ Album and Beyond

This article chronicles Tencent's journey from the early development of the TFS distributed storage platform to large‑scale data migrations, flexible bandwidth strategies, and the creation of the cloud‑native YottaStore, illustrating how a small architecture team solved massive storage challenges for billions of users.

Big Data · Cloud Storage · Data Migration

0 likes · 15 min read

DeWu Technology

Nov 19, 2020 · Operations

HBase Operations and Use Cases for High‑Concurrency E‑commerce

In this talk, Yun Jin explains how HBase’s petabyte‑scale, horizontally scalable architecture (built on Hadoop, HMaster, RegionServers, and ZooKeeper) enables e‑commerce platforms to handle extreme promotion‑day traffic by supporting high‑throughput reads and writes, time‑series monitoring, massive order storage, and robust HA, while covering essential table operations, monitoring, and troubleshooting techniques.

Big Data · HBase · Monitoring

0 likes · 6 min read

JD Retail Technology

Nov 19, 2020 · Big Data

Building JD's Enterprise-wide Big Data Platform: Architecture, Stages, and Challenges

This article summarizes Bao Yongjun’s presentation on JD.com’s end‑to‑end big data platform, covering its strategic value, industry trends, architectural design, development phases from scale‑out to intelligent real‑time processing, and future directions for a cloud‑native, AI‑driven data ecosystem.

Big Data · Data Governance · JD.com

0 likes · 16 min read

Java High-Performance Architecture

Nov 18, 2020 · Big Data

Why Pulsar Might Outperform Kafka: Key Advantages and Drawbacks

This article examines Apache Pulsar, an open‑source messaging platform created by Yahoo, compares it with Kafka by outlining Kafka’s common pain points, highlights Pulsar’s multi‑tenant architecture, layered storage, built‑in functions, and security features, and discusses the trade‑offs of each solution.

Apache Pulsar · Big Data · Distributed Systems

0 likes · 6 min read

JD Tech Talk

Nov 17, 2020 · Databases

JUST Engine: Novel Spatio‑Temporal Indexes and Data Models for Large‑Scale Urban Data Management

The article introduces the JUST engine, a spatio‑temporal data platform that extends GeoMesa with three new indexes (Z2T, XZ2T, time_range), defines nine common and three specialized data models, provides default indexing strategies, and offers detailed SQL usage guidelines for efficient querying of massive urban datasets.

Big Data · GeoMesa · JUST engine

0 likes · 25 min read

Big Data Technology & Architecture

Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data · Hadoop · Spark

0 likes · 13 min read

Alibaba Cloud Native

Nov 16, 2020 · Cloud Native

What’s New in Fluid 0.4? DataLoad, Small‑File Boost, HDFS Support & Multi‑Dataset Deployment

Fluid 0.4 introduces a DataLoad custom resource for declarative data pre‑warming, enhances support for massive small‑file datasets, adds HDFS‑compatible access for Spark and other big‑data frameworks, and enables mixed‑deployment of multiple datasets on a single node, all backed by significant performance gains.

AI · Alluxio · Big Data

0 likes · 8 min read

DataFunSummit

Nov 15, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Using Hadoop, Flume, Kafka, Spark, and Flink

This article details the three‑stage evolution of 58.com’s commercial data warehouse, describing its massive scale, four‑layer architecture, technical challenges, migrations from MapReduce to Hive and Flink, real‑time streaming upgrades, and the resulting improvements in stability, accuracy, and timeliness.

Big Data · Data Architecture · Flink

0 likes · 10 min read
