Tagged articles
3675 articles
macrozheng
Dec 20, 2019 · Big Data

How to Supercharge Elasticsearch for Billion‑Row Queries: Practical Optimization Guide

This article explains the architecture of Elasticsearch and Lucene, outlines common performance bottlenecks, and provides concrete indexing and search optimization techniques—including bulk writes, shard routing, doc values tuning, and pagination strategies—to achieve sub‑second query responses on billions of records.

Big Data · Elasticsearch · Performance Tuning
0 likes · 14 min read
Qunar Tech Salon
Dec 20, 2019 · Big Data

Understanding Flink Cluster Startup and Job Execution Process

This article explains the architecture of a Flink cluster, detailing the startup procedures for JobManager and TaskManager, the three deployment modes, and the end‑to‑end flow of a Flink job from client code through StreamGraph, JobGraph, ExecutionGraph to the physical execution on TaskManagers.

Big Data · Cluster Architecture · Flink
0 likes · 10 min read
Big Data Technology & Architecture
Dec 19, 2019 · Big Data

Apache Kafka 2.4.0 Release: New Features and Improvements

Apache Kafka 2.4.0 introduces a range of new capabilities—including consumer replica fetching, incremental cooperative rebalancing, MirrorMaker 2.0, a new Java authorization API, KTable non‑key joins, administrative replica reassignment, protected REST endpoints, and offset deletion—along with numerous performance and stability improvements.

Apache Kafka · Big Data · Distributed Systems
0 likes · 3 min read
vivo Internet Technology
Dec 18, 2019 · Big Data

Comprehensive Overview of Big Data Architecture, Lambda/Kappa Models, and End-to-End Data Platform Design

The article surveys modern big‑data architecture, contrasting Lambda and Kappa models, highlights common governance and integration pain points, and proposes an end‑to‑end platform featuring unified metadata, stream‑batch processing, one‑click ingestion, standardized modeling, intelligent query abstraction, and a comprehensive development IDE.

Big Data · Data Platform · ETL
0 likes · 13 min read
Big Data Technology & Architecture
Dec 17, 2019 · Big Data

Understanding Flink Sliding Windows and Performance Optimizations

This article explains Flink's sliding window mechanism, shows how the WindowAssigner and WindowOperator work with code examples, analyzes the performance impact of fine‑grained sliding windows, and proposes a practical workaround using tumbling windows combined with external storage such as Redis for efficient PV/UV aggregation.

Big Data · Flink · Performance Optimization
0 likes · 8 min read
DataFunTalk
Dec 13, 2019 · Databases

Lindorm: High‑Performance Distributed NoSQL Database for Big Data

Lindorm, an Alibaba‑derived distributed NoSQL database built on HBase, delivers multi‑model hybrid storage, five‑fold throughput gains, sub‑millisecond latency, advanced indexing, cloud‑native elasticity, strong/adjustable consistency, and comprehensive security and multi‑tenant features for massive data workloads.

Big Data · NoSQL · Performance Optimization
0 likes · 25 min read
HomeTech
Dec 12, 2019 · Big Data

Architecture and Design of the Home Data Integration Governance Platform

The article describes the background, architecture, and design principles of a unified big‑data scheduling and data‑exchange platform, detailing its data ingestion “direct‑train”, centralized scheduling engine, and DataX‑based data‑exchange components along with monitoring, alerting, and security features.

Big Data · Data Integration · DataX
0 likes · 7 min read
Programmer DD
Dec 11, 2019 · Big Data

Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Big Data · Data Architecture · Data-driven
0 likes · 12 min read
dbaplus Community
Dec 10, 2019 · Backend Development

How to Optimize Elasticsearch for Billions of Records: Practical Tuning Guide

An in‑depth guide walks through Elasticsearch’s underlying Lucene architecture, explains shard routing and DocValues, then presents concrete index‑ and search‑performance tweaks—bulk writes, refresh intervals, memory allocation, SSD usage, field mapping, pagination strategies—and shows benchmark results that reduce query latency to seconds for billions of records.

Big Data · Elasticsearch · Index Optimization
0 likes · 13 min read
21CTO
Dec 9, 2019 · Big Data

China’s Big Data Crackdown: Legal Risks Every Developer Should Know

The article examines the sweeping regulatory crackdown on China’s big‑data and financial‑risk companies, detailing the dissolution of major crawler firms, new legal restrictions on data collection, and practical guidance on what data‑scraping activities are illegal and how to protect personal information.

Big Data · Legal Compliance · Web Crawling
0 likes · 11 min read
Big Data Technology & Architecture
Dec 9, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink: Kafka to HDFS with Exactly‑Once Guarantees

This article explains how to develop a real‑time ETL application using Apache Flink that reads events from Kafka, partitions them by event time into HDFS directories, and achieves exactly‑once processing through checkpointing, custom bucket assigners, and proper state backend configuration.

Apache Flink · Big Data · Exactly-Once
0 likes · 11 min read
Architecture Digest
Dec 8, 2019 · Big Data

Technical Feasibility of a Nationwide WeChat Group with 1.4 Billion Users

The article analyses whether it is technically possible to place all 1.4 billion Chinese users into a single WeChat group, examining population data, message volume, CPU and network requirements, hardware costs, physical space, and human visual limits to assess scalability and practicality.

Big Data · Network Bandwidth · Server Architecture
0 likes · 11 min read
Big Data Technology & Architecture
Dec 4, 2019 · Big Data

Comprehensive Flink Interview Guide: Core Concepts, Advanced Topics, and Source‑Code Insights

This article provides an in‑depth Flink interview guide covering the framework’s core concepts, advanced features such as fault‑tolerance, state management, and checkpointing, as well as detailed explanations of its architecture, APIs, partitioning strategies, and source‑code flow, complete with code examples.

Big Data · Distributed Systems · Flink
0 likes · 29 min read
Yanxuan Tech Team
Dec 2, 2019 · Big Data

Why Modern Enterprises Need a Data Middle Platform: Lessons from NetEase Yanxuan

Drawing on NetEase Yanxuan’s experience, this article explains what a data middle platform is, why companies are building one for digital transformation and fine‑grained operations, and details its core components—including the data warehouse, data services, and BI platform—illustrated with real‑world diagrams.

BI · Big Data · Data Middle Platform
0 likes · 12 min read
Big Data Technology & Architecture
Dec 1, 2019 · Big Data

Understanding Flink LatencyMarker: End-to-End Delay Measurement and Implementation Details

This article explains the background, source‑code analysis, and practical implementation of Flink's LatencyMarker feature for measuring end‑to‑end job latency, including metric exposure, configuration options, and code snippets illustrating how latency markers are emitted and processed within the streaming pipeline.

Big Data · End-to-End Latency · Flink
0 likes · 6 min read
58 Tech
Nov 29, 2019 · Big Data

Application of Big Data and Algorithms in the Real‑Estate Internet

The talk presented at the Shanghai Computer Society Annual Meeting details how big data and algorithms are leveraged in the real‑estate internet sector to enhance user personalization, improve agent matching, and assess video quality, illustrating practical implementations and performance gains across data collection, modeling, and recommendation pipelines.

AI · Big Data · Real Estate
0 likes · 10 min read
Mafengwo Technology
Nov 28, 2019 · Big Data

Why NiFi Beats Flink: Practical Data Flow for Recommendation Engines

This article explains why the team prefers Apache NiFi over Flink or Storm for data‑flow handling in information‑stream recommendation systems, outlines NiFi’s core components, features, cluster setup, custom processor development, and real‑world use cases such as HDFS, Elasticsearch, and RocketMQ integrations.

Big Data · NiFi · Processor Development
0 likes · 17 min read
58 Tech
Nov 27, 2019 · Information Security

Evolution and Architecture of a Big Data‑Driven Security Portrait System at 58.com

The article details the design, multi‑stage evolution, and operational impact of a big‑data‑based security portrait platform built by 58.com, describing its data pipelines, real‑time risk tagging, strategy scheduling, configuration management, and overall architecture that enable large‑scale threat detection and mitigation.

Big Data · Risk management · security
0 likes · 15 min read
Big Data Technology & Architecture
Nov 26, 2019 · Big Data

Understanding Flink SQL Window Functions: Types, Implementation, and Emit Triggers

This article provides a comprehensive overview of Flink SQL window functions, detailing time‑based window types, their underlying implementation in the StreamExecGroupWindowAggregate operator, the processing flow of WindowOperator, timer handling, emit/trigger strategies, and practical code examples for Tumble, Hop, and Session windows.

Big Data · Emit · Flink
0 likes · 20 min read
Architecture Digest
Nov 25, 2019 · Big Data

Introduction to Apache Kafka: Core Concepts, Architecture, and APIs

This article provides a comprehensive overview of Apache Kafka, covering its fundamental capabilities, typical use cases, core components, key APIs, and essential concepts such as topics, partitions, segments, brokers, producers, and consumers, illustrated with diagrams.

APIs · Big Data · Distributed Systems
0 likes · 8 min read
Big Data Technology & Architecture
Nov 24, 2019 · Big Data

Common Apache Kafka Exceptions and Their Causes

This article lists frequent Apache Kafka exceptions such as UnknownTopicOrPartitionException, LEADER_NOT_AVAILABLE, NotLeaderForPartitionException, TimeoutException, RecordTooLargeException, and others, explaining each error message, typical reasons, and practical troubleshooting steps for producers and consumers.

Big Data · Consumer · Error Handling
0 likes · 5 min read
Architecture Digest
Nov 22, 2019 · Big Data

Elasticsearch Optimization Practices for Large‑Scale Data Platforms

This article presents a comprehensive guide to optimizing Elasticsearch for massive data volumes, covering Lucene fundamentals, index and shard design, practical performance‑tuning techniques, and real‑world testing results that enable cross‑month queries and sub‑second response times.

Big Data · Elasticsearch · Index Optimization
0 likes · 14 min read
Meituan Technology Team
Nov 21, 2019 · Big Data

Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Big Data · Jupyter · Kubernetes
0 likes · 19 min read
DataFunTalk
Nov 21, 2019 · Big Data

Evolution of 58.com Real-Time Computing Platform and the One-Stop Streaming Data Processing System Wstream

The article details the technical evolution of 58.com’s real-time computing platform—from Storm and Spark Streaming to a Flink‑based one‑stop solution called Wstream—covering use cases, architecture, stability measures, migration from Storm, operational diagnostics, and future development plans.

Big Data · Flink · Real-time Streaming
0 likes · 11 min read
Xianyu Technology
Nov 21, 2019 · Big Data

Event-Driven Rule Engine for User Growth at Xianyu

To accelerate growth on Xianyu’s 20 million‑DAU platform, the team built an event‑driven rule engine with a SQL‑like DSL that translates user‑behavior streams into real‑time Flink/Blink queries, cutting rule development from four days to half a day and achieving sub‑5‑second processing latency.

Big Data · DSL · Event Stream
0 likes · 9 min read
JD Retail Technology
Nov 19, 2019 · Industry Insights

How JD.com Is Building an Open, Integrated Tech Ecosystem Across Retail, Logistics, and Cloud

JD.com's 2019 JDDiscovery conference revealed a comprehensive, cloud‑native technology landscape that spans AI, big data, IoT, and blockchain, detailing how the company has transformed its integrated retail, logistics, and finance systems into modular, open‑service solutions for external partners.

Artificial Intelligence · Big Data · Cloud Computing
0 likes · 9 min read
Big Data Technology & Architecture
Nov 18, 2019 · Big Data

Understanding JVM Garbage Collection and Flink Memory Management

This article explains the fundamentals of JVM garbage collection, its generational algorithms and associated performance issues, and then details Apache Flink's memory management architecture, including MemorySegment, off‑heap buffers, serialization mechanisms, and type information for efficient big‑data processing.

Big Data · Flink · Garbage Collection
0 likes · 7 min read
Big Data Technology & Architecture
Nov 14, 2019 · Big Data

Comparison of Flink and Spark Structured Streaming: Joins, State Management, Fault Tolerance, and Backpressure

This article compares Flink and Spark Structured Streaming, detailing their differences in join capabilities, state management, fault‑tolerance mechanisms, exactly‑once semantics, back‑pressure handling, and table registration, while providing code examples and practical insights for real‑time big‑data processing.

Big Data · Flink · JOIN
0 likes · 13 min read
Tencent Cloud Developer
Nov 14, 2019 · Big Data

Tencent Announces Open‑Source High‑Performance Graph Computing Framework Plato

Tencent has open‑sourced its high‑performance graph computing framework Plato, which can process billion‑node graphs in minutes on as few as ten servers, outpacing Spark GraphX by up to two orders of magnitude, and supports offline computation, representation learning, and integration with Kubernetes/YARN for social, recommendation, and biomedical applications.

Big Data · Distributed Systems · Open-source
0 likes · 7 min read
Big Data Technology & Architecture
Nov 13, 2019 · Databases

ClickHouse Engines: Use Cases, Syntax, and Limitations

This article provides a comprehensive overview of ClickHouse, covering its typical application scenarios, inherent limitations, common SQL syntax, default values, data types, materialized and expression columns, and detailed explanations of its various storage engines such as TinyLog, Log, Memory, Merge, Distributed, Null, Buffer, Set, MergeTree, ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, and CollapsingMergeTree, accompanied by practical code examples.

Big Data · ClickHouse · Database Engines
0 likes · 25 min read
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Materialized Columns · Shuffle Optimization
0 likes · 20 min read
DevOps
Nov 11, 2019 · Operations

Capital One DevOps Transformation: Data‑Driven Innovation, Cloud Migration, and AI‑Enabled Services

This case study details Capital One’s evolution from a regional credit‑card unit to a data‑centric financial giant, highlighting its vision, data‑driven product strategy, big‑data analytics, AI‑powered customer service, cloud migration to AWS, and the DevOpsSec practices that enabled rapid, secure, and scalable innovation across banking, automotive finance, and digital services.

Big Data · DevOps · Fintech
0 likes · 19 min read
Big Data Technology & Architecture
Nov 9, 2019 · Big Data

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data · Flink · Mini-Batch
0 likes · 17 min read
JD Retail Technology
Nov 7, 2019 · Industry Insights

How JD’s Advertising Architecture Scaled for 11.11: Lessons in Cost‑Cutting and Performance

The article details how JD’s advertising division tackled the massive traffic surge of the 11.11 shopping festival by expanding shard capacity, optimizing models and data pipelines, migrating workloads to the cloud, and implementing cost‑saving measures that together ensured stable, high‑performance ad delivery.

Advertising · Big Data · Performance Optimization
0 likes · 7 min read
DataFunTalk
Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka
0 likes · 14 min read
Xianyu Technology
Nov 7, 2019 · Big Data

Sequence Pattern Mining for User Behavior Analysis in Xianyu

By applying sequence pattern mining and unsupervised clustering to Xianyu’s massive event logs, the study abstracts high‑level user behaviors, discovers frequent subsequences, uncovers unknown fraudulent account patterns, expands known fraud cohorts with 99 % precision, and enables richer analyses such as PCA‑based cross‑group comparisons.

Big Data · clustering · data mining
0 likes · 8 min read
360 Zhihui Cloud Developer
Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

Big Data · Operations · aiops
0 likes · 15 min read
Architecture Digest
Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi
0 likes · 7 min read
Efficient Ops
Nov 3, 2019 · Operations

How Beijing Mobile Achieved Tier‑3 DevOps Maturity: A Deep Dive into Continuous Delivery

This article details Beijing Mobile's successful Tier‑3 DevOps standard assessment, showcasing their micro‑service, container‑based performance management system, the role of standards and tooling in boosting efficiency, and insights from a Q&A with senior engineers on implementation challenges and future DevOps prospects.

AI · Big Data · Containerization
0 likes · 11 min read
Efficient Ops
Nov 3, 2019 · Operations

How Zhejiang Mobile Is Pioneering AIOps to Reach NoOps

Zhejiang Mobile’s IT department chronicles its journey from a 2015 cloud‑native initiative to a cutting‑edge AIOps transformation, detailing a six‑level NoOps roadmap, digital fault‑governance, middle‑platform consolidation, organizational agility, and measurable operational gains that position it as a telecom industry leader.

Artificial Intelligence · Big Data · Digital Transformation
0 likes · 7 min read
Big Data Technology & Architecture
Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle
0 likes · 7 min read
Big Data Technology & Architecture
Nov 2, 2019 · Big Data

Evolution of Elasticsearch Cluster Architecture for JD Daojia Order Center

This article details how JD Daojia's order center migrated its Elasticsearch cluster through multiple architectural stages—from an initial loosely configured setup to a real‑time dual‑cluster solution—addressing scalability, high availability, data synchronization, and performance optimization for billions of documents and hundreds of millions of daily queries.

Big Data · Cluster Architecture · Elasticsearch
0 likes · 12 min read
Big Data Technology & Architecture
Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic, a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Kafka · Real-Time
0 likes · 8 min read
Alibaba Cloud Developer
Oct 30, 2019 · Big Data

How Real-Time Big Data Pipelines Detect E‑Commerce Ad Misplacements

This article explains how a large‑scale e‑commerce search advertising system uses real‑time big‑data pipelines, log synchronization, NoSQL storage, and proactive verification to automatically discover and correct ad placement errors across the entire data processing chain, protecting both advertisers and the platform.

Big Data · ad verification · data pipeline
0 likes · 13 min read
Big Data Technology & Architecture
Oct 28, 2019 · Big Data

Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing

This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.

Big Data · Data Architecture · HBase
0 likes · 7 min read
DataFunTalk
Oct 25, 2019 · Big Data

Migrating Data from HBase to Kafka Using MapReduce

This article explains how to reverse the typical data flow by extracting massive Rowkeys from HBase with MapReduce, storing them on HDFS, and then using batch Get operations to retrieve the full records and write them into Kafka, while handling retries and monitoring progress.

Big Data · Data Migration · HBase
0 likes · 9 min read
dbaplus Community
Oct 22, 2019 · Big Data

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

This article details how Weibo’s advertising team designed and implemented a real‑time data platform capable of processing over a hundred billion daily logs, covering technology selection, Flink advantages, architecture evolution, data processing pipelines, component libraries, fault‑tolerance strategies, and the construction of a multi‑layer real‑time data warehouse.

Big Data · Checkpoint · Data Architecture
0 likes · 25 min read
Big Data Technology & Architecture
Oct 22, 2019 · Big Data

Real-Time Data Verification: Building a Log Comparison Solution with Flink, Elasticsearch, and Hive

This article explains how to design and implement a real‑time data verification framework using Flink to generate wide tables, storing detailed records in Elasticsearch or HDFS with Hive for cross‑checking against offline data, ensuring trustworthy metrics for dashboards and stakeholders.

Big Data · Data verification · Elasticsearch
0 likes · 7 min read
58 Tech
Oct 21, 2019 · Big Data

Improving Information Exposure Measurement: Visible Ad Metrics and Data Processing Practices at 58 Platform

To address inaccuracies in traditional information exposure metrics, this article proposes adopting advertising visibility standards—defining visible exposure by pixel and time thresholds, implementing client-side logging, unique TID tracking, and ETL pipelines—to provide more reliable data for product strategy and user behavior analysis.

Big Data · Data Quality · ad visibility
0 likes · 8 min read
dbaplus Community
Oct 20, 2019 · Big Data

Mastering Kafka: Concepts, Installation, Optimization, and Security

This comprehensive guide covers Kafka's core concepts, design principles, installation steps, configuration tweaks, performance optimizations, permission management, common operational commands, cluster scaling, log retention settings, and monitoring scripts to help you build and maintain a robust Kafka ecosystem.

Big Data · Installation · Kafka
0 likes · 20 min read
Architects' Tech Alliance
Oct 17, 2019 · Big Data

Understanding Alibaba's Data Middle Platform: Concepts, Architecture, and Differences from Data Warehouses and Data Lakes

The article explains Alibaba's data middle platform—its definition, methodology, organizational structure, key tools, and how it differs from traditional data warehouses and data lakes—while highlighting its role in supporting scalable, business‑centric data services and digital transformation.

Alibaba · Big Data · Data Architecture
0 likes · 16 min read
Big Data Technology & Architecture
Oct 17, 2019 · Big Data

Delta Lake: Architecture, Features, and Hands‑On Tutorial

This article explains the origins and motivations of Delta Lake, details its ACID transaction support, schema enforcement, metadata handling, versioning, and unified batch‑and‑stream processing, and provides a step‑by‑step Maven and Spark code tutorial for creating, updating, and querying Delta tables.

ACID · Apache Spark · Big Data
0 likes · 10 min read
Meituan Technology Team
Oct 17, 2019 · Big Data

OneData Methodology: Building a Unified Data Warehouse Architecture and Governance Framework

By adapting Alibaba’s OneData methodology, the project establishes a unified data‑warehouse architecture, standards, and governance framework—including consolidated business intake, standardized design layers, naming conventions, and delivery metrics—that resolves data‑quality issues, enhances scalability and reusability, and delivers faster, reliable data support for evolving business needs.

Big Data · Data Architecture · Data Governance
0 likes · 15 min read
Youku Technology
Oct 16, 2019 · Artificial Intelligence

Building an Entertainment Content Cognition Brain: AI and Big Data for the Full Content Lifecycle

The talk outlines how Alibaba’s Entertainment Brain leverages AI, big-data analytics, and psychological modeling to map content attributes and user emotions across the entire production-to-distribution lifecycle, enabling data-driven talent selection, script evaluation, real-time feedback, and predictive traffic forecasting for hit-making.

AI · Big Data · Content Analytics
0 likes · 11 min read
Efficient Ops
Oct 14, 2019 · Operations

How AIOps Transforms IT Operations: Real-World Architecture and Lessons

This article shares a practical case study of implementing AIOps in an online‑education company, covering the background pain points of massive monitoring data, the designed architecture with real‑time processing and machine‑learning pipelines, and the challenges and opportunities of intelligent operations.

Big Data · IT Operations · AIOps
0 likes · 14 min read
JD Retail Technology
Oct 14, 2019 · Databases

Overview of JDNoSQL Platform and Its Real-Time Advertising Use Cases

The article introduces JDNoSQL, a distributed column‑oriented key‑value store built on HDFS, outlines its core features, describes various business scenarios including real‑time ad computation, details the system architecture with Kafka and Flink, and presents table designs for ad impression and click statistics.

Big Data · Flink · Kafka
0 likes · 13 min read
Big Data Technology & Architecture
Oct 14, 2019 · Big Data

Optimizing Spark PageRank: Cache, Checkpoint, Data Skew, and Resource Utilization

This article presents a comprehensive analysis of Spark PageRank performance, detailing the algorithm's basics, the original example code, and four key optimizations—caching with checkpointing, memory‑efficient data structures, handling data skew, and maximizing executor and driver resource usage—backed by experimental results and practical recommendations.

Big Data · Cache · Checkpoint
0 likes · 18 min read
58 Tech
Oct 10, 2019 · Big Data

Optimizing Real‑Time Feature Extraction at 58.com: Migrating from Spark Streaming to Flink

This article describes how 58.com’s commercial engineering team redesigned its real‑time feature‑mining pipeline—replacing a minute‑level Spark Streaming framework with Flink—to achieve sub‑second latency, higher throughput, stronger fault‑tolerance, and end‑to‑end exactly‑once semantics for user‑profile generation in the second‑hand‑car recommendation scenario.

Big Data · Exactly-Once · Flink
0 likes · 14 min read
Sohu Tech Products
Oct 9, 2019 · Databases

HBase Table Design Strategies: Data Model, Column Descriptors, RowKey, Region and Performance Optimization

This article explains HBase’s data model and provides comprehensive table‑design strategies—including column‑descriptor options, row‑key best practices, tall‑vs‑wide table trade‑offs, region splitting and pre‑splitting techniques—to help achieve optimal performance and scalability in large‑scale NoSQL workloads.

Big Data · Column Family · HBase
0 likes · 16 min read
Big Data Technology & Architecture
Oct 9, 2019 · Big Data

Choosing and Using Flink State Backends: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend

This article explains how Flink checkpoints persist state, compares the three built‑in state backends (MemoryStateBackend, FsStateBackend, RocksDBStateBackend), discusses their configurations, advantages, limitations, and provides guidance on selecting the appropriate backend for different big‑data streaming scenarios.

Big Data · Checkpoint · Flink
0 likes · 10 min read
Alibaba Cloud Infrastructure
Oct 9, 2019 · Cloud Computing

The Next Decade of Cloud Networking: Highlights from Alibaba Cloud Network Forum at Yunqi Conference 2019

The 2019 Yunqi Conference Cloud Network Forum gathered over two hundred network enthusiasts to review a decade of Alibaba data‑center networking evolution, explore emerging technologies such as AI, big data, and programmable chips, and outline the next ten years of high‑performance, data‑centric cloud networking.

Big Data · High‑Performance Networking · Network Architecture
0 likes · 9 min read
dbaplus Community
Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · Cluster Management · HBase
0 likes · 17 min read
Architects' Tech Alliance
Oct 7, 2019 · Industry Insights

How Google’s Vision Drove the PC Web, Big Data, and Cloud Revolutions

The article traces Google’s decade‑long impact on the evolution of the PC Web era, its pioneering technologies in search, email, infrastructure, big data, cloud computing, and mobile, explaining how its philosophy both propelled and missed commercial opportunities across each wave of internet innovation.

Big Data · Cloud Computing · Google
0 likes · 11 min read
Big Data Technology & Architecture
Oct 3, 2019 · Big Data

Data Development Interview Tips and Career Guidance

This article offers practical advice for data development job interviews, explaining why Java is essential, comparing Java and Python, outlining required backend framework knowledge, discussing the role of SQL and data warehousing, and addressing work‑life concerns such as overtime and company size choices.

Big Data · Python · Career Advice
0 likes · 4 min read
Programmer DD
Sep 29, 2019 · Big Data

Can 1.4 Billion People Share a Single WeChat Group? A Technical Deep‑Dive

This article explores whether it is technically feasible to place all 1.4 billion Chinese users into one WeChat group, analyzing population statistics, message volume, CPU processing limits, network bandwidth, storage requirements, and cost implications with supporting calculations and references.

Big Data · Distributed Systems · Network Bandwidth
0 likes · 12 min read
Xueersi Online School Tech Team
Sep 27, 2019 · Big Data

Design Principles and Architecture of Apache Kylin for Sub‑Second OLAP Queries

This article explains how Apache Kylin, an open‑source distributed analytics engine built on Hadoop/Spark, achieves sub‑second OLAP query performance through pre‑computed cubes, a layered cuboid generation algorithm, bitmap‑based distinct counting, dimension optimization techniques, and tight integration with HBase for storage and fast SQL querying.

Apache Kylin · Big Data · Cube
0 likes · 15 min read
Meituan Technology Team
Sep 26, 2019 · Big Data

Big Data Technology: Commercial Applications and Practice – A Collaborative Course between Meituan and Tsinghua University

Meituan’s big‑data team and Tsinghua’s Electronic Engineering Department have launched a master‑level, credit‑bearing course that blends theory with 24 hours of hands‑on training, showcases Meituan’s real‑world data infrastructure and applications, and aims to create a recurring bridge between academia and industry while recruiting top talent.

Big Data · Commercial Application · Data Analytics
0 likes · 6 min read
dbaplus Community
Sep 24, 2019 · Big Data

How Weibo Turns Big Data into Revenue: Insights from a 2019 DAMS Talk

The presentation explains how Weibo leverages big‑data technologies, user profiling, and social‑first advertising models to drive commercial growth, detailing data‑driven product development, real‑time and offline data warehouses, scientific experiments, and case studies that illustrate the impact on revenue and user engagement.

Advertising · Big Data · Growth Hacking
0 likes · 24 min read
Alibaba Cloud Developer
Sep 24, 2019 · Artificial Intelligence

How Semi‑Supervised Deep Learning Detects Road Closures in Real‑Time

Gaode’s engineering team presents a semi‑supervised deep‑learning framework that models road networks, extracts traffic, routing, deviation and heatmap features, and combines LSTM with ResNet to accurately identify dynamic road‑closure events, enabling both offline and real‑time detection with high confidence and business‑aligned validation.

Big Data · LSTM · ResNet · Semi-supervised Learning
0 likes · 12 min read