Tagged articles
Top Architect
May 4, 2021 · Big Data

Overview of CDC Tools: Canal, Maxwell, Databus, and Alibaba DTS

This article introduces four change‑data‑capture solutions—Canal, Maxwell, Databus, and Alibaba Data Transmission Service (DTS)—explaining their principles, processing steps, features, and practical advantages for real‑time data synchronization and migration in big‑data environments.

Alibaba DTS · Big Data · CDC
6 min read
Python Crawling & Data Mining
May 4, 2021 · Big Data

Unlock 100+ Free Data APIs with Just 3 Lines of Python

This article introduces the GoPUP library, which provides over a hundred free data interfaces—including social media indexes, macro‑economic figures, company information, and epidemic statistics—accessible with simple Python code, making data analysis faster and easier.

API · Big Data · Python
7 min read
DataFunTalk
May 2, 2021 · Big Data

Continuous Optimization and Practice of Flink at Kuaishou

This article presents Kuaishou's comprehensive engineering practices for improving Flink's stability, task startup latency, and SQL performance, including high‑availability Kafka connectors, fault‑recovery mechanisms, I/O reductions, asynchronous job upgrades, aggregation optimizations, and future resource‑utilization plans.

Big Data · Flink · Kafka
10 min read
Architects' Tech Alliance
May 2, 2021 · Big Data

Understanding Data Middle Platform: Concepts, Drivers, Architecture, and Industry Trends

The article explains the concept of a data middle platform, its role in integrating and centralizing enterprise data, the drivers behind its adoption, architectural layers, implementation challenges, market landscape, and real‑world case studies, highlighting how big‑data, cloud and AI technologies enable digital transformation.

AI · Big Data · Digital Transformation
15 min read
IT Architects Alliance
May 1, 2021 · Big Data

Comprehensive Guide to ELK Stack (Elasticsearch, Logstash, Kibana) Installation, Configuration, and Architecture

This article provides a detailed overview of the ELK stack—including Elasticsearch, Logstash, Kibana, and Beats—explaining its components, why to use it for centralized log management, various deployment architectures, system tuning, security setup, and step‑by‑step installation and configuration commands for a production‑grade environment.

Big Data · ELK · Elasticsearch
22 min read
Programmer DD
Apr 30, 2021 · Big Data

Kafka 2.8.0 Release: Say Goodbye to ZooKeeper with Raft Metadata Mode

Kafka 2.8.0, released on April 19, 2021, introduces the groundbreaking Raft Metadata mode that eliminates the need for ZooKeeper, alongside numerous new features, bug fixes, and enhancements such as API controls for stream threads, SASL_SSL mutual TLS, and IP rate limiting.

Big Data · Kafka · Raft
5 min read
Architect
Apr 29, 2021 · Big Data

ELK Stack (Elasticsearch, Logstash, Kibana) Overview, Architecture, Installation, and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—including component descriptions, architectural diagrams, reasons for adoption, and step‑by‑step installation and configuration of Filebeat, Logstash, Elasticsearch, and Kibana on Linux, with optional Kafka integration for advanced pipelines.

Big Data · ELK · Elasticsearch
22 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · Big Data · GPU Acceleration
19 min read
DataFunTalk
Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC
21 min read
DataFunTalk
Apr 23, 2021 · Big Data

Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.

Batch Processing · Big Data · Data Integration
14 min read
IT Architects Alliance
Apr 23, 2021 · Industry Insights

Inside Toutiao’s Massive Scale: How the News App Handles Billions of Requests

This article provides an in‑depth technical overview of Toutiao’s rapid growth, data collection pipelines, user modeling, cold‑start strategies, recommendation engine architecture, storage solutions, push notification system, microservice design, and its three‑layer PaaS platform, illustrating how the news app serves hundreds of millions of users daily.

Big Data · Industry Insight · System Architecture
8 min read
Laravel Tech Community
Apr 22, 2021 · Big Data

Apache Kafka 2.8.0 Release Highlights and New Features

Apache Kafka 2.8.0 introduces several significant enhancements, including a new group API, mutual TLS authentication for SASL_SSL listeners, JSON request/response logging, broker connection rate limiting, topic identifiers, self‑managed quorum replacing ZooKeeper, and numerous improvements to Streams and Connect APIs for more reliable real‑time data pipelines.

Apache Kafka · Big Data · Distributed Systems
2 min read
Xianyu Technology
Apr 22, 2021 · Big Data

Real-time Performance Optimization of the Mahé Selection and Delivery System

By classifying data streams, aggregating large‑scale T+1 records in six‑hour windows, encoding attributes with multi‑value mappings, storing compressed rule‑hit backups, and synchronizing recall tables in real time, Mahé’s selection‑and‑delivery pipeline cut end‑to‑end latency from minutes to seconds, achieving robust second‑level responsiveness.

Big Data · Performance Optimization · Real-Time
12 min read
Big Data Technology & Architecture
Apr 22, 2021 · Big Data

Debunking Common Misconceptions About Data Lakes

This article debunks eight common misconceptions about data lakes, explains why they are not mutually exclusive with data warehouses, clarifies that they are not limited to Hadoop or raw data only, and provides practical tips for building flexible, secure, and business‑driven data lake solutions.

Analytics · Big Data · Cloud Services
21 min read
ITFLY8 Architecture Home
Apr 21, 2021 · Big Data

Designing an Industrial Internet Big Data Platform: Key Strategies

This article presents a comprehensive construction plan for an Industrial Internet big data platform, detailing its overall architecture, data acquisition, edge processing, cloud storage, analytics, security measures, and deployment best practices to enable scalable and reliable industrial IoT solutions.

Big Data · Data Analytics · Industrial Internet
1 min read
JD Tech
Apr 20, 2021 · Databases

Space-Filling Curves for Efficient Multidimensional Data Storage and Querying

This article introduces space-filling curves such as Z‑ordering, Hilbert, and XZ‑Ordering, explaining their mapping algorithms and how they transform multidimensional spatial data into one‑dimensional indices for efficient storage and querying in key‑value databases, while discussing challenges and practical examples.

Big Data · Space-filling Curves · Spatial Indexing
12 min read
Meituan Technology Team
Apr 15, 2021 · Big Data

Data Governance Practices at Meituan Hotel & Travel Platform

Meituan’s hotel‑travel platform tackled growing data‑quality, cost, efficiency, and security problems by establishing a full‑link governance framework: standardized processes, a Data Management Committee, and unified “One Model, One Logic, One Service, One Portal” systems. The framework cut per‑unit costs by ~40%, raised engineer productivity by over 60%, eliminated major security incidents, and set the stage for autonomous, AI‑driven data governance.

Big Data · Data Governance · Data Quality
32 min read
TAL Education Technology
Apr 15, 2021 · Artificial Intelligence

Tsinghua University and TAL Launch Phase II Collaboration on Intelligent Education Research

On April 15, Tsinghua University's Computer Science Department and TAL Education's Joint Research Center inaugurated Phase II of their partnership to advance intelligent education through AI-driven teaching environments, interactive mechanisms, knowledge‑graph construction, and personalized assessment technologies.

Artificial Intelligence · Big Data · Collaboration
7 min read
dbaplus Community
Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error Handling · JOIN optimization
25 min read
Programmer DD
Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Federation · HDFS
17 min read
DevOps
Apr 12, 2021 · Fundamentals

Understanding the Digital Economy: Definition, Evolution, and Why It Matters Now

The article explains what the digital economy is, its relationship with digital transformation, the strategic importance placed on it by China's 14th Five‑Year Plan, and offers guidance for IT professionals on how to respond to this emerging national priority.

Artificial Intelligence · Big Data · Digital Economy
14 min read
DataFunTalk
Apr 9, 2021 · Big Data

iQIYI Data Middle Platform: Architecture, Capabilities, and Future Outlook

This article explains how iQIYI’s data middle platform addresses the rapid growth and challenges of big data by providing a unified, standardized, and service‑oriented architecture that includes data production, processing, governance, metadata, AI‑enhanced services, and a roadmap for future enhancements.

AI · Big Data · architecture
23 min read
Top Architect
Apr 9, 2021 · Big Data

Technical Architecture and Data Processing of Toutiao News Feed System

This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.

Big Data · Toutiao · data pipeline
8 min read
Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Performance
12 min read
Sohu Tech Products
Apr 7, 2021 · Big Data

Data Warehouse Architecture and Modeling with Alibaba MaxCompute and DataWorks

This tutorial explains how to select a technical architecture, design a three‑layer data warehouse (ODS, CDM, ADS), model tables and dimensions, choose storage strategies, handle slowly changing dimensions, synchronize data with DataWorks, and implement dimensional modeling and fact tables using Alibaba MaxCompute for big‑data analytics.

Big Data · DataWorks · MaxCompute
32 min read
Big Data Technology Architecture
Apr 5, 2021 · Big Data

Evolution of Real‑Time Data Warehouses: From 1.0 to 3.0 and the Road to Batch‑Stream Unified Architecture

The article reviews the current state of offline Hive‑based data warehouses, explains the emergence of real‑time data warehouses (1.0) built on Kafka and Flink, discusses their limitations, and outlines the progression toward batch‑stream unified architectures (2.0 and 3.0) leveraging data‑lake technologies such as Iceberg.

Batch-Stream Integration · Big Data · Flink
13 min read
Python Crawling & Data Mining
Apr 4, 2021 · Big Data

Mastering User Behavior Analysis: 6 Essential Techniques for Data‑Driven Growth

This article explains six key user‑behavior analysis methods—event analysis, retention analysis, distribution analysis, conversion‑funnel analysis, path analysis, and session analysis—showing how they help businesses understand user actions, optimize product design, improve conversion rates, and boost revenue through data‑driven insights.

Big Data · Retention Analysis · conversion funnel
11 min read
Architect
Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance Tuning
47 min read
Architect
Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Performance · RDD
33 min read
Alibaba Cloud Native
Apr 2, 2021 · Cloud Native

How Fluid Turns Kubernetes into a High‑Performance Data Logistics System

This article explains how the open‑source Fluid project addresses the inefficiencies of data‑intensive AI and big‑data workloads in cloud‑native Kubernetes environments by introducing a data‑centric abstraction, dual orchestration mechanisms, and seamless integration with Alluxio to achieve faster, secure, and scalable data access.

Alluxio · Big Data · Cloud Native
19 min read
DataFunTalk
Mar 29, 2021 · Big Data

Beike's OLAP Platform: Druid Adoption, Architecture, Performance Comparison, and Operational Optimizations

This article details Beike's large‑scale OLAP platform, explaining why Druid was chosen over Kylin, describing the platform's four‑layer architecture, presenting performance and storage benchmarks, and outlining practical improvements to data ingestion, real‑time distinct counting, and cluster stability for high‑concurrency business scenarios.

Big Data · Druid · OLAP
19 min read
Programmer DD
Mar 29, 2021 · Big Data

Mastering Kafka: High‑Throughput Distributed Messaging Explained

This comprehensive guide introduces Kafka as a high‑throughput, distributed, publish‑subscribe messaging system, detailing its core concepts, architecture, features, replication, log management, reliability guarantees, and typical use cases such as log collection, real‑time analytics, and cross‑cluster mirroring.

Big Data · Distributed Messaging · Kafka
15 min read
DataFunTalk
Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · HDFS · Kuaishou
12 min read
HelloTech
Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage, and service consistency and validates response semantics. With dashboards, alerts, and correction tools, the platform quickly identified dozens of offline and online issues and guided future reliability enhancements.

AI · Big Data · Data Quality
26 min read
iQIYI Technical Product Team
Mar 26, 2021 · Big Data

Evolution of iQIYI's Real-Time Big Data Ecosystem

iQIYI transformed its data infrastructure from a traditional offline T+1 model to a comprehensive real‑time ecosystem—leveraging Kafka, Flink, a three‑layer Stream Data Service Platform, the Talos drag‑and‑drop pipeline, and a Druid‑based analytics platform—to enable low‑latency monitoring, personalized recommendations, ad targeting, and continuous machine‑learning workflows while planning future stream‑batch integration and lake‑warehouse convergence.

Analytics · Big Data · Flink
13 min read
Ctrip Technology
Mar 25, 2021 · Big Data

Challenges and Approaches for Real‑Time Data Aggregation Analysis

The article examines the key challenges of real‑time data aggregation—data freshness, timely processing, and result visibility—and surveys common solutions such as timestamp‑based sync, CDC, full and incremental computation, storage formats, and trigger mechanisms.

Big Data · CDC · Incremental Computation
11 min read
Suning Technology
Mar 24, 2021 · Big Data

How C2M Is Powering the Industrial Internet Boom in 2021

The article examines how policy‑driven industrial internet initiatives, combined with data‑rich C2M models and AIoT integration, are reshaping manufacturing in China, highlighting Suning's smart‑fridge case, strategic partnerships, and the broader push toward a digital‑first industrial era.

AIoT · Big Data · C2M
8 min read
DataFunTalk
Mar 24, 2021 · Big Data

Practical Experience of Using DorisDB for Real-Time and Offline Analytics in KuJiaLe's Big Data Platform

This article details how KuJiaLe's big data team replaced their legacy ADB and Presto clusters with a DorisDB MPP database, achieving sub‑second query latency, unified real‑time and offline analytics, simplified ETL pipelines, and significant cost savings while supporting billion‑row tables and high‑QPS workloads.

Big Data · DorisDB · ETL
9 min read
DataFunTalk
Mar 21, 2021 · Big Data

Single‑Point Recovery and Regional Checkpoint in Flink: Design, Implementation, and Optimizations

This article presents ByteDance's recent Flink enhancements, detailing a single‑point recovery mechanism for the network layer and a regional checkpoint strategy that together improve failover latency, reduce output loss, and enable scalable, high‑throughput stream processing for large‑scale real‑time recommendation workloads.

Big Data · Checkpoint · Flink
12 min read
dbaplus Community
Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop
11 min read
Xianyu Technology
Mar 18, 2021 · Backend Development

Multi-Engine Concurrent Search Architecture for Idlefish

Idlefish’s new multi‑engine concurrent search architecture replaces the tightly‑coupled single‑engine pipeline with deep engine isolation, asynchronous multi‑engine recall, and unified result merging. The change cut dump build time from 14 h to 5 h and shrank memory use dramatically, with latency changing by only ~15 ms, and it boosted exposure by 50% and orders by 33%.

Big Data · Lua · Query Planning
10 min read
Sohu Tech Products
Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Big Data · Cosine Similarity · Locality Sensitive Hashing
24 min read
dbaplus Community
Mar 16, 2021 · Big Data

How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.

Big Data · Cluster Optimization · Kwai Scheduler
14 min read
JD Cloud Developers
Mar 15, 2021 · Artificial Intelligence

Top Tech Weekly: AI Earthquake Monitor, PyTorch 1.8, Language Rankings & More

This developer community weekly roundup highlights CCTV's new big‑data governance platform, RedMonk's programming language rankings, Chromium‑based browsers adopting a four‑week release cycle, PyTorch 1.8 with AMD support, the world’s first AI‑driven earthquake monitoring system, Red Hat OpenShift 4.7, a deep meta‑learning model for city sales prediction, and a CVPR breakthrough in controllable human image generation.

Artificial Intelligence · Big Data · Cloud Native
9 min read
DataFunTalk
Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big Data · Flink · Performance
19 min read
Suning Technology
Mar 13, 2021 · Artificial Intelligence

How Suning’s AI‑Driven Digital Transformation Is Redefining Retail

At the 2021 National Retail CIO Conference in Shanghai, Suning’s Director Wang Junjie detailed the company’s AI, big‑data and cloud‑based three‑step digital transformation strategy, its suite of five mature digital products, and its call for partners to extend these solutions across industries.

Big Data · Cloud Computing · Digital Transformation
4 min read
vivo Internet Technology
Mar 10, 2021 · Big Data

Path Analysis Model Design and Engineering Implementation for Internet Data Operations

The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.

Big Data · ClickHouse · OLAP
21 min read
DataFunTalk
Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big Data · Kuaishou · MetaStore
15 min read
JD Cloud Developers
Mar 8, 2021 · Artificial Intelligence

Weekly Developer Highlights: Flutter 2, JD Cloud, Flink 1.12.2, AI Breakthroughs

This week’s developer roundup covers Google’s Flutter 2 launch, JD Cloud’s next‑gen server, Apache Flink 1.12.2 bug‑fix release, sidewalk robots classified as pedestrians, Microsoft Mesh mixed‑reality platform, Facebook’s self‑supervised SEER model, plus recent AI research from EMNLP and COLING conferences.

Artificial Intelligence · Big Data · Flutter
8 min read
Top Architect
Mar 5, 2021 · Big Data

Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

This article explains the architecture and core concepts of Elasticsearch and Lucene, outlines the requirements for cross‑month and high‑speed queries on massive datasets, and provides detailed index and search performance tuning techniques—including bulk writes, shard routing, doc‑values management, and pagination strategies—to achieve sub‑second response times on billions of records.

Big Data · Elasticsearch · Index Optimization
13 min read
Big Data Technology Architecture
Mar 4, 2021 · Big Data

Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.

Big Data · Data Clustering · Data Skipping
20 min read
Suning Technology
Mar 3, 2021 · Big Data

How Can China Build a Secure, Free Data Sharing Ecosystem?

The article examines China's push for free public data sharing, highlighting policy directives, the need for top‑level design, security standards, and education to create a unified, safe data‑governance framework that fuels the digital economy.

Big Data · Data Governance · Digital Economy
6 min read
21CTO
Mar 2, 2021 · Big Data

How Suning’s Data Platform Unifies OLAP, Metrics, Visualization & Reporting

Suning’s Data Middle Platform integrates an accelerated OLAP engine, a star‑schema metric system, a visualization tool built on standardized dimensions, and a unified report portal to solve data silos, improve security, and enable enterprises to evolve into technology‑driven organizations.

Analytics · Big Data · Data Platform
3 min read
Laravel Tech Community
Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct based on HLL, and several I/O connector improvements: Numeric field support in SpannerIO, schema support in ParquetIO, Thrift support in KafkaTableProvider, and an option to skip key/value cloning in HadoopFormatIO, along with various other improvements.

Apache Beam · Batch · Big Data
0 likes · 3 min read
DataFunTalk
Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data · Performance Optimization · Resource Management
0 likes · 27 min read
TAL Education Technology
Feb 25, 2021 · Databases

ClickHouse Overview: Architecture, Features, Performance, and Practical Use Cases at TAL Education

This article provides a comprehensive overview of ClickHouse, covering its background, core features, columnar storage, vectorized execution engine, table engines, distributed architecture, performance benchmarks, real‑world deployment at TAL Education, monitoring practices, encountered challenges, and future planning.

Big Data · ClickHouse · Columnar Database
0 likes · 18 min read
DataFunTalk
Feb 23, 2021 · Big Data

Meituan Hotel & Travel Data Governance: Journey, Practices, and Future Directions

This article outlines Meituan's hotel‑travel data governance evolution, describing the key quality, cost, security, standardization and efficiency challenges faced as the business scaled, and detailing the organizational, technical, metric, service and product‑entry solutions implemented to achieve systematic, measurable, and automated data governance.

Big Data · Data Governance · data security
0 likes · 19 min read
DataFunTalk
Feb 22, 2021 · Big Data

Optimizing Flink Real-Time Task Resources: Memory and Message Processing Perspectives

This article explores practical methods for optimizing Flink real‑time task resources on Kubernetes, focusing on memory usage analysis via GC logs and message‑processing capacity assessment, proposing automated detection of over‑provisioned memory and CPU, and outlining a workflow for resource adjustment to reduce costs.

Big Data · Flink · GC Analysis
0 likes · 18 min read
dbaplus Community
Feb 18, 2021 · Big Data

How JD Search Scaled Real‑Time Analytics with Flink and Doris

This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.

Big Data · Flink · OLAP
0 likes · 12 min read
DataFunTalk
Feb 17, 2021 · Big Data

Apache Iceberg 0.11.0: New Partition Support, SortOrder, Flink Streaming Reader, and Ecosystem Integrations

The article details Apache Iceberg 0.11.0's core enhancements—including partition changes, SortOrder, extensive Flink and Spark integrations, CDC/Upsert support, hash‑based write distribution to reduce small files, and upcoming 0.12.0 roadmap—while providing practical SQL and API examples for data‑lake practitioners.

Apache Iceberg · Big Data · CDC
0 likes · 13 min read
DataFunTalk
Feb 16, 2021 · Big Data

Understanding Presto: Architecture, Query Execution, and Youzan’s Practical Experience

This article explains Presto’s core architecture and low‑latency query execution process, describes how Youzan adopts Presto for various data‑platform scenarios, discusses the evolution of its deployment, and outlines the performance challenges and future enhancements such as Alluxio integration and session property management.

Big Data · Performance Optimization · Presto
0 likes · 13 min read
Architect
Feb 15, 2021 · Big Data

Elasticsearch Optimization Practices for Large-Scale Data Queries

This article explains how to optimize Elasticsearch for cross‑month and multi‑year queries on billions of records, covering Lucene fundamentals, index and search performance tweaks, configuration settings, and practical testing results to achieve sub‑second response times.

Big Data · Elasticsearch · Performance
0 likes · 14 min read
Architecture Digest
Feb 15, 2021 · Operations

ELK Stack Overview, Architecture, Installation and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—Elasticsearch, Logstash, Kibana, and Filebeat—including its components, why it’s used for centralized log management, detailed architecture diagrams, step‑by‑step installation commands, configuration examples, and a practical Kafka‑based data pipeline demonstration.

Big Data · ELK · Elasticsearch
0 likes · 22 min read
DataFunTalk
Feb 14, 2021 · Big Data

Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap

This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.

Apache Iceberg · Big Data · Cluster Management
0 likes · 13 min read
DataFunTalk
Feb 13, 2021 · Databases

Improving HBase Availability and Reducing Latency Spikes with Replication‑Based Multi‑Path Reads and ZGC

This article describes how the Didi HBase team tackled HBase’s weak availability and GC‑induced latency spikes by introducing a replication‑based client multi‑path read mechanism, configuring hedged reads, and adopting the Z Garbage Collector, and presents the resulting performance improvements and remaining challenges.

Big Data · HBase · Multi-Path Read
0 likes · 11 min read
DataFunTalk
Feb 12, 2021 · Big Data

Apache Flink at Kuaishou: Past, Present, and Future

Zhao Jianbo, head of Kuaishou's big data architecture team, presents an in‑depth overview of Apache Flink's adoption at Kuaishou, covering reasons for selection, development history, business data flows, technical innovations such as the Slimbase state engine, stability improvements, and future roadmap.

Apache Flink · Big Data · Kuaishou
0 likes · 16 min read
DataFunTalk
Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · Financial Services
0 likes · 16 min read
Alibaba Cloud Native
Feb 10, 2021 · Cloud Native

Accelerate AI and Big Data Workloads on Kubernetes with Fluid’s JindoRuntime

Fluid is an open‑source Kubernetes‑native engine that orchestrates and accelerates distributed datasets for AI and big‑data workloads, and this guide explains its core concepts, the JindoRuntime implementation, performance benefits, and step‑by‑step instructions to deploy and test JindoRuntime on a K8s cluster.

AI · Big Data · Cloud Native
0 likes · 14 min read
DataFunTalk
Feb 9, 2021 · Big Data

Design and Implementation of a Full‑Chain Marketing Data Product at NetEase Yanxuan

This article details NetEase Yanxuan's business background, market characteristics, data product requirements, and the end‑to‑end design of a full‑chain marketing data product, covering attribution, metric evaluation, analysis frameworks, scenario‑based recommendations, and practical Q&A for data‑driven growth.

Big Data · Data Product · Metric Evaluation
0 likes · 18 min read
dbaplus Community
Feb 9, 2021 · Operations

How Suning Integrated ClickHouse into a Full‑Link Monitoring Platform for Real‑Time OLAP Insights

This article explains how Suning's big‑data team incorporated ClickHouse into their end‑to‑end monitoring ecosystem, detailing the architecture, trace‑ID propagation, slow‑query tracking, MergeTree health checks, replica delay analysis, and the role of Chproxy in delivering comprehensive observability for high‑performance OLAP workloads.

Big Data · ClickHouse · OLAP
0 likes · 15 min read
DataFunTalk
Feb 8, 2021 · Big Data

Ozone: The Next‑Generation Distributed Storage System Aiming to Replace HDFS

This article explains how Apache Ozone, built on the HDDS layer, addresses the scalability, memory, and performance limitations of HDFS by splitting metadata services, using RocksDB, implementing fine‑grained locking and Raft‑based HA, and offering rich APIs. It also outlines current challenges and the future roadmap.

Big Data · HDDS · HDFS
0 likes · 29 min read
Efficient Ops
Feb 7, 2021 · Artificial Intelligence

How NLP Transforms Big Data Operations: Real-World AIOps Case Studies

This article explores the intersection of natural language processing and operations, outlines common text‑handling challenges, and presents three concrete AIOps case studies—log Q&A, anomaly detection, and ticket recommendation—while reflecting on a closed‑loop AI workflow and future research directions.

Big Data · NLP · aiops
0 likes · 9 min read
Architects' Tech Alliance
Feb 7, 2021 · Operations

Understanding the Essence and Implementation of Enterprise Digital Transformation

The article explains what digital transformation truly means for enterprises, outlines its three development stages, describes the core connection‑data‑intelligence framework, compares internal capability rebuilding with external ecosystem integration, and offers practical guidance on why and how companies should embark on digital transformation.

Big Data · Digital Transformation · Enterprise
0 likes · 24 min read
DataFunTalk
Feb 7, 2021 · Big Data

Optimizations and Extensions for Flink SQL in Tencent Real‑Time Computing Platform

This article, presented by Tencent senior engineer Du Li, reviews the current state of Flink SQL, compares the Jar, Canvas, and SQL modes, introduces window‑function extensions and retract‑stream optimizations, and outlines the roadmap for cost‑based optimization and new features in the real‑time computing platform.

Big Data · Flink · Retract Stream
0 likes · 19 min read
Open Source Linux
Feb 7, 2021 · Big Data

Mastering Kafka: Core Concepts, Architecture, and High‑Performance Deployment

This comprehensive guide explains Kafka's role as a message system, detailing topics, partitions, producers, consumers, replication, controller, ZooKeeper coordination, performance optimizations like sequential writes and zero‑copy, and practical recommendations for hardware, configuration, and cluster deployment.

Big Data · Cluster Deployment · Kafka
0 likes · 22 min read
DataFunTalk
Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big Data · DMP · Tagging System
0 likes · 20 min read

NetEase Yanxuan Data Task Governance Practice: Pre‑, In‑, and Post‑Operation Strategies

NetEase Yanxuan tackled data‑task governance by establishing pre‑operation guarantees, baseline‑driven in‑operation controls, and post‑operation interventions, delivering stable task output, reduced alarms, lineage awareness, rapid incident recovery, and reusable best‑practice products that earned the 2020 Technology Sharing Co‑building Award.

Baseline Management · Big Data · Data Governance
0 likes · 25 min read
ITFLY8 Architecture Home
Feb 4, 2021 · Big Data

Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommendation dependencies, data permissions, layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, query and analysis services, as well as monitoring, alerting, and operational best practices for building robust big‑data solutions.

Big Data · ETL · data-warehouse
0 likes · 25 min read
Full-Stack Internet Architecture
Feb 1, 2021 · Big Data

Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.

Big Data · Distributed Systems · Message Queue
0 likes · 20 min read
DataFunTalk
Feb 1, 2021 · Big Data

Building a Real-Time Data Warehouse with Apache Flink and Apache Iceberg: Architecture, Challenges, and Best Practices

This article presents Tencent's experience of constructing a real‑time data warehouse by integrating Apache Flink with Apache Iceberg, covering background pain points, Iceberg's table format and capabilities, Flink‑Iceberg streaming and batch processing, practical implementations, and future roadmap for data‑lake acceleration.

Apache Flink · Apache Iceberg · Big Data
0 likes · 21 min read