Tagged articles

3675 articles

May 4, 2021 · Big Data

Overview of CDC Tools: Canal, Maxwell, Databus, and Alibaba DTS

This article introduces four change‑data‑capture solutions—Canal, Maxwell, Databus, and Alibaba Data Transmission Service (DTS)—explaining their principles, processing steps, features, and practical advantages for real‑time data synchronization and migration in big‑data environments.

Alibaba DTS · Big Data · CDC

0 likes · 6 min read

Python Crawling & Data Mining

May 4, 2021 · Big Data

Unlock 100+ Free Data APIs with Just 3 Lines of Python

This article introduces the GoPUP library, which provides over a hundred free data interfaces—including social media indexes, macro‑economic figures, company information, and epidemic statistics—accessible with simple Python code, making data analysis faster and easier.

API · Big Data · Python

0 likes · 7 min read

DataFunTalk

May 2, 2021 · Big Data

Continuous Optimization and Practice of Flink at Kuaishou

This article presents Kuaishou's comprehensive engineering practices for improving Flink's stability, task startup latency, and SQL performance, including high‑availability Kafka connectors, fault‑recovery mechanisms, I/O reductions, asynchronous job upgrades, aggregation optimizations, and future resource‑utilization plans.

Big Data · Flink · Kafka

0 likes · 10 min read

Architects' Tech Alliance

May 2, 2021 · Big Data

Understanding Data Middle Platform: Concepts, Drivers, Architecture, and Industry Trends

The article explains the concept of a data middle platform, its role in integrating and centralizing enterprise data, the drivers behind its adoption, architectural layers, implementation challenges, market landscape, and real‑world case studies, highlighting how big‑data, cloud and AI technologies enable digital transformation.

AI · Big Data · Digital Transformation

0 likes · 15 min read

IT Architects Alliance

May 1, 2021 · Big Data

Comprehensive Guide to ELK Stack (Elasticsearch, Logstash, Kibana) Installation, Configuration, and Architecture

This article provides a detailed overview of the ELK stack—including Elasticsearch, Logstash, Kibana, and Beats—explaining its components, why to use it for centralized log management, various deployment architectures, system tuning, security setup, and step‑by‑step installation and configuration commands for a production‑grade environment.

Big Data · ELK · Elasticsearch

0 likes · 22 min read

Programmer DD

Apr 30, 2021 · Big Data

Kafka 2.8.0 Release: Say Goodbye to ZooKeeper with Raft Metadata Mode

Kafka 2.8.0, released on April 19, 2021, introduces an early‑access Raft metadata mode (KRaft) that lets a cluster run without ZooKeeper, alongside new features, bug fixes, and enhancements such as an API for adding and removing stream threads, mutual TLS on SASL_SSL listeners, and IP‑based connection rate limiting.

Big Data · Kafka · Raft

0 likes · 5 min read

Tencent Cloud Developer

Apr 29, 2021 · Industry Insights

Future of Databases & Big Data: Insights from the First Techo TVP Summit

The inaugural Techo TVP Developer Summit in Shenzhen gathered over 500 developers to explore the latest trends in databases, distributed systems, big data, and cloud‑native technologies, offering expert analyses, real‑world case studies, and career guidance for data professionals.

Big Data · Cloud Native · Distributed Systems

0 likes · 19 min read

Architect

Apr 29, 2021 · Big Data

ELK Stack (Elasticsearch, Logstash, Kibana) Overview, Architecture, Installation, and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—including component descriptions, architectural diagrams, reasons for adoption, and step‑by‑step installation and configuration of Filebeat, Logstash, Elasticsearch, and Kibana on Linux, with optional Kafka integration for advanced pipelines.

Big Data · ELK · Elasticsearch

0 likes · 22 min read

DataFunTalk

Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · Big Data · GPU Acceleration

0 likes · 19 min read

Practical DevOps Architecture

Apr 28, 2021 · Big Data

Step-by-Step Hadoop Environment Setup and Configuration on Three Linux Servers

This guide walks through preparing three Linux servers, installing JDK 1.8, configuring Hadoop core, HDFS, MapReduce, and YARN XML files, setting Java environment variables, formatting HDFS, and starting all services to access the Hadoop web UI.

Big Data · HDFS · Hadoop

0 likes · 4 min read

DataFunTalk

Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC

0 likes · 21 min read

DataFunTalk

Apr 26, 2021 · Big Data

Detailed Design and Practical Application of Apache Iceberg at NetEase Cloud Music

This article explains the motivations behind Apache Iceberg, its design principles such as snapshot and MVCC, compares it with Hive, and describes how NetEase Cloud Music adopted Iceberg to improve metadata handling, query performance, and operational stability for massive daily log data.

Apache Iceberg · Big Data · Data Lake

0 likes · 13 min read

Tencent Advertising Technology

Apr 26, 2021 · Artificial Intelligence

Tencent Ad Algorithm Competition and Its Academic Recognition

The Tencent Ad Algorithm Competition, now in its fourth edition, has gained significant academic recognition by aligning with the ACM MM Grand Challenge, introducing new tracks in video advertising technology to address multimedia challenges in the 5G era.

5G Technology · ACM MM · Big Data

0 likes · 3 min read

DataFunTalk

Apr 23, 2021 · Big Data

Building and Evolving Zhihu’s Flink‑Based Data Integration Platform

This article details Zhihu’s transition from a Sqoop‑driven data integration system to a Flink‑centric platform, covering business scenarios, historical architecture, design goals, technology choices, performance optimizations, and future plans for unified streaming‑batch processing across diverse storage systems.

Batch Processing · Big Data · Data Integration

0 likes · 14 min read

IT Architects Alliance

Apr 23, 2021 · Industry Insights

Inside Toutiao’s Massive Scale: How the News App Handles Billions of Requests

This article provides an in‑depth technical overview of Toutiao’s rapid growth, data collection pipelines, user modeling, cold‑start strategies, recommendation engine architecture, storage solutions, push notification system, microservice design, and its three‑layer PaaS platform, illustrating how the news app serves hundreds of millions of users daily.

Big Data · Industry Insight · System Architecture

0 likes · 8 min read

Laravel Tech Community

Apr 22, 2021 · Big Data

Apache Kafka 2.8.0 Release Highlights and New Features

Apache Kafka 2.8.0 introduces several significant enhancements, including a new group API, mutual TLS authentication for SASL_SSL listeners, JSON request/response logging, broker connection rate limiting, topic identifiers, self‑managed quorum replacing ZooKeeper, and numerous improvements to Streams and Connect APIs for more reliable real‑time data pipelines.

Apache Kafka · Big Data · Distributed Systems

0 likes · 2 min read

Xianyu Technology

Apr 22, 2021 · Big Data

Real-time Performance Optimization of the Mahé Selection and Delivery System

By classifying data streams, aggregating large‑scale T+1 records in six‑hour windows, encoding attributes with multi‑value mappings, storing compressed rule‑hit backups, and synchronizing recall tables in real time, Mahé’s selection‑and‑delivery pipeline cut end‑to‑end latency from minutes to seconds, achieving robust second‑level responsiveness.

Big Data · Performance Optimization · Real-Time

0 likes · 12 min read

Big Data Technology & Architecture

Apr 22, 2021 · Big Data

Debunking Common Misconceptions About Data Lakes

This article debunks eight common misconceptions about data lakes, explains why they are not mutually exclusive with data warehouses, clarifies that they are not limited to Hadoop or raw data only, and provides practical tips for building flexible, secure, and business‑driven data lake solutions.

Analytics · Big Data · Cloud Services

0 likes · 21 min read

ITFLY8 Architecture Home

Apr 21, 2021 · Big Data

Designing an Industrial Internet Big Data Platform: Key Strategies

This article presents a comprehensive construction plan for an Industrial Internet big data platform, detailing its overall architecture, data acquisition, edge processing, cloud storage, analytics, security measures, and deployment best practices to enable scalable and reliable industrial IoT solutions.

Big Data · Data Analytics · Industrial Internet

0 likes · 1 min read

Full-Stack Internet Architecture

Apr 20, 2021 · Big Data

Building Near Real-Time Elasticsearch Indexes for PB‑Scale Data

This article explains how to construct near real‑time Elasticsearch indexes for petabyte‑level datasets by comparing MySQL limitations, describing Elasticsearch fundamentals, and detailing a pipeline that uses Hive, wide tables, MySQL binlog, Canal, and Otter to achieve second‑level index updates.

Big Data · Canal · Elasticsearch

0 likes · 18 min read

JD Tech

Apr 20, 2021 · Databases

Space-Filling Curves for Efficient Multidimensional Data Storage and Querying

This article introduces space-filling curves such as Z‑ordering, Hilbert, and XZ‑Ordering, explaining their mapping algorithms and how they transform multidimensional spatial data into one‑dimensional indices for efficient storage and querying in key‑value databases, while discussing challenges and practical examples.

Big Data · Space-filling Curves · Spatial Indexing

0 likes · 12 min read

DataFunTalk

Apr 17, 2021 · Big Data

Evolution of Beike's OLAP Platform Architecture: From Hive‑MySQL to Multi‑Engine Support

This article reviews the evolution of Beike's OLAP platform—from the early Hive‑to‑MySQL stage, through a Kylin‑based architecture, to a flexible multi‑engine solution—detailing the design choices, metric system, engine selection criteria, encountered challenges, and future development plans.

Analytics · Big Data · Druid

0 likes · 24 min read

Meituan Technology Team

Apr 15, 2021 · Big Data

Data Governance Practices at Meituan Hotel & Travel Platform

Meituan’s hotel‑travel platform tackled exploding data‑quality, cost, efficiency, and security issues by establishing a full‑link governance framework—standardized processes, a Data Management Committee, and unified “One Model, One Logic, One Service, One Portal” systems—that cut per‑unit costs by ~40%, boosted engineer productivity over 60%, eliminated major security incidents, and set the stage for autonomous, AI‑driven data governance.

Big Data · Data Governance · Data Quality

0 likes · 32 min read

TAL Education Technology

Apr 15, 2021 · Artificial Intelligence

Tsinghua University and TAL Launch Phase II Collaboration on Intelligent Education Research

On April 15, Tsinghua University's Computer Science Department and TAL Education's Joint Research Center inaugurated Phase II of their partnership to advance intelligent education through AI-driven teaching environments, interactive mechanisms, knowledge‑graph construction, and personalized assessment technologies.

Artificial Intelligence · Big Data · Collaboration

0 likes · 7 min read

dbaplus Community

Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error Handling · JOIN optimization

0 likes · 25 min read

Programmer DD

Apr 14, 2021 · Big Data

Understanding HDFS Architecture: Key Components, Protocols, and Limitations

This article explains HDFS’s master‑slave architecture, detailing the roles of NameNode and DataNode, namespace management, communication protocols, client functions, common configuration parameters, maintenance commands, and the inherent limitations of a single‑NameNode design.

Big Data · DataNode · HDFS

0 likes · 5 min read

Programmer DD

Apr 13, 2021 · Big Data

What Makes HDFS the Backbone of Big Data? Overview, Architecture & Key Features

This article provides a comprehensive overview of HDFS—including its design goals, core components, data read/write workflows, high‑availability mechanisms, federation, storage policies, colocation benefits, and practical usage scenarios—explaining why it is the foundational distributed file system for large‑scale data processing.

Big Data · Federation · HDFS

0 likes · 17 min read

DevOps

Apr 12, 2021 · Fundamentals

Understanding the Digital Economy: Definition, Evolution, and Why It Matters Now

The article explains what the digital economy is, its relationship with digital transformation, the strategic importance placed on it by China's 14th Five‑Year Plan, and offers guidance for IT professionals on how to respond to this emerging national priority.

Artificial Intelligence · Big Data · Digital Economy

0 likes · 14 min read

DataFunTalk

Apr 9, 2021 · Big Data

iQIYI Data Middle Platform: Architecture, Capabilities, and Future Outlook

This article explains how iQIYI’s data middle platform addresses the rapid growth and challenges of big data by providing a unified, standardized, and service‑oriented architecture that includes data production, processing, governance, metadata, AI‑enhanced services, and a roadmap for future enhancements.

AI · Big Data · architecture

0 likes · 23 min read

Top Architect

Apr 9, 2021 · Big Data

Technical Architecture and Data Processing of Toutiao News Feed System

This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.

Big Data · Toutiao · data pipeline

0 likes · 8 min read

Big Data Technology Architecture

Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Performance

0 likes · 12 min read

Sohu Tech Products

Apr 7, 2021 · Big Data

Data Warehouse Architecture and Modeling with Alibaba MaxCompute and DataWorks

This tutorial explains how to select a technical architecture, design a three‑layer data warehouse (ODS, CDM, ADS), model tables and dimensions, choose storage strategies, handle slowly changing dimensions, synchronize data with DataWorks, and implement dimensional modeling and fact tables using Alibaba MaxCompute for big‑data analytics.

Big Data · DataWorks · MaxCompute

0 likes · 32 min read

Big Data Technology Architecture

Apr 5, 2021 · Big Data

Evolution of Real‑Time Data Warehouses: From 1.0 to 3.0 and the Road to Batch‑Stream Unified Architecture

The article reviews the current state of offline Hive‑based data warehouses, explains the emergence of real‑time data warehouses (1.0) built on Kafka and Flink, discusses their limitations, and outlines the progression toward batch‑stream unified architectures (2.0 and 3.0) leveraging data‑lake technologies such as Iceberg.

Batch-Stream Integration · Big Data · Flink

0 likes · 13 min read

Python Crawling & Data Mining

Apr 4, 2021 · Big Data

Mastering User Behavior Analysis: 6 Essential Techniques for Data‑Driven Growth

This article explains six key user‑behavior analysis methods—event analysis, retention analysis, distribution analysis, conversion‑funnel analysis, path analysis, and session analysis—showing how they help businesses understand user actions, optimize product design, improve conversion rates, and boost revenue through data‑driven insights.

Big Data · Retention Analysis · conversion funnel

0 likes · 11 min read

Architect

Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance Tuning

0 likes · 47 min read

Architect

Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Performance · RDD

0 likes · 33 min read

Alibaba Cloud Native

Apr 2, 2021 · Cloud Native

How Fluid Turns Kubernetes into a High‑Performance Data Logistics System

This article explains how the open‑source Fluid project addresses the inefficiencies of data‑intensive AI and big‑data workloads in cloud‑native Kubernetes environments by introducing a data‑centric abstraction, dual orchestration mechanisms, and seamless integration with Alluxio to achieve faster, secure, and scalable data access.

Alluxio · Big Data · Cloud Native

0 likes · 19 min read

Ctrip Technology

Apr 1, 2021 · Big Data

Design and Implementation of a Binlog‑Based Real‑Time Data Foundation Layer for Ctrip Finance

This article describes how Ctrip Finance built a unified financial data center by collecting MySQL binlog streams with Canal, transporting them via Kafka, persisting to HDFS with Spark‑Streaming, and merging into Hive tables, while addressing performance, idempotency, delete handling, and data‑quality checks.

Big Data · Binlog · Real-Time

0 likes · 14 min read

DataFunTalk

Mar 29, 2021 · Big Data

Beike's OLAP Platform: Druid Adoption, Architecture, Performance Comparison, and Operational Optimizations

This article details Beike's large‑scale OLAP platform, explaining why Druid was chosen over Kylin, describing the platform's four‑layer architecture, presenting performance and storage benchmarks, and outlining practical improvements to data ingestion, real‑time distinct counting, and cluster stability for high‑concurrency business scenarios.

Big Data · Druid · OLAP

0 likes · 19 min read

Programmer DD

Mar 29, 2021 · Big Data

Mastering Kafka: High‑Throughput Distributed Messaging Explained

This comprehensive guide introduces Kafka as a high‑throughput, distributed, publish‑subscribe messaging system, detailing its core concepts, architecture, features, replication, log management, reliability guarantees, and typical use cases such as log collection, real‑time analytics, and cross‑cluster mirroring.

Big Data · Distributed Messaging · Kafka

0 likes · 15 min read

DataFunTalk

Mar 27, 2021 · Big Data

Kuaishou's HDFS Architecture, Scale, Challenges, and Practices

This article presents an in‑depth technical overview of Kuaishou's massive HDFS deployment, detailing its architecture, petabyte‑scale data and thousands‑of‑node clusters, the key scalability challenges faced, and the custom solutions—including FixedOrder, RBF balancer, observer read, slow‑node mitigation, and tiered protection—implemented to keep the system performant and reliable.

Big Data · HDFS · Kuaishou

0 likes · 12 min read

HelloTech

Mar 26, 2021 · Big Data

Data Quality and Interface Semantic Monitoring for Algorithm Testing Platform

The article describes how algorithm testing teams tackled data‑quality and interface‑semantic monitoring problems by building a unified business monitoring platform that checks table, storage and service consistency, validates response semantics, and, through dashboards, alerts and correction tools, quickly identified dozens of offline and online issues, guiding future reliability enhancements.

AI · Big Data · Data Quality

0 likes · 26 min read

iQIYI Technical Product Team

Mar 26, 2021 · Big Data

Evolution of iQIYI's Real-Time Big Data Ecosystem

iQIYI transformed its data infrastructure from a traditional offline T+1 model to a comprehensive real‑time ecosystem—leveraging Kafka, Flink, a three‑layer Stream Data Service Platform, the Talos drag‑and‑drop pipeline, and a Druid‑based analytics platform—to enable low‑latency monitoring, personalized recommendations, ad targeting, and continuous machine‑learning workflows while planning future stream‑batch integration and lake‑warehouse convergence.

Analytics · Big Data · Flink

0 likes · 13 min read

Ctrip Technology

Mar 25, 2021 · Big Data

Challenges and Approaches for Real‑Time Data Aggregation Analysis

The article examines the key challenges of real‑time data aggregation—data freshness, timely processing, and result visibility—and surveys common solutions such as timestamp‑based sync, CDC, full and incremental computation, storage formats, and trigger mechanisms.

Big Data · CDC · Incremental Computation

0 likes · 11 min read

Suning Technology

Mar 24, 2021 · Big Data

How C2M Is Powering the Industrial Internet Boom in 2021

The article examines how policy‑driven industrial internet initiatives, combined with data‑rich C2M models and AIoT integration, are reshaping manufacturing in China, highlighting Suning's smart‑fridge case, strategic partnerships, and the broader push toward a digital‑first industrial era.

AIoT · Big Data · C2M

0 likes · 8 min read

ITFLY8 Architecture Home

Mar 24, 2021 · Big Data

Inside Suning’s Data Platform: How OLAP, Metrics and Visualization Power Business

Suning’s data middle platform integrates an accelerated OLAP engine, a star‑schema metrics system, a standardized visualization tool, and a unified report portal to break data silos, enhance security, and transform traditional enterprises into technology‑driven businesses.

Big Data · Metrics · OLAP

0 likes · 3 min read

DataFunTalk

Mar 24, 2021 · Big Data

Practical Experience of Using DorisDB for Real-Time and Offline Analytics in KuJiaLe's Big Data Platform

This article details how KuJiaLe's big data team replaced their legacy ADB and Presto clusters with a DorisDB MPP database, achieving sub‑second query latency, unified real‑time and offline analytics, simplified ETL pipelines, and significant cost savings while supporting billion‑row tables and high‑QPS workloads.

Big Data · DorisDB · ETL

0 likes · 9 min read

AntTech

Mar 23, 2021 · Big Data

From MapReduce to Ray: The Evolution of Big Data Computing Engines and Career Opportunities

This article traces the history of big‑data computing engines—from early MapReduce and Hadoop through Spark, Storm, Flink, and the newer Ray—explaining their technical advances, real‑world applications in AI and finance, and why graduates should consider a career in this rapidly evolving field.

AI · Big Data · Ray

0 likes · 16 min read

DataFunTalk

Mar 21, 2021 · Big Data

Single‑Point Recovery and Regional Checkpoint in Flink: Design, Implementation, and Optimizations

This article presents ByteDance's recent Flink enhancements, detailing a single‑point recovery mechanism for the network layer and a regional checkpoint strategy that together improve failover latency, reduce output loss, and enable scalable, high‑throughput stream processing for large‑scale real‑time recommendation workloads.

Big Data · Checkpoint · Flink

0 likes · 12 min read

Architect's Alchemy Furnace

Mar 20, 2021 · Databases

Boost Elasticsearch Performance with Hot‑Cold Data Node Separation

This article explains how to configure Elasticsearch nodes for hot and cold data, assign special node attributes, adjust index templates, and use API calls to migrate data, demonstrating significant query speed improvements through real‑world performance tests.

Big Data · Elasticsearch · Node Configuration

0 likes · 8 min read

dbaplus Community

Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop

0 likes · 11 min read

Alibaba Terminal Technology

Mar 19, 2021 · Frontend Development

How Alibaba’s Frontend AI Boosts Developer Efficiency on the Feitian Big Data Platform

This article explores Alibaba Cloud's Feitian big data platform and its front‑end intelligent solutions—covering smart editors, code recommendation, code diagnostics, automated visualization, and algorithm engineering—to illustrate how AI enhances developer productivity and product intelligence.

AI · Alibaba Cloud · Big Data

0 likes · 9 min read

AntTech

Mar 19, 2021 · Artificial Intelligence

Network Effects in Marketing: Graph Neural Network–Based Relationship Prediction and Clustered A/B Testing

This article presents a graph‑neural‑network approach to predict user influence, cluster users with distributed Louvain methods, and conduct network‑aware A/B experiments that accurately evaluate large‑scale marketing campaigns despite strong network effects.

A/B testing · Big Data · Graph Neural Network

0 likes · 9 min read

Suning Technology

Mar 18, 2021 · Operations

How Suning Carrefour Accelerated Digital Transformation: Lessons in Operations and AI

Suning Carrefour’s rapid digital overhaul since joining Suning in 2019 showcases how AI, big data, and omni‑channel strategies can boost store efficiency, reshape business models, integrate supply chains, and drive high‑growth retail performance.

AI · Big Data · Digital Transformation

0 likes · 9 min read

Xianyu Technology

Mar 18, 2021 · Backend Development

Multi-Engine Concurrent Search Architecture for Idlefish

Idlefish’s new multi‑engine concurrent search architecture replaces the tightly‑coupled single‑engine pipeline with deep engine isolation, asynchronous multi‑engine recall, and unified result merging. It cuts dump build time from 14 h to 5 h and shrinks memory use dramatically while adding only ~15 ms of latency, and it boosts exposure by 50% and orders by 33%.

Big Data · Lua · Query Planning

0 likes · 10 min read

Sohu Tech Products

Mar 17, 2021 · Big Data

Understanding Simhash: From Traditional Hash to Random Projection LSH

This article explains the principles and implementation of Simhash, covering the shortcomings of traditional hash functions, the use of cosine similarity, random projection for dimensionality reduction, locality‑sensitive hashing, and practical optimizations for large‑scale duplicate detection.

Big Data · Cosine Similarity · Locality Sensitive Hashing

0 likes · 24 min read
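
As a rough illustration of the Simhash idea summarized above, the following minimal sketch builds a 64‑bit fingerprint from weighted token hashes and compares fingerprints by Hamming distance. It is not taken from the article: the function names `simhash` and `hamming`, the use of MD5 as the per‑token hash, and term frequency as the weight are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def simhash(tokens, bits=64):
    """Return a `bits`-wide fingerprint (bits must be a multiple of 8, at most 128)."""
    weights = Counter(tokens)          # assumption: weight = term frequency
    acc = [0] * bits                   # one signed accumulator per bit position
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i in range(bits):
        if acc[i] > 0:                 # keep only the sign of each accumulator
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Documents whose fingerprints differ in only a few bits are treated as near‑duplicate candidates; the distance threshold is a tuning choice that this sketch leaves open.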

dbaplus Community

Mar 16, 2021 · Big Data

How Kuaishou Scales YARN to Tens of Thousands of Nodes with the Kwai Scheduler

This article explains how Kuaishou’s massive offline compute clusters—tens of thousands of machines processing hundreds of petabytes daily—are managed by a heavily customized YARN stack and the home‑grown Kwai Scheduler, detailing architecture, scheduler evolution, multi‑scenario optimizations, and future scaling plans.

Big Data · Cluster Optimization · Kwai Scheduler

0 likes · 14 min read

JD Cloud Developers

Mar 15, 2021 · Artificial Intelligence

Top Tech Weekly: AI Earthquake Monitor, PyTorch 1.8, Language Rankings & More

This developer community weekly roundup highlights CCTV's new big‑data governance platform, RedMonk's programming language rankings, Chromium‑based browsers adopting a four‑week release cycle, PyTorch 1.8 with AMD support, the world’s first AI‑driven earthquake monitoring system, Red Hat OpenShift 4.7, a deep meta‑learning model for city sales prediction, and a CVPR breakthrough in controllable human image generation.

Artificial Intelligence · Big Data · Cloud Native

0 likes · 9 min read

DataFunTalk

Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big Data · Flink · Performance

0 likes · 19 min read

Code Ape Tech Column

Mar 15, 2021 · Big Data

How to Find Common URLs in 5 Billion-Entry Files with Only 4 GB RAM

Given two files each containing 5 billion 64‑byte URLs (≈320 GB per file) and only 4 GB of memory, the solution partitions each file's URLs by hash modulo 1,000 into 1,000 smaller files, then uses hash sets on matching pairs of partitions to identify the intersecting URLs efficiently.

Big Data · Memory Optimization · hash partition

0 likes · 3 min read
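
The hash‑partition approach in the summary above can be sketched as follows. This is an illustrative outline rather than the article's code: the helper names (`bucket_of`, `partition`, `common_urls`), the choice of MD5 as the partitioning hash, and the one‑URL‑per‑line file layout are assumptions. With 1,000 buckets and an even hash distribution, each bucket holds about 5 million URLs (roughly 320 MB of raw text), which is what lets one bucket at a time be loaded into a set.

```python
import hashlib
import os
import tempfile


def bucket_of(url: str, num_buckets: int) -> int:
    """Stable bucket index: the same URL always maps to the same bucket."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def partition(path: str, out_dir: str, prefix: str, num_buckets: int) -> None:
    """Stream one input file and write each URL to its bucket file."""
    outs = [
        open(os.path.join(out_dir, f"{prefix}_{i}.txt"), "w", encoding="utf-8")
        for i in range(num_buckets)
    ]
    try:
        with open(path, encoding="utf-8") as src:
            for line in src:
                url = line.strip()
                if url:
                    outs[bucket_of(url, num_buckets)].write(url + "\n")
    finally:
        for f in outs:
            f.close()


def common_urls(path_a: str, path_b: str, num_buckets: int = 1000) -> set:
    """Intersect two URL files bucket by bucket; only one bucket of A is in memory at a time."""
    common = set()
    with tempfile.TemporaryDirectory() as tmp:
        partition(path_a, tmp, "a", num_buckets)
        partition(path_b, tmp, "b", num_buckets)
        for i in range(num_buckets):
            # Equal URLs share a bucket index, so only matching pairs need comparing.
            with open(os.path.join(tmp, f"a_{i}.txt"), encoding="utf-8") as fa:
                seen = {line.strip() for line in fa}
            with open(os.path.join(tmp, f"b_{i}.txt"), encoding="utf-8") as fb:
                for line in fb:
                    url = line.strip()
                    if url in seen:
                        common.add(url)
    return common
```

At the scale described, the matches would more likely be appended to an output file than collected in a set, and keeping 1,000 bucket files open at once may require raising the process's open‑file limit; both points are simplifications in this sketch.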

Python Crawling & Data Mining

Mar 14, 2021 · Artificial Intelligence

Quantitative Investing: Myths, Realities, and How AI Fits In

This article demystifies quantitative investing by explaining its basic concepts, common strategies, historical growth, inherent limitations, and the role of AI and big data, while urging investors to view quant methods as tools rather than a universal solution.

AI · Big Data · financial modeling

0 likes · 13 min read

Suning Technology

Mar 13, 2021 · Artificial Intelligence

How Suning’s AI‑Driven Digital Transformation Is Redefining Retail

At the 2021 National Retail CIO Conference in Shanghai, Suning’s Director Wang Junjie detailed the company’s AI, big‑data and cloud‑based three‑step digital transformation strategy, its suite of five mature digital products, and its call for partners to extend these solutions across industries.

Big Data · Cloud Computing · Digital Transformation

0 likes · 4 min read

vivo Internet Technology

Mar 10, 2021 · Big Data

Path Analysis Model Design and Engineering Implementation for Internet Data Operations

The article details the design and engineering of a high‑performance path analysis model for internet data operations, explaining session handling, Sankey visualizations, adjacency‑table storage, multi‑granular session partitioning, Spark‑to‑ClickHouse pipelines, and optimizations that enable billion‑scale user‑path queries in about one second.

Big Data · ClickHouse · OLAP

0 likes · 21 min read

DataFunTalk

Mar 10, 2021 · Big Data

Hive MetaStore Challenges and Optimizations at Kuaishou

At Kuaishou, the Hive MetaStore service, which stores metadata for Hive, faced scalability and performance challenges due to massive dynamic partitions and high query volume, leading to a series of architectural optimizations—including read‑write separation, API enhancements, traffic control, and federation—to improve stability and efficiency.

Big Data · Kuaishou · MetaStore

0 likes · 15 min read

Tencent Cloud Developer

Mar 10, 2021 · Cloud Native

How Cloud‑Native Data Lakes Slash Costs and Boost Performance on Public Cloud

The article analyzes the challenges of moving traditional on‑premise big‑data platforms to the cloud, outlines the cost‑saving opportunities of cloud‑native data lakes, presents three core architectural principles, and reviews Tencent Cloud's data lake product suite and its key use cases.

Big Data · Cloud Native · Cost Optimization

0 likes · 11 min read

JD Cloud Developers

Mar 8, 2021 · Artificial Intelligence

Weekly Developer Highlights: Flutter 2, JD Cloud, Flink 1.12.2, AI Breakthroughs

This week’s developer roundup covers Google’s Flutter 2 launch, JD Cloud’s next‑gen server, Apache Flink 1.12.2 bug‑fix release, sidewalk robots classified as pedestrians, Microsoft Mesh mixed‑reality platform, Facebook’s self‑supervised SEER model, plus recent AI research from EMNLP and COLING conferences.

Artificial Intelligence · Big Data · Flutter

0 likes · 8 min read

Top Architect

Mar 5, 2021 · Big Data

Elasticsearch Indexing and Search Optimization: Principles, Lucene Internals, and Performance Tuning

This article explains the architecture and core concepts of Elasticsearch and Lucene, outlines the requirements for cross‑month and high‑speed queries on massive datasets, and provides detailed index and search performance tuning techniques—including bulk writes, shard routing, doc‑values management, and pagination strategies—to achieve sub‑second response times on billions of records.

Big Data · Elasticsearch · Index Optimization

0 likes · 13 min read

Big Data Technology Architecture

Mar 4, 2021 · Big Data

Improving Interactive Analysis on Massive Datasets with Data Clustering and Data Skipping Using Spark and Iceberg

This article explores how data clustering techniques such as linear order, Z‑order, and Hilbert‑curve ordering can be applied in Apache Spark and Apache Iceberg to achieve efficient data skipping on terabyte‑scale tables, dramatically reducing file scans and enabling sub‑second interactive analytics for multi‑dimensional queries.

Big Data · Data Clustering · Data Skipping

0 likes · 20 min read

Suning Technology

Mar 3, 2021 · Big Data

How Can China Build a Secure, Free Data Sharing Ecosystem?

The article examines China's push for free public data sharing, highlighting policy directives, the need for top‑level design, security standards, and education to create a unified, safe data‑governance framework that fuels the digital economy.

Big Data · Data Governance · Digital Economy

0 likes · 6 min read

21CTO

Mar 2, 2021 · Big Data

How Suning’s Data Platform Unifies OLAP, Metrics, Visualization & Reporting

Suning’s Data Middle Platform integrates an accelerated OLAP engine, a star‑schema metric system, a visualization tool built on standardized dimensions, and a unified report portal to solve data silos, improve security, and enable enterprises to evolve into technology‑driven organizations.

Analytics · Big Data · Data Platform

0 likes · 3 min read

Laravel Tech Community

Feb 28, 2021 · Big Data

Apache Beam 2.28.0 Release Highlights and New Features

Apache Beam 2.28.0 introduces extensive Parquet support, new hash functions in BeamSQL and ZetaSQL, ApproximateDistinct via HLL, and enhanced I/O connectors, including Numeric field support in SpannerIO, schema support in ParquetIO, Thrift support in KafkaTableProvider, and an option in HadoopFormatIO to skip key/value cloning, along with various other improvements.

Apache Beam · Batch · Big Data

0 likes · 3 min read

DataFunTalk

Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data · Performance Optimization · Resource Management

0 likes · 27 min read

TAL Education Technology

Feb 25, 2021 · Databases

ClickHouse Overview: Architecture, Features, Performance, and Practical Use Cases at TAL Education

This article provides a comprehensive overview of ClickHouse, covering its background, core features, columnar storage, vectorized execution engine, table engines, distributed architecture, performance benchmarks, real‑world deployment at TAL Education, monitoring practices, encountered challenges, and future planning.

Big Data · ClickHouse · Columnar Database

0 likes · 18 min read

Python Programming Learning Circle

Feb 25, 2021 · Big Data

Parallel Computing and Python Multiprocessing: Concepts, Models, and Practical Examples

This article explains the fundamentals of parallel computing in the big‑data era, compares parallelism and concurrency, outlines GPU and distributed‑computing solutions, and provides a detailed guide to Python’s multiprocessing module with code examples, performance tests, and practical tips.

Big Data · GPU · Python

0 likes · 18 min read

DataFunTalk

Feb 23, 2021 · Big Data

Meituan Hotel & Travel Data Governance: Journey, Practices, and Future Directions

This article outlines Meituan's hotel‑travel data governance evolution, describing the key quality, cost, security, standardization and efficiency challenges faced as the business scaled, and detailing the organizational, technical, metric, service and product‑entry solutions implemented to achieve systematic, measurable, and automated data governance.

Big Data · Data Governance · data security

0 likes · 19 min read

DataFunTalk

Feb 22, 2021 · Big Data

Optimizing Flink Real-Time Task Resources: Memory and Message Processing Perspectives

This article explores practical methods for optimizing Flink real‑time task resources on Kubernetes, focusing on memory usage analysis via GC logs and message‑processing capacity assessment, proposing automated detection of over‑provisioned memory and CPU, and outlining a workflow for resource adjustment to reduce costs.

Big Data · Flink · GC Analysis

0 likes · 18 min read

dbaplus Community

Feb 18, 2021 · Big Data

How JD Search Scaled Real‑Time Analytics with Flink and Doris

This article details JD Search's journey from a Storm‑based pipeline to a Flink‑driven architecture backed by Apache Doris, covering business requirements, technical challenges, design trade‑offs, performance optimizations for massive traffic spikes, and future plans for their real‑time OLAP data warehouse.

Big Data · Flink · OLAP

0 likes · 12 min read

DataFunTalk

Feb 17, 2021 · Big Data

Apache Iceberg 0.11.0: New Partition Support, SortOrder, Flink Streaming Reader, and Ecosystem Integrations

The article details Apache Iceberg 0.11.0's core enhancements—including partition changes, SortOrder, extensive Flink and Spark integrations, CDC/Upsert support, hash‑based write distribution to reduce small files, and upcoming 0.12.0 roadmap—while providing practical SQL and API examples for data‑lake practitioners.

Apache Iceberg · Big Data · CDC

0 likes · 13 min read

DataFunTalk

Feb 16, 2021 · Big Data

Understanding Presto: Architecture, Query Execution, and Youzan’s Practical Experience

This article explains Presto’s core architecture and low‑latency query execution process, describes how Youzan adopts Presto for various data‑platform scenarios, discusses the evolution of its deployment, and outlines the performance challenges and future enhancements such as Alluxio integration and session property management.

Big Data · Performance Optimization · Presto

0 likes · 13 min read

Architect

Feb 15, 2021 · Big Data

Elasticsearch Optimization Practices for Large-Scale Data Queries

This article explains how to optimize Elasticsearch for cross‑month and multi‑year queries on billions of records, covering Lucene fundamentals, index and search performance tweaks, configuration settings, and practical testing results to achieve sub‑second response times.

Big Data · Elasticsearch · Performance

0 likes · 14 min read

DataFunTalk

Feb 15, 2021 · Big Data

Flink-Driven Incremental Data Warehouse Production at Meituan: Architecture, Streaming Integration, and Future Plans

This article presents Meituan's use of Flink to enable incremental data warehouse production, covering the warehouse architecture, streaming data integration evolution, real-time OLAP applications, platform design, and future directions for unified stream‑batch processing.

Big Data · Flink · Incremental Processing

0 likes · 11 min read

Architecture Digest

Feb 15, 2021 · Operations

ELK Stack Overview, Architecture, Installation and Configuration Guide (Version 7.7.0)

This article provides a comprehensive introduction to the ELK stack—Elasticsearch, Logstash, Kibana, and Filebeat—including its components, why it’s used for centralized log management, detailed architecture diagrams, step‑by‑step installation commands, configuration examples, and a practical Kafka‑based data pipeline demonstration.

Big Data · ELK · Elasticsearch

0 likes · 22 min read

DataFunTalk

Feb 14, 2021 · Big Data

Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap

This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.

Apache Iceberg · Big Data · Cluster Management

0 likes · 13 min read

DataFunTalk

Feb 13, 2021 · Databases

Improving HBase Availability and Reducing Latency Spikes with Replication‑Based Multi‑Path Reads and ZGC

This article describes how the Didi HBase team tackled HBase’s weak availability and GC‑induced latency spikes by introducing a replication‑based client multi‑path read mechanism, configuring hedged reads, and adopting the Z Garbage Collector, and presents the resulting performance improvements and remaining challenges.

Big Data · HBase · Multi-Path Read

0 likes · 11 min read

DataFunTalk

Feb 12, 2021 · Big Data

Apache Flink at Kuaishou: Past, Present, and Future

Zhao Jianbo, head of Kuaishou's big data architecture team, presents an in‑depth overview of Apache Flink's adoption at Kuaishou, covering reasons for selection, development history, business data flows, technical innovations such as the Slimbase state engine, stability improvements, and future roadmap.

Apache Flink · Big Data · Kuaishou

0 likes · 16 min read

DataFunTalk

Feb 10, 2021 · Big Data

AirWorks Data Intelligence Platform: Architecture, Cloud‑Native Ingestion, and Financial Asset Management Use Case

The article presents Entropy Simplify's AirWorks data intelligence platform, detailing its three‑layer architecture, cloud‑native multi‑source data ingestion system, low‑code ETL capabilities, technical features such as multi‑engine cooperation and data‑skew handling, and a financial asset‑management case study.

Big Data · ETL · Financial Services

0 likes · 16 min read

Alibaba Cloud Native

Feb 10, 2021 · Cloud Native

Accelerate AI and Big Data Workloads on Kubernetes with Fluid’s JindoRuntime

Fluid is an open‑source Kubernetes‑native engine that orchestrates and accelerates distributed datasets for AI and big‑data workloads, and this guide explains its core concepts, the JindoRuntime implementation, performance benefits, and step‑by‑step instructions to deploy and test JindoRuntime on a K8s cluster.

AI · Big Data · Cloud Native

0 likes · 14 min read

DataFunTalk

Feb 9, 2021 · Big Data

Design and Implementation of a Full‑Chain Marketing Data Product at NetEase Yanxuan

This article details NetEase Yanxuan's business background, market characteristics, data product requirements, and the end‑to‑end design of a full‑chain marketing data product, covering attribution, metric evaluation, analysis frameworks, scenario‑based recommendations, and practical Q&A for data‑driven growth.

Big Data · Data Product · Metric Evaluation

0 likes · 18 min read

dbaplus Community

Feb 9, 2021 · Operations

How Suning Integrated ClickHouse into a Full‑Link Monitoring Platform for Real‑Time OLAP Insights

This article explains how Suning's big‑data team incorporated ClickHouse into their end‑to‑end monitoring ecosystem, detailing the architecture, trace‑ID propagation, slow‑query tracking, MergeTree health checks, replica delay analysis, and the role of Chproxy in delivering comprehensive observability for high‑performance OLAP workloads.

Big Data · ClickHouse · OLAP

0 likes · 15 min read

DataFunTalk

Feb 8, 2021 · Big Data

Ozone: The Next‑Generation Distributed Storage System Aiming to Replace HDFS

This article explains how Apache Ozone, built on the HDDS layer, addresses the scalability, memory, and performance limitations of HDFS by splitting metadata services, using RocksDB, implementing fine‑grained locking, RAFT‑based HA, and offering rich APIs, while outlining current challenges and future roadmap.

Big Data · HDDS · HDFS

0 likes · 29 min read

Fangduoduo Tech

Feb 8, 2021 · Big Data

Why Build Your Own Data Lineage Engine? Lessons from Apache Atlas to Duo-Lineage

This article explains what data lineage is, why it is essential for data governance in large‑scale big‑data platforms, compares Apache Atlas with a custom solution, and details the technical choices, architecture, and performance optimizations behind the self‑built duo‑lineage system.

Apache Atlas · Big Data · Data Governance

0 likes · 14 min read

Efficient Ops

Feb 7, 2021 · Artificial Intelligence

How NLP Transforms Big Data Operations: Real-World AIOps Case Studies

This article explores the intersection of natural language processing and operations, outlines common text‑handling challenges, and presents three concrete AIOps case studies—log Q&A, anomaly detection, and ticket recommendation—while reflecting on a closed‑loop AI workflow and future research directions.

Big Data · NLP · aiops

0 likes · 9 min read

Architects' Tech Alliance

Feb 7, 2021 · Operations

Understanding the Essence and Implementation of Enterprise Digital Transformation

The article explains what digital transformation truly means for enterprises, outlines its three development stages, describes the core connection‑data‑intelligence framework, compares internal capability rebuilding with external ecosystem integration, and offers practical guidance on why and how companies should embark on digital transformation.

Big Data · Digital Transformation · Enterprise

0 likes · 24 min read

DataFunTalk

Feb 7, 2021 · Big Data

Optimizations and Extensions for Flink SQL in Tencent Real‑Time Computing Platform

This article, presented by Tencent senior engineer Du Li, details the current state of Flink SQL, compares Jar, Canvas, and SQL modes, introduces window‑function extensions, retract‑stream optimizations, and outlines future roadmap plans for cost‑based optimization and new features in the real‑time computing platform.

Big Data · Flink · Retract Stream

0 likes · 19 min read

Open Source Linux

Feb 7, 2021 · Big Data

Mastering Kafka: Core Concepts, Architecture, and High‑Performance Deployment

This comprehensive guide explains Kafka's role as a message system, detailing topics, partitions, producers, consumers, replication, controller, ZooKeeper coordination, performance optimizations like sequential writes and zero‑copy, and practical recommendations for hardware, configuration, and cluster deployment.

Big Data · Cluster Deployment · Kafka

0 likes · 22 min read

DataFunTalk

Feb 5, 2021 · Big Data

Design and Implementation of Beike's Data Management Platform (DMP)

This article details how Beike built a comprehensive Data Management Platform (DMP) that integrates user behavior and business data across multiple apps, outlines its five‑layer architecture, discusses data collection, processing, storage, real‑time profiling, and presents performance results and future optimization directions.

Big Data · DMP · Tagging System

0 likes · 20 min read

NetEase Yanxuan Technology Product Team

Feb 5, 2021 · Big Data

NetEase Yanxuan Data Task Governance Practice: Pre‑, In‑, and Post‑Operation Strategies

NetEase Yanxuan tackled data‑task governance by establishing pre‑operation guarantees, baseline‑driven in‑operation controls, and post‑operation interventions, delivering stable task output, reduced alarms, lineage awareness, rapid incident recovery, and reusable best‑practice products that earned the 2020 Technology Sharing Co‑building Award.

Baseline Management · Big Data · Data Governance

0 likes · 25 min read

ITFLY8 Architecture Home

Feb 4, 2021 · Big Data

Unlocking Data Middle Platform: From Ingestion to Real‑Time Analytics

This article provides a comprehensive overview of data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, scheduling, baseline control, heterogeneous storage, recommendation dependencies, data permissions, layered data architecture (ODS, DW, DWD, DWS, TDM, ADS), asset management, governance, service APIs, query and analysis services, as well as monitoring, alerting, and operational best practices for building robust big‑data solutions.

Big Data · ETL · data-warehouse

0 likes · 25 min read

Full-Stack Internet Architecture

Feb 1, 2021 · Big Data

Kafka Overview: Architecture, Advantages, Disadvantages, and Core Concepts

This article provides a comprehensive introduction to Apache Kafka, covering its distributed publish‑subscribe architecture, its key components such as brokers, topics, partitions, producers, consumers, and ZooKeeper, as well as its advantages, drawbacks, storage mechanisms, partition assignment strategies, and reliability guarantees for high‑throughput big‑data streaming.

Big Data · Distributed Systems · Message Queue

0 likes · 20 min read

DataFunTalk

Feb 1, 2021 · Big Data

Building a Real-Time Data Warehouse with Apache Flink and Apache Iceberg: Architecture, Challenges, and Best Practices

This article presents Tencent's experience of constructing a real‑time data warehouse by integrating Apache Flink with Apache Iceberg, covering background pain points, Iceberg's table format and capabilities, Flink‑Iceberg streaming and batch processing, practical implementations, and future roadmap for data‑lake acceleration.

Apache Flink · Apache Iceberg · Big Data

0 likes · 21 min read
