Tagged articles

3675 articles

Jul 11, 2021 · Big Data

Scaling Real‑Time & Offline Analytics with Druid: Architecture, Optimizations, and Lessons

This article explains how Beike adopted the Druid OLAP engine to serve massive real‑time and offline query workloads. It covers Druid's four‑component architecture, key technologies such as deep storage and metadata storage, practical optimizations (data ingestion, query caching, dynamic throttling, and timeout control), and a roadmap for future enhancements.

Big Data, Druid, OLAP

0 likes · 19 min read

Python Crawling & Data Mining

Jul 10, 2021 · Big Data

Why Tags Are the Core of Data Middle Platforms: Unlock Business Value

This article explains what tags are, how they function as data assets, defines the concept and architecture of a data middle platform, and demonstrates why tags are the pivotal element that enables enterprises to turn raw data into valuable, reusable business services.

Big Data, Data Architecture, Data Assets

0 likes · 7 min read

Tech Musings

Jul 8, 2021 · Big Data

Building a Simple Single-Node MapReduce System: From Theory to Code

This article walks through implementing a lightweight single‑machine MapReduce framework inspired by the original MapReduce paper, covering the abstract Map/Reduce model, task scheduling between master and workers, core Go code for map, reduce, worker, and coordinator, and a brief reflection on its limitations.

Big Data, Distributed Systems, Lab

0 likes · 10 min read
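
To make the Map/Reduce model named in the summary above concrete, here is a minimal single-process word-count sketch. It is written in Python for brevity rather than the Go used in the article, and the names `map_fn`, `reduce_fn`, and `run_mapreduce` are illustrative assumptions, not the article's code. Master/worker task scheduling is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(document: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def reduce_fn(key: str, values: List[int]) -> int:
    """Reduce step: combine all counts emitted for a single key."""
    return sum(values)


def run_mapreduce(inputs: Iterable[str], mapper, reducer) -> Dict[str, int]:
    # Map phase: run the mapper over each split and group values by key
    # (this grouping stands in for the shuffle between map and reduce).
    intermediate = defaultdict(list)
    for split in inputs:
        for key, value in mapper(split):
            intermediate[key].append(value)
    # Reduce phase: call the reducer once per distinct key.
    return {key: reducer(key, values) for key, values in sorted(intermediate.items())}


counts = run_mapreduce(["a b a", "b c"], map_fn, reduce_fn)
print(counts)
```

In a multi-worker version, the grouping dictionary would be replaced by intermediate files partitioned by key, which is where a coordinator becomes necessary.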

DataFunTalk

Jul 7, 2021 · Big Data

Solving Data Island Challenges and Enabling Advanced OLAP Analysis on Heterogeneous Big Data Platforms – Kyligence Solution Overview

This article explains the growing analytical demands in the big‑data era, the limitations of traditional OLAP, and how Kyligence’s distributed OLAP engine addresses data‑island issues, multi‑dimensional and many‑to‑many analysis, unified security, and performance optimization with MDX on Spark, delivering a seamless Excel‑like experience.

Analytics, Big Data, Data Integration

0 likes · 9 min read

dbaplus Community

Jul 4, 2021 · Big Data

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

This article explains Didi's end‑to‑end architecture for ingesting MySQL data into Hive using real‑time Binlog collection, a customized Canal component, message queues, HDFS storage, Dquality monitoring, and strategies for handling data drift and sharding in large‑scale big‑data environments.

Big Data, Canal, MySQL

0 likes · 13 min read

DataFunTalk

Jul 2, 2021 · Big Data

Exploring JD Logistics’ Billion‑Scale Data Management and Analytics with Apache Doris

This article details JD Logistics’ challenges in handling petabyte‑level data, outlines their existing data architecture, and explains how they adopted Apache Doris for faster, scalable analytics, covering table management, data import workflows, visualization tools, and future roadmap for data engineering.

Apache Doris, Big Data, Data Governance

0 likes · 14 min read

37 Mobile Game Tech Team

Jul 2, 2021 · Big Data

Inside Flink Metrics: Adding, Retrieving, and Exposing Metrics in TaskManager

This article walks through Flink's metric system by explaining the core interfaces such as MetricReporter and MetricRegistry, showing how metrics are added, registered, and queried during TaskManager startup, and detailing both REST and Prometheus approaches for retrieving metric values.

Big Data, Flink, Metrics

0 likes · 16 min read

TAL Education Technology

Jul 1, 2021 · Big Data

Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi‑stage optimization of an A/B testing metric system, describing its product architecture, Spark‑based computation engine, ClickHouse OLAP layer, cumulative calculation improvements, and batch processing techniques that reduced processing time from hours to a few minutes for hundreds of experiments and metrics.

A/B testing, Big Data, ClickHouse

0 likes · 8 min read

Architect

Jul 1, 2021 · Big Data

Data Governance Practices at Meituan Hotel Travel Platform

This article presents a comprehensive case study of Meituan's hotel‑travel data governance, covering the background, challenges, strategic goals, standardized processes, technical systems, cost and security optimizations, measurable outcomes, and future plans for automated governance.

Big Data, Cost Optimization, Data Governance

0 likes · 29 min read

Big Data Technology & Architecture

Jul 1, 2021 · Big Data

Data Governance: Concepts, Goals, Methodology, Tools, and Case Studies

This article explains what data governance is, why it is needed, its objectives, core components, implementation methodology, required tools, and real‑world practices from Meituan Delivery and Ant Financial, illustrating how organized data management drives business value and risk control.

Big Data, Data Governance, Data Management

0 likes · 26 min read

Youzan Coder

Jun 30, 2021 · Big Data

Online Monitoring Practices for Offline and Real-Time Data at Youzan

Youzan Data Report Center monitors offline batch and real‑time data pipelines using accuracy and timeliness rules, cross‑table checks, upstream‑downstream comparisons, and scheduled alerts to detect anomalies early. It has raised more than 25 alerts since 2021, and the team plans a unified data‑quality dashboard.

Big Data, Data Quality, Flink

0 likes · 12 min read

JD Retail Technology

Jun 29, 2021 · Big Data

The Value of Data and Data Products: From Concept to Practice

This article explains how data has become a critical production resource, outlines the limitations of traditional data‑analysis workflows, defines data products and their components, describes their advantages and key characteristics, and shares practical case studies of data‑product implementations in a large e‑commerce environment.

Big Data, Data Product, Data Value

0 likes · 16 min read

DataFunTalk

Jun 26, 2021 · Big Data

Building a Scalable Big Data Service System at Didi: Practices and Lessons

Zhang Liang shares Didi's four-stage journey of constructing and governing large‑scale open‑source big‑data engine services—including engine selection, hardware sizing, PaaS platform building, proxy architecture, and governance—highlighting practical challenges, solutions, and ROI‑driven best practices for Kafka, Elasticsearch, Flink, and related technologies.

Big Data, Data Infrastructure, Elasticsearch

0 likes · 16 min read

Architects Research Society

Jun 26, 2021 · Big Data

Comprehensive Overview of Over 50 Big Data Terms and Technologies

This article presents an extensive glossary of more than fifty big‑data concepts—including Apache projects, data‑analysis methods, storage formats, AI‑related terms, and emerging metrics—providing concise English explanations for each term.

Apache Hadoop, Big Data, Data Analytics

0 likes · 17 min read

Laravel Tech Community

Jun 25, 2021 · Big Data

Apache Kudu 1.15.0 – New Features and Improvements

Apache Kudu 1.15.0 adds experimental multi‑row transaction support (currently INSERT and INSERT_IGNORE), Raft‑based master configuration tools, table comment synchronization with Hive Metastore, per‑table size and row‑count limits configurable via flags or the kudu table set_limit tool, a customizable Kerberos principal flag, and TLS v1.3 with optional cipher‑suite selection, collectively enhancing low‑latency random access and analytical capabilities in the Hadoop ecosystem.

Apache Kudu, Big Data, Hadoop

0 likes · 3 min read

DataFunTalk

Jun 25, 2021 · Big Data

Building Data Products and a Data Middle Platform at NetEase Yanxuan: Practices and Lessons

The article details NetEase Yanxuan's end‑to‑end data product ecosystem and data middle platform, describing four core data products, the architecture of the data middle platform, efficient high‑quality delivery, governance practices, and key performance metrics that support data‑driven decision making.

BI, Big Data, Data Governance

0 likes · 14 min read

Yuewen Technology

Jun 25, 2021 · Big Data

Building Yuedu Group’s Overseas Big Data Platform: Architecture, Offline & Real‑Time Processing

This article details how Yuedu Group designed and implemented an overseas big data platform, covering overall system architecture, offline data‑warehouse construction with dimensional modeling, real‑time streaming using Oceanus and ClickHouse, and future plans for cost reduction and data quality assurance.

Big Data, Cloud Computing, Real-time Processing

0 likes · 12 min read

Architecture Digest

Jun 24, 2021 · Big Data

Kuaishou's Big Data Service Platform: Architecture, Key Technologies, and Future Outlook

This article introduces the service‑oriented transformation of Kuaishou's data platform, outlining the background challenges for data engineers, the platform's architecture and key technologies such as configuration‑driven development, multi‑mode APIs, data acceleration, and high‑availability mechanisms, and concludes with a summary of achievements and future directions.

Big Data, Data Acceleration, Data Platform

0 likes · 12 min read

dbaplus Community

Jun 22, 2021 · Databases

HBase vs Kudu vs ClickHouse: Architecture, Deployment, and Operations Compared

This article provides a side‑by‑side technical comparison of HBase, Kudu, and ClickHouse, covering their installation dependencies, architectural designs, read/write workflows, query capabilities, real‑world use cases at Didi, NetEase, and Ctrip, and practical operational tips.

Big Data, ClickHouse, HBase

0 likes · 20 min read

Didi Tech

Jun 22, 2021 · Big Data

MySQL Binlog Real‑time Collection and Hive Ingestion at DiDi: Architecture and Practices

DiDi’s real‑time MySQL‑to‑Hive pipeline captures row‑mode binlog with a custom Canal component, converts it to JSON, streams it via Kafka to HDFS, restores it into Hive tables, and uses Dquality for integrity, achieving millisecond latency for over 19,000 daily sync tasks handling roughly 50 TB of data.

Big Data, Binlog, Canal

0 likes · 13 min read

DevOps

Jun 22, 2021 · Operations

Building Digital Champion Capabilities: Integrating Customer Solutions, Operations, Technology, and Talent Ecosystems

The article outlines how digital‑champion enterprises achieve superior performance by integrating four core ecosystems—customer solutions, operations, technology, and talent—through strategic planning, partnership, and advanced technologies such as AI, big data, and industrial IoT, while highlighting maturity stages and practical implementation steps.

Artificial Intelligence, Big Data, Digital Transformation

0 likes · 28 min read

DataFunTalk

Jun 21, 2021 · Big Data

Flink + Iceberg 0.11 Practices in Qunar Data Platform

This article shares Qunar's experience using Flink together with Apache Iceberg 0.11 to address real‑time data warehouse challenges, covering background pain points, Iceberg architecture, solutions for Kafka data loss and Hive latency, and optimization practices such as small‑file handling, sorting, and checkpoint management.

Big Data, Data Lake, Flink

0 likes · 13 min read

Tencent Cloud Developer

Jun 21, 2021 · Industry Insights

How Hadoop YARN on Kubernetes Pods Supercharge Resource Utilization and Cut Costs

This article explains how Tencent Cloud EMR integrated Hadoop YARN with Kubernetes Pods to create a hybrid online‑offline deployment, implement elastic autoscaling and multi‑label resource allocation, and achieve several‑hundred‑percent improvements in CPU utilization while preserving cluster stability.

Big Data, Cloud Native, Hadoop

0 likes · 11 min read

Architecture Digest

Jun 21, 2021 · Databases

Using HBase for HR Performance Data Preprocessing Platform: Architecture, Concepts, and Best Practices

This article introduces the HR performance data preprocessing platform’s requirements, explains why HBase was selected as the storage solution, details its core concepts, architecture, data write/read processes, best practices, limitations, and presents performance metrics demonstrating its suitability for large‑scale, high‑throughput workloads.

Big Data, Database Architecture, HBase

0 likes · 12 min read

Qunar Tech Salon

Jun 21, 2021 · Big Data

Using Apache Iceberg 0.11 with Flink for Real‑time Data Lake: Architecture, Pain Points, and Solutions

This article examines the challenges of using Kafka, Flink, and Hive for real‑time data warehousing, introduces Apache Iceberg 0.11 as a solution, details its architecture, query planning, Flink integration, code examples, optimization techniques, and summarizes the benefits for large‑scale data processing.

Big Data, Data Lake, Flink

0 likes · 12 min read

DataFunTalk

Jun 20, 2021 · Databases

Xiaohongshu’s OLAP Architecture Evolution and DorisDB Adoption

This article details Xiaohongshu’s multi‑stage evolution of its OLAP infrastructure—from Redshift to Presto, ClickHouse, and finally DorisDB—describing the data pipeline, tool comparisons, advertising use‑case implementation, and the resulting performance and operational benefits.

Big Data, ClickHouse, DorisDB

0 likes · 12 min read

ITFLY8 Architecture Home

Jun 20, 2021 · Big Data

Why HBase Is the Ideal Choice for Large‑Scale HR Data Preprocessing

This article explains how HBase’s distributed column‑oriented architecture, high‑performance read/write capabilities, and flexible schema make it a cost‑effective solution for handling massive, unstructured HR performance data, covering its core concepts, cluster operation, best practices, and performance metrics.

Big Data, HBase, data preprocessing

0 likes · 11 min read

Ctrip Technology

Jun 17, 2021 · Big Data

Data Governance Practices and Cost Optimization at Ctrip's Data Asset Management Platform

The article outlines Ctrip's data governance framework, detailing background challenges, metadata construction, cost and quality optimization techniques, data flow improvements, platform modules, health metrics, and concludes with a summary of achievements and future directions.

Big Data, Ctrip, Data Governance

0 likes · 13 min read

Sohu Tech Products

Jun 16, 2021 · Big Data

Understanding Databases, Data Warehouses, Data Lakes, and the Emerging Lake House Architecture

This article explains the fundamental differences between databases, data warehouses, and data lakes, describes how they complement each other, and introduces the Lake House concept that integrates transactional and analytical workloads using cloud services such as Amazon S3, Redshift Spectrum, and Athena.

AWS, Big Data, Data Lake

0 likes · 11 min read

Efficient Ops

Jun 16, 2021 · Databases

Mastering ElasticSearch Data Migration and Disaster Recovery: Practical Strategies

This article presents a comprehensive guide to synchronizing heterogeneous data sources with ElasticSearch, migrating clusters across environments, and implementing robust disaster‑recovery solutions for both intra‑city and inter‑city high‑availability scenarios.

Big Data, Cluster Sync, Data Migration

0 likes · 16 min read

DevOps

Jun 16, 2021 · Operations

Understanding Digital Transformation: Definitions, Strategic Questions, Drivers, Frameworks, Roadmaps, Benefits and Pitfalls

The article provides a comprehensive overview of digital transformation, covering its definition, essential strategic questions, key drivers such as customer expectations, cloud and AI, priority areas in the value chain, practical frameworks, roadmap steps, expected benefits and common reasons for failure.

Artificial Intelligence, Big Data, Business strategy

0 likes · 20 min read

IT Architects Alliance

Jun 15, 2021 · Industry Insights

How Cloud Computing, Big Data, and AI Intertwine to Power Modern Services

This article explains the evolution of cloud computing from resource management to elastic virtualization, the emergence of IaaS, PaaS and SaaS service models, how big‑data processing relies on distributed cloud platforms, and why artificial intelligence now depends on massive data and cloud‑scale compute to deliver intelligent services.

Artificial Intelligence, Big Data, Cloud Computing

0 likes · 37 min read

Baidu Geek Talk

Jun 15, 2021 · Industry Insights

What Baidu Unveiled at QCon 2021: Key Takeaways from 7 Cutting‑Edge Sessions

This article compiles Baidu experts' presentations at QCon 2021, covering unified quality‑efficiency delivery for feed recommendation, software engineering capabilities, AIOps fault‑management practices, Apache Doris real‑time analytics, large‑scale Service Mesh deployment, massive service‑governance techniques, and deep‑learning platform innovations, with speaker details and audience benefits.

AI, Baidu, Big Data

0 likes · 12 min read

DataFunTalk

Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data, ETL, HBase

0 likes · 17 min read
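
The partitioning step mentioned in the summary above can be sketched as follows. HBase bulk loading expects each generated HFile to be sorted by row key and to fall within a single region, so rows are first bucketed by the table's region split keys and then sorted within each bucket. This is a minimal in-memory sketch under that assumption; `partition_for` and `prepare_bulkload` are hypothetical names, and the article's Spark job is not reproduced here.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

RowT = Tuple[bytes, str]


def partition_for(row_key: bytes, split_keys: Sequence[bytes]) -> int:
    """Return the index of the region whose [start, end) range contains row_key."""
    # Region start keys are inclusive and end keys exclusive, so a key equal
    # to a split point belongs to the region that starts at that split point.
    return bisect_right(split_keys, row_key)


def prepare_bulkload(rows: Sequence[RowT], split_keys: Sequence[bytes]) -> List[List[RowT]]:
    """Bucket rows by region, then sort each bucket by row key."""
    buckets: List[List[RowT]] = [[] for _ in range(len(split_keys) + 1)]
    for row_key, value in rows:
        buckets[partition_for(row_key, split_keys)].append((row_key, value))
    return [sorted(bucket) for bucket in buckets]


split_keys = [b"g", b"n"]  # assumed split points: three regions
rows = [(b"z1", "v"), (b"a2", "v"), (b"g", "v"), (b"a1", "v")]
regions = prepare_bulkload(rows, split_keys)
print(regions)
```

In a distributed job the same two steps usually map to a custom partitioner plus a per-partition sort, which avoids a global sort across all regions.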

Big Data Technology & Architecture

Jun 10, 2021 · Big Data

User Profiling: Concepts, Tag Classification, Tag‑System Construction, Applications and Implementation Steps

This article provides a comprehensive overview of user profiling, covering its definition, the five‑dimensional framework (goal, method, organization, standards, validation), various tag classifications, tag‑system architecture, modeling techniques, practical applications such as precise marketing and product innovation, and a step‑by‑step guide for building a profiling system using big‑data and AI methods.

Big Data, Customer Segmentation, data tagging

0 likes · 24 min read

Architecture Digest

Jun 10, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's streaming ETL solution built on Flink, covering business background, log characteristics, specialized and generic ETL services, architectural evolution, Python UDF integration, runtime optimizations, fault‑tolerance mechanisms, and future roadmap for unified real‑time and offline data warehouses.

Big Data, Flink, Log Processing

0 likes · 19 min read

58 Tech

Jun 9, 2021 · Big Data

Designing and Implementing a Unified Data Metric System for 58 Commercial Data Team

This article explains how 58's commercial data team built a comprehensive data metric system—from identifying common metric definition issues to establishing a domain‑driven hierarchy, distinguishing atomic and derived metrics, implementing a unified metric management platform, and providing APIs and examples for querying and visualizing metrics.

Big Data, Data Governance, java

0 likes · 17 min read

Xianyu Technology

Jun 8, 2021 · Big Data

Longgong Data Analysis Platform: Architecture and Solutions for Large‑Scale Structured Data

The Longgong Data Analysis Platform enables Idle Fish to capture, store, and analyze billions of structured product attributes in real time across more than 8,000 categories, using TableStore, MySQL, ODPS, and a distributed scheduler to achieve over 50% query speedup, 80% category coverage, and rapid support for search and recommendation teams.

Alibaba, Big Data, Data Platform

0 likes · 9 min read

Alibaba Cloud Developer

Jun 8, 2021 · Artificial Intelligence

Can Low‑Code Bridge the Gap Between Business and AI? Insights on Its Future

The article explores how low‑code platforms can complement traditional algorithm development, enhance collaboration between business users and engineers, and accelerate big‑data and AI initiatives by improving data cleaning, modular design, and feedback loops, while highlighting the trade‑offs of abstraction and flexibility.

AI, Algorithm Development, Big Data

0 likes · 9 min read

Big Data Technology & Architecture

Jun 6, 2021 · Big Data

Understanding Data Warehouses: Concepts, Architecture, Modeling, and Governance

This article provides a comprehensive overview of data warehouses, explaining their purpose, differences from databases, OLTP vs OLAP, traditional versus internet data warehouse models, layered architecture, modeling theories, metric dictionaries, date dimensions, naming conventions, data governance, and incremental synchronization techniques with practical SQL examples.

Big Data, Data Governance, ETL

0 likes · 24 min read
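
As a rough illustration of the incremental synchronization idea mentioned in the summary above, the sketch below merges a day's changed rows into the previous full snapshot, keeping the newest version of each key. The article itself uses SQL; this Python version, the `merge_incremental` name, and the `id` / `updated_at` column names are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]


def merge_incremental(snapshot: Iterable[Row], increment: Iterable[Row],
                      key: str = "id", version: str = "updated_at") -> List[Row]:
    """Merge changed rows into a full snapshot, keeping the newest row per key."""
    latest = {row[key]: row for row in snapshot}
    for row in increment:
        current = latest.get(row[key])
        # Insert new keys; overwrite existing keys only with an equal or newer version.
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda row: row[key])


yesterday = [
    {"id": 1, "status": "new", "updated_at": "2021-06-05"},
    {"id": 2, "status": "paid", "updated_at": "2021-06-05"},
]
today_changes = [
    {"id": 2, "status": "shipped", "updated_at": "2021-06-06"},
    {"id": 3, "status": "new", "updated_at": "2021-06-06"},
]
merged = merge_incremental(yesterday, today_changes)
print(merged)
```

The same logic is commonly expressed in SQL as a union of the snapshot and the increment followed by a "latest row per key" selection.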

DataFunTalk

Jun 6, 2021 · Big Data

Understanding Apache Pulsar: Cloud‑Native Messaging, Storage‑Compute Separation, and Batch‑Stream Fusion with Flink

This article explains Apache Pulsar’s cloud‑native, storage‑compute separated architecture, its data model and scalability features, and how it integrates with Flink to provide a unified platform for both real‑time streaming and batch processing in big‑data applications.

Apache Pulsar, Batch-Stream Integration, Big Data

0 likes · 17 min read

DataFunTalk

Jun 5, 2021 · Big Data

Building and Evolving a Data Service Platform for NetEase Cloud Music

The article details how NetEase Cloud Music co‑built a unified data service platform with NetEase YouShu, describing its architecture, phased development from internal use to online high‑concurrency services, feature enhancements such as API marketplace, multi‑source support, parameter conversion, and future roadmap for broader data products.

API Platform, Backend, Big Data

0 likes · 16 min read

dbaplus Community

Jun 5, 2021 · Big Data

How Flink + Iceberg Transform Data Lakes for Real‑Time Streaming

This article explains the concept of data lakes, outlines a four‑layer open‑source architecture, presents several classic Flink‑Iceberg use cases, details why Iceberg was chosen, and describes the design of Flink’s streaming sink and upcoming community roadmap.

Apache Flink, Apache Iceberg, Big Data

0 likes · 14 min read

MaGe Linux Operations

Jun 3, 2021 · Big Data

Why Kafka Handles Billions of Messages: Architecture, Use Cases, and Fast Performance

This article introduces Kafka, LinkedIn’s high‑throughput distributed messaging system, explains its core concepts such as brokers, topics, partitions, offsets, producers, consumers, and consumer groups, outlines common use cases like asynchronous decoupling and data‑stream processing, and details its fast performance mechanisms, fault‑tolerance, installation, and configuration steps.

Big Data, Data Streaming, Installation

0 likes · 11 min read
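
The core concepts listed in the summary above (topics, partitions, offsets, and consumer groups) can be illustrated with a toy in-memory model. This is not the Kafka client API: the `Topic` and `ConsumerGroup` classes are hypothetical, and CRC32 is used only as a stand-in for Kafka's own key-hashing partitioner.

```python
import zlib
from collections import defaultdict
from typing import List, Tuple


class Topic:
    """A topic is a set of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def produce(self, key: str, value: str) -> Tuple[int, int]:
        # The same key always maps to the same partition, preserving per-key order.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1  # the offset is the position in the log


class ConsumerGroup:
    """Each group tracks its own next-offset per partition."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition: int, max_records: int = 10) -> List[str]:
        start = self.offsets[partition]
        records = self.topic.partitions[partition][start:start + max_records]
        self.offsets[partition] = start + len(records)  # advance ("commit") the offset
        return records


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.produce("user-1", "created")
p2, o2 = topic.produce("user-1", "paid")
group_a, group_b = ConsumerGroup(topic), ConsumerGroup(topic)
first = group_a.poll(p1)    # group A reads both messages
second = group_a.poll(p1)   # nothing new for group A
other = group_b.poll(p1)    # group B has its own offsets and reads everything
```

Because offsets belong to the group rather than to the log, several independent applications can read the same partition without interfering with each other.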

ITFLY8 Architecture Home

Jun 3, 2021 · Big Data

Building a Real‑Time Flink Recommendation System: Architecture, Code & Deployment

This article walks through a complete Flink‑based recommendation system, detailing its v2.0 architecture, recommendation algorithms, front‑end and back‑end components, and step‑by‑step Docker deployment of MySQL, Redis, HBase, and Kafka services.

Big Data, Flink, HBase

0 likes · 10 min read

dbaplus Community

Jun 2, 2021 · Databases

How to Build a Mature Data Warehouse: 7 Essential Steps and Best Practices

This article explains why data warehouses are critical for decision‑making, outlines the challenges of immature warehouses, and provides a step‑by‑step framework—including goal setting, technology selection, problem identification, domain modeling, layer design, modeling principles, and governance standards—to help teams build a robust, maintainable data warehouse.

Big Data, Data Architecture, Database design

0 likes · 22 min read

Big Data Technology Architecture

Jun 2, 2021 · Big Data

Practical Operations of NetEase Big Data Platform: Architecture, EasyOps, Monitoring, and Experience Sharing

The presentation details NetEase's big data platform operations, covering current usage, the internally built EasyOps control system, a generic service‑operation framework based on Ansible, Prometheus‑Grafana monitoring, configuration management, network and storage optimizations, and lessons learned from cloud migration.

Ansible, Big Data, EasyOps

0 likes · 9 min read

Tencent Advertising Technology

Jun 2, 2021 · Big Data

Tencent Advertising Real-Time Strategy Data Framework: Architecture, Performance, and High Availability

The article presents a detailed overview of Tencent Advertising's real‑time strategy data framework, explaining its role in the ad system, the challenges of massive log volumes, and the architectural, performance, and high‑availability solutions implemented to achieve fast, reliable, and scalable ad decision making.

Big Data, Distributed Systems, Real-Time Strategy

0 likes · 24 min read

dbaplus Community

Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, raising Spark's share of tasks to 85%, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively. The migration followed a systematic process that addressed syntax, UDF, performance, and functional differences between the two engines.

Big Data, Performance Optimization, SQL Migration

0 likes · 20 min read

Qunar Tech Salon

Jun 1, 2021 · Big Data

Integrating TensorFlow for Java with Spark‑Scala for Distributed Machine Learning Prediction

This article shares practical experience of building a high‑performance distributed prediction service by combining TensorFlow for Java with Spark‑Scala, covering framework selection, performance comparison, model training, loading, inference, deployment, and optimization techniques for large‑scale data processing.

Big Data, Performance Optimization, Scala

0 likes · 16 min read

Top Architect

May 31, 2021 · Databases

How to Achieve Fast Queries: MySQL Index Optimization, Large‑Table Strategies, Elasticsearch Basics, and HBase Overview

This article explains common causes of slow MySQL queries, how proper indexing and lock handling can improve performance, introduces Elasticsearch’s inverted‑index advantages and suitable use cases, and outlines HBase’s column‑family storage model and row‑key design for large‑scale data.

Big Data, Database Optimization, HBase

0 likes · 18 min read

IT Architects Alliance

May 30, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's Flink‑based streaming ETL system, detailing business background, log classifications, specialized and generic ETL services, Python UDF integration, runtime optimizations, HDFS write tuning, SLA metrics, fault‑tolerance mechanisms, and future roadmap for unified data lakes and PyFlink support.

Big Data, Data Integration, ETL

0 likes · 19 min read

DataFunTalk

May 28, 2021 · Artificial Intelligence

JD's Open‑Source Federated Learning Solution 9N‑FL: Architecture, Features, Timeline, and Business Impact

This article introduces JD's open‑source federated learning platform 9N‑FL, explaining the data‑island problem, the fundamentals and classifications of federated learning, its four key features, the system’s layered architecture, development timeline, real‑world advertising use case results, and future enhancements.

9N-FL, Big Data, Federated Learning

0 likes · 15 min read

58 Tech

May 28, 2021 · Big Data

Practical Upgrade Experience of Hadoop 3.2.1 in 58.com Data Platform: HDFS, YARN, and MR3

This article details the end‑to‑end upgrade of a 5000‑node Hadoop 2.6.0 cluster to Hadoop 3.2.1 at 58.com, covering HDFS migration, RBF and EC adoption, YARN federation and rolling upgrades, MR3 integration, extensive compatibility testing, and operational lessons learned for large‑scale big‑data platforms.

Big Data, Cluster Upgrade, HDFS

0 likes · 19 min read

IT Architects Alliance

May 27, 2021 · Big Data

Mastering Data Model Architecture: Layered Design & Naming Best Practices

This article presents a comprehensive guide to data model architecture, detailing layered data store definitions, classification structures, processing flow, naming conventions, and core design principles to help engineers build scalable, maintainable data warehouses.

Big Data · Data Architecture · best practices

0 likes · 8 min read

dbaplus Community

May 27, 2021 · Big Data

How Vipshop Scales Billion‑Row OLAP with ClickHouse, Presto, and Flink

This article details Vipshop's OLAP evolution, describing how Presto, Kylin, and ClickHouse are integrated, the deployment architecture with HAProxy and chproxy, containerization on Kubernetes, and the Flink‑ClickHouse pipeline that enables self‑service analysis of hundred‑billion‑row datasets while addressing performance challenges and future roadmap.

Big Data · ClickHouse · Flink

0 likes · 28 min read

Tencent Cloud Developer

May 27, 2021 · Big Data

An Introduction to Kafka: Architecture, Core Components, Service Governance, Performance Optimizations, and Installation Guide

This article introduces Kafka, a high‑throughput distributed publish‑subscribe system built on brokers, topics, partitions, offsets, producers, and consumers, with ZooKeeper handling metadata and leader election. It covers the performance techniques behind Kafka (fast sequential disk writes, page‑cache zero‑copy transfers, ISR‑based replication) and gives step‑by‑step installation instructions for the JDK, ZooKeeper, and Kafka.

Big Data · Distributed Messaging · Installation

0 likes · 11 min read

Top Architect

May 26, 2021 · Big Data

Comprehensive Introduction to Apache Kafka: Concepts, Architecture, Installation, and Usage

This article provides a comprehensive guide to Apache Kafka, covering its core concepts, architecture, key APIs, topics and partitions, deployment steps, multi‑broker clustering, fault tolerance, and data integration using Kafka Connect, with detailed command‑line examples.

Big Data · Consumer · Distributed Streaming

0 likes · 26 min read

IT Architects Alliance

May 25, 2021 · Big Data

How Modern Data Middle Platforms Power Real‑Time and Offline Analytics

This article provides a comprehensive technical overview of data middle platforms, covering data aggregation, offline and real‑time development, smart operations, data asset management, governance, service layers, platform implementations, warehouse layering, and key differences between offline and real‑time data warehouses.

Big Data · Data Governance · Data Platform

0 likes · 26 min read

Alibaba Terminal Technology

May 25, 2021 · Frontend Development

Inside Alibaba’s Front‑End Visualization Showcase: Insights from CSIG’s Campus‑to‑Enterprise Event

The CSIG Visualization and Visual Analysis Committee’s visit to Alibaba’s Xixi Campus on May 21, 2021 brought together leading academics and industry experts to discuss graph data, big‑data research, spatio‑temporal data, low‑code design, and cutting‑edge visualization techniques, fostering deep industry‑academia collaboration.

Big Data · industry‑academia · low‑code

0 likes · 7 min read

Full-Stack Internet Architecture

May 25, 2021 · Backend Development

Comprehensive Interview Experience Summary and Preparation Guide for Major Tech Companies

This article compiles detailed interview experiences, question lists, and practical advice for candidates targeting backend, big‑data, and cloud positions at leading Chinese tech firms, offering timelines, personal background, preparation tips, and reflections to help job seekers navigate multi‑round technical interviews efficiently.

Big Data · System Design · career advice

0 likes · 28 min read

Architects Research Society

May 23, 2021 · Big Data

Data Architecture Trends: From Chaos to an Organized Era – Insights from Anthony J. Algmin

The article reviews Anthony J. Algmin’s reflections on past data‑architecture predictions, current hot topics such as cloud, AI/ML, data governance, and real‑time analytics, and forecasts future trends including metadata management, blockchain, and the evolving role of data architects within enterprises.

Artificial Intelligence · Big Data · Data Architecture

0 likes · 13 min read

DataFunTalk

May 22, 2021 · Databases

Combining HBase and Elasticsearch: Challenges and the Lindorm Searchindex Solution

The article examines the strengths and weaknesses of combining HBase and Elasticsearch for massive data storage and retrieval, outlines three integration patterns and their challenges, and presents Alibaba Cloud's Lindorm Searchindex as a SQL‑driven, low‑cost, strongly consistent solution that simplifies development and improves performance.

Big Data · Elasticsearch · HBase

0 likes · 11 min read

DeWu Technology

May 22, 2021 · Big Data

Unified Semantic Layer for Data Development: Addressing Pain Points and Optimizing Queries

A unified semantic layer for data development solves metric‑change ripple effects, developer burden, and large‑scale query performance problems by offering consistent metric definitions, multi‑view access, concise auto‑generated SQL, instant propagation of updates, and engine‑driven optimal query selection, thereby bridging business and engineering and cutting maintenance effort.

Big Data · OLAP · data engineering

0 likes · 5 min read

Top Architect

May 22, 2021 · Big Data

Kafka Basics: Topics, Partitions, Producers, Consumers, and Cluster Architecture

This article provides a comprehensive introduction to Kafka, covering its role as a message system, core concepts such as topics, partitions, producers, consumers, messages, the cluster architecture with replicas and controllers, performance optimizations, log segmentation, and network design, all illustrated with diagrams and code examples.

Big Data · Kafka · Message Queue

0 likes · 13 min read

Programmer DD

May 22, 2021 · Big Data

What Is a Data Lake? Origins, Architecture, and How It Powers Modern Big Data

This article explains the concept of a data lake—its origin in 2011, how it differs from traditional databases and data warehouses, its core characteristics such as raw data storage, on‑demand computing, and schema‑on‑read, as well as its advantages, challenges, architectural components, and future outlook within the big‑data ecosystem.

Big Data · Data Architecture · Data Governance

0 likes · 20 min read

IT Architects Alliance

May 22, 2021 · Big Data

Flink-Based Real‑Time Recommendation System: Architecture, Logic, and Docker Deployment Guide

This article presents a comprehensive walkthrough of a Flink‑powered recommendation system, detailing its v2.0 architecture, module functions, recommendation algorithms (hotness, product similarity, collaborative filtering), front‑end and back‑end UI, and step‑by‑step Docker deployment of MySQL, Redis, HBase, and Kafka services.

Big Data · Flink · HBase

0 likes · 11 min read

NetEase Game Operations Platform

May 22, 2021 · Big Data

Comprehensive Overview and Source Code Analysis of NetEase Spark Kyuubi

This article systematically introduces NetEase Kyuubi, an open‑source high‑performance JDBC and SQL execution engine built on Apache Spark, covering its background, core architecture, service discovery, session and operation management, startup processes, and key source‑code implementations with detailed code examples.

Apache Thrift · Big Data · Kyuubi

0 likes · 47 min read

Tencent Cloud Developer

May 21, 2021 · Big Data

Tencent Cloud Oceanus: Flink SQL Optimization and Extension Practices

Tencent Cloud Oceanus, a computing service that powers internal apps such as WeChat and external partners such as Bilibili, runs on more than 30,000 cores, handling 5 PB of data daily and 500,000 jobs. The article describes how it tackles Flink SQL’s syntax, function, and operational limits with table‑valued functions, incremental and enhanced tumble windows, and caching‑based retraction optimization, which cuts downstream data volume by up to 30× and improves join performance by about 20%.

Big Data · Flink SQL · Oceanus

0 likes · 19 min read

UCloud Tech

May 21, 2021 · Big Data

How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance

This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.

Big Data · Cache · Hadoop

0 likes · 13 min read

iQIYI Technical Product Team

May 21, 2021 · Big Data

Design and Implementation of iQIYI's User Feedback Analysis System

iQIYI built an in‑house user‑feedback analysis system that automatically ingests multi‑channel data, classifies and clusters issues, assesses feedback quality, localizes problems, and streamlines repair closure, boosting recall accuracy, alarm precision, closure rates and reducing cycle time across business lines to enhance user experience.

AI · Big Data · classification

0 likes · 15 min read

Byte Quality Assurance Team

May 19, 2021 · Big Data

Streaming 102: The World Beyond Batch

This article extends the concepts introduced in Streaming 101 by deeply exploring data processing patterns for unbounded data, covering windowing, watermarks, triggers, accumulation modes, and their practical implications for building robust low‑latency streaming pipelines.

Big Data · Streaming · Triggers

0 likes · 14 min read

Big Data Technology & Architecture

May 19, 2021 · Big Data

Comprehensive Guide to Data Governance: Metadata, Data Quality, Standards, and Asset Management

This article provides an extensive overview of data governance in the big‑data era, covering common pitfalls, the role of metadata, data quality management, data standardization, and data asset management, and offers practical recommendations for organizations to implement effective governance practices.

Big Data · Data Asset Management · Data Governance

0 likes · 42 min read

Tencent Cloud Developer

May 19, 2021 · Industry Insights

How Cloud‑Native Principles Transform Big Data Infrastructure

The article analyzes how cloud‑native concepts such as DevOps, micro‑services, continuous delivery, and containerization can be applied to big‑data foundations, outlining four guiding principles—industrialized delivery, cost quantification, load‑adaptive scaling, and data‑centric design—and describing concrete Hadoop‑based architectures and Tencent Cloud solutions that lower cost while boosting performance.

Big Data · Cost Optimization · Data Infrastructure

0 likes · 22 min read

UCloud Tech

May 18, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP for Big Data

This article provides a comprehensive tutorial on installing UCloud's free USDP version for private big‑data deployments, covering environment preparation, minimum node specifications, resource download, configuration files, one‑click initialization scripts, server startup, web UI access, license acquisition, and optional manual setup procedures.

Big Data · Linux · UCloud

0 likes · 16 min read

Alibaba Cloud Native

May 17, 2021 · Big Data

How Vineyard Accelerates Cloud‑Native Big Data Workflows with Zero‑Copy Memory Sharing

Vineyard, an open‑source distributed memory data‑sharing engine, tackles the inefficiencies of traditional file‑system based big‑data pipelines by enabling zero‑copy, in‑memory object exchange, Kubernetes‑aware scheduling, and plug‑in operators, delivering up to 1.34× faster end‑to‑end execution.

Big Data · Cloud Native · Memory Sharing

0 likes · 10 min read

Beijing SF i-TECH City Technology Team

May 17, 2021 · Artificial Intelligence

AIOps Overview: Concepts, Applications, and Case Studies

This article provides a comprehensive overview of AIOps, covering its definition, evolution from manual to AI-driven operations, core capabilities, and real-world applications in capacity prediction, anomaly detection, and alarm merging, illustrated with case studies from a food‑retail giant and internal logistics.

Artificial Intelligence · Big Data · Capacity Prediction

0 likes · 13 min read

Architecture Digest

May 17, 2021 · Big Data

Technical Architecture Overview of Toutiao: Data Pipeline, User Modeling, Recommendation System, and Microservices

The article provides a comprehensive technical overview of Toutiao's rapid growth, detailing its massive user base, data collection and processing pipelines, user modeling, cold‑start strategies, recommendation engines, storage solutions, push notification mechanisms, and the underlying microservice and PaaS architecture.

Big Data · Hadoop · Kafka

0 likes · 8 min read

DataFunTalk

May 16, 2021 · Big Data

Efficient Data Update/Delete and Real‑time Processing in the Arctic Lakehouse System

This article explains the evolution from traditional data warehouses to modern lakehouse architectures, introduces the Arctic system’s dynamic hash tree for fast update/delete, describes file splitting with sequence/offset ordering, and compares copy‑on‑write versus merge‑on‑read techniques for achieving low‑latency analytics.

Arctic · Big Data · Copy-on-Write

0 likes · 12 min read

Big Data Technology & Architecture

May 15, 2021 · Big Data

One‑Stop Big Data Platform Construction: Practices from WeBank, Beike, and iQIYI

This article shares practical notes on building a one‑stop big data platform, outlining essential functions such as data extraction, cleaning, storage, analysis, governance, and security, and presents implementation case studies from WeBank, Beike, and iQIYI to illustrate real‑world architectures and solutions.

Big Data · Data Governance · Data Platform

0 likes · 8 min read

Architects Research Society

May 15, 2021 · Big Data

Data Warehouse vs Data Lake: Definitions, Differences, and Architectural Considerations

Data warehouses store structured data centrally for reporting and analysis, while data lakes retain raw data in various formats, offering flexible, low‑cost, schema‑on‑read processing; the article explains their definitions, key differences, common misconceptions, and why many organizations now combine both to enable self‑service big‑data analytics.

Analytics · Big Data · Data Architecture

0 likes · 21 min read

DataFunTalk

May 14, 2021 · Big Data

Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.

AI · Big Data · Flink

0 likes · 27 min read

HelloTech

May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

Analytics · Big Data · ClickHouse

0 likes · 20 min read

iQIYI Technical Product Team

May 14, 2021 · Industry Insights

How iQIYI Merges AI, Big Data, and Cloud to Revolutionize Entertainment Production

In a keynote at the 2021 iQIYI World Conference, the CTO outlined how AI, big data, and cloud computing power three intelligent production suites, interactive user features, and immersive XR live concerts, illustrating the company’s tech‑driven strategy to reshape entertainment creation and consumption.

AI · Big Data · Cloud Computing

0 likes · 9 min read

ITPUB

May 14, 2021 · Big Data

How AnalyticDB Powers Petabyte-Scale Consumer Analytics in Alibaba’s Data Bank

The article details how Alibaba’s Data Bank leverages AnalyticDB’s cold‑hot tiered storage, high‑throughput real‑time writes, and low‑latency OLAP capabilities to handle petabyte‑scale consumer data, support flexible AIPL analysis, crowd profiling, and rapid audience selection while cutting costs and ensuring elasticity during peak events.

AnalyticDB · Big Data · Cold-Hot Storage

0 likes · 14 min read

Volcano Engine Developer Services

May 13, 2021 · Databases

Inside ByteGraph: How ByteDance Built a Scalable Distributed Graph Database

The article offers a comprehensive technical deep‑dive into ByteGraph, ByteDance’s home‑grown distributed graph database and graph‑processing engine, covering its directed property graph model, Gremlin query support, multi‑layer architecture, storage strategies for massive data, and real‑world graph‑computing practices.

Big Data · ByteGraph · Graph Database

0 likes · 28 min read

JD Retail Technology

May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data · ETL · Spark

0 likes · 16 min read

DataFunTalk

May 12, 2021 · Big Data

Building a Unified Real‑Time and Offline OLAP Platform with DorisDB at Yuanfudao

The article describes how Yuanfudao's data middle platform built a high‑performance OLAP service using the MPP HOLAP engine DorisDB to unify real‑time and batch analytics, meet low‑latency and high‑concurrency requirements, and support diverse education‑industry use cases such as live‑stream monitoring, advertising, and order analytics.

Big Data · DorisDB · Education Technology

0 likes · 13 min read

Tencent Advertising Technology

May 12, 2021 · Artificial Intelligence

2021 Tencent Advertising Algorithm Competition Live Streams and Technical Insights

The 2021 Tencent Advertising Algorithm Competition featured live streams on May 10-12, 2021, with experts discussing the competition's technical aspects and practical applications of the Angel distributed machine learning framework.

AI · Big Data · machine learning

0 likes · 4 min read

Tencent Tech

May 12, 2021 · Big Data

How Tencent Powered China’s 7th Census with Big Data and Cloud Tech

The article explains how China’s seventh national census, covering 1.41 billion people, was conducted using fully electronic data collection, self‑service mini‑programs, massive cloud‑native infrastructure, and high‑performance databases to achieve real‑time processing and unprecedented scale.

Big Data · census · databases

0 likes · 8 min read

Yuanfudao Tech

May 12, 2021 · Databases

Building a Unified Real‑time and Offline OLAP Platform with DorisDB at Yuanfudao

Yuanfudao's data middle platform leverages the MPP database DorisDB to create a unified OLAP system that supports both real‑time and batch analytics, handling millions of queries daily with sub‑second latency while meeting complex business requirements across its education services.

Big Data · DorisDB · OLAP

0 likes · 12 min read

DataFunTalk

May 11, 2021 · Big Data

Design and Practice of Baixin Bank's Flink‑Based Real‑Time Computing Platform and Hudi‑Powered Real‑Time Data Lake

This article details Baixin Bank's construction of a Flink‑driven real‑time computing platform integrated with Hudi as a real‑time data lake, covering background, architecture, data collection, transformation, storage layers, technical challenges, future roadmap, and practical lessons for similar big‑data initiatives.

Big Data · Flink · Hudi

0 likes · 12 min read

Big Data Technology & Architecture

May 11, 2021 · Big Data

Data Quality: Dimensions, Rules, and Constraints

The article explains the importance of data quality in the big data era, defines key quality dimensions such as completeness, uniqueness, validity, consistency, accuracy, timeliness, and credibility, and details how each dimension can be measured and enforced through specific constraints and validation rules.

Big Data · Consistency · Data Governance

0 likes · 9 min read

Alibaba Cloud Native

May 10, 2021 · Cloud Native

What Is Fluid? A Cloud‑Native Data Orchestration and Acceleration Platform

Fluid is an open‑source cloud‑native data orchestration and acceleration system that runs on Kubernetes, offering storage‑agnostic datasets, distributed caching, intelligent scheduling, and performance optimizations for data‑intensive AI and big‑data workloads.

AI · Big Data · Cloud Native

0 likes · 6 min read

Architects Research Society

May 9, 2021 · Big Data

Data Lakes vs. Data Warehouses: Key Differences and Choosing the Right Approach

This article explains the fundamental distinctions between data lakes and data warehouses, outlines five critical differences—including data retention, type support, user support, adaptability, and insight speed—and offers guidance on selecting the appropriate solution based on organizational needs and technology options.

Analytics · Big Data · Data Architecture

0 likes · 12 min read

Architecture Digest

May 7, 2021 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Practices

This article provides a detailed introduction to data middle platform concepts, covering data aggregation, ingestion tools, offline and real‑time development, data governance, service layers, monitoring, and deployment patterns, illustrating how enterprises build unified data ecosystems across various industries.

Big Data · Data Governance · Data Platform

0 likes · 25 min read

Qu Tech

May 6, 2021 · Big Data

How JuiceFS Cut HDFS Load by 26% and Boosted Presto Query Speed by 13%

This case study details how integrating JuiceFS with Presto reduced HDFS cluster load by about 26%, achieved over 90% cache hit rate for ad‑hoc queries, and lowered average query latency by roughly 13%, while simplifying operations and improving system stability.

Big Data · Cache · HDFS

0 likes · 9 min read

21CTO

May 5, 2021 · Big Data

AWS Unveils EMR Studio IDE for Data Scientists, Highlights Linux Kernel Security

AWS introduces a new EMR Studio IDE to accelerate data science workflows, while the Linux community bans University of Minnesota contributions over malicious patches and Google Chrome adopts Intel‑Microsoft hardware‑enforced stack protection to harden browser security.

AWS · Big Data · CET

0 likes · 6 min read

DataFunTalk

May 5, 2021 · Big Data

JD's OLAP Architecture: Design, Challenges, and Solutions

This article explains how JD constructs its OLAP platform from data ingestion to storage, querying, and management, describing the diverse data sources, real‑time and offline processing, scalability, consistency, fault tolerance, and future optimization plans, while addressing key technical challenges and solutions.

Big Data · Distributed Systems · JD.com

0 likes · 15 min read

DataFunTalk

May 4, 2021 · Big Data

Design and Implementation of a Real-Time Data Transmission Platform Based on Apache Flink at AutoHome

This article presents the background, requirements, architectural design, component interaction, and implementation details of AutoHome's real‑time data transmission platform built on Apache Flink, highlighting its high availability, exactly‑once semantics, scalability, DDL handling, and integration with existing streaming services.

Apache Flink · Big Data · Data Streaming

0 likes · 18 min read
