Tagged articles
3675 articles
21CTO
Oct 30, 2022 · Fundamentals

Top 10 IoT Trends That Will Transform Industries

This article explores the rapid growth of the Internet of Things, outlines the key drivers behind its expansion, highlights major challenges such as chip shortages and bandwidth limits, and presents ten emerging trends—including AI integration, 5G, edge computing, and security—that will shape multiple sectors in the coming years.

5G · AI · Big Data
0 likes · 9 min read
DataFunSummit
Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big Data · DLF · Kubernetes
0 likes · 18 min read
Python Crawling & Data Mining
Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Hadoop · Metadata Management
0 likes · 12 min read
DevOps Cloud Academy
Oct 27, 2022 · Big Data

Understanding DataOps: Concepts, Standards, and Enterprise Practices

This article explains DataOps as a methodology for improving data analysis quality and efficiency, outlines its origins, standards, and maturity model, and presents practical insights and case studies from Chinese enterprises on how DataOps addresses common data engineering challenges and drives digital transformation.

Big Data · Data Governance · Data Management
0 likes · 12 min read
Data Thinking Notes
Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big Data · Kryo Serialization · Performance Tuning
0 likes · 19 min read
ITPUB
Oct 26, 2022 · Big Data

Why Kafka Stores Data the Way It Does: Inside Its Architecture

This article provides an in‑depth technical analysis of Kafka’s storage architecture, covering its design goals, storage mechanisms, log segment layout, sparse indexing, log cleanup policies, and the performance techniques such as sequential writes, page cache, and zero‑copy that enable high‑throughput streaming.

Big Data · Log Segments · Sparse Index
0 likes · 22 min read
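
The sparse indexing mentioned in the summary above can be illustrated with a minimal sketch. It assumes one index entry is written roughly every `interval_bytes` of appended records and that a lookup binary-searches the index and then scans forward. The function names and the `(offset, size)` record shape are hypothetical, chosen for illustration; this is not Kafka's actual code.

```python
from bisect import bisect_right

def build_sparse_index(records, interval_bytes=4096):
    """records: (offset, size_in_bytes) pairs in log order.
    Returns a list of (offset, file_position) entries, one per ~interval_bytes."""
    index = []
    position = 0
    since_last = interval_bytes  # forces an entry for the very first record
    for offset, size in records:
        if since_last >= interval_bytes:
            index.append((offset, position))
            since_last = 0
        position += size
        since_last += size
    return index

def locate(index, records, target_offset):
    """Find the file position of target_offset: binary search, then a short scan."""
    offsets = [o for o, _ in index]
    i = bisect_right(offsets, target_offset) - 1  # last entry with offset <= target
    if i < 0:
        return None
    start_offset, position = index[i]
    for offset, size in records:
        if offset < start_offset:
            continue  # skip records before the indexed starting point
        if offset == target_offset:
            return position
        position += size
    return None

# Hypothetical log: 20 records of 1000 bytes each.
records = [(i, 1000) for i in range(20)]
index = build_sparse_index(records)
```

Because only every few records get an entry, the index stays small enough to keep in memory, at the cost of a short sequential scan per lookup.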
DataFunTalk
Oct 26, 2022 · Big Data

Metadata Management and Governance Practices at Wing Payment: Architecture, Techniques, and Future Outlook

This article explains how metadata serves as the foundation of enterprise data governance, outlines common data governance challenges, describes Wing Payment's metadata governance framework and platform architecture, and presents future directions such as multi‑source management, cross‑cluster disaster recovery, and intelligent recommendation.

Big Data · Data Governance · Data Lineage
0 likes · 18 min read
DataFunSummit
Oct 25, 2022 · Databases

Design and Implementation of Meituan's Database Autonomy Service (DAS)

This article presents the background, challenges, architectural design, technical solutions, and future roadmap of Meituan's Database Autonomy Service (DAS), a platform that leverages big‑data collection, AI‑assisted root‑cause analysis, and automated operations to improve database performance, reliability, and self‑service capabilities.

AI · Big Data · Database Autonomy
0 likes · 18 min read
Kuaishou Big Data
Oct 25, 2022 · Big Data

How Kuaishou Built a Scalable Big Data Platform with Unified Data Quality and Metric Services

This article details Kuaishou's end‑to‑end big data platform, describing its organizational model, unified data governance framework, comprehensive data‑quality solution, the design of a headless metric platform, key technologies such as automatic modeling and code generation, and future directions toward a decentralized, smart data fabric.

Big Data · Data Governance · Data Quality
0 likes · 21 min read
dbaplus Community
Oct 24, 2022 · Big Data

Mastering Data Warehouse Modeling: From ER to Data Vault

This article explains what a data warehouse is, why modeling it matters, and compares four major modeling approaches—ER, dimensional, Data Vault, and Anchor—detailing their structures, steps, advantages, and typical use cases, while also offering guidance on selecting tools and designing models.

Big Data · Data Vault · data-warehouse
0 likes · 15 min read
DataFunSummit
Oct 24, 2022 · Databases

Intelligent Operations: Challenges and Solutions with the IoTDB Time‑Series Database

This article examines the data challenges faced by intelligent operations (AIOps), evaluates IoTDB against other time‑series databases through performance benchmarks, outlines Cloudwise's architecture and open‑source contributions, and presents real‑world case studies demonstrating anomaly detection and root‑cause analysis in industrial settings.

Big Data · IoTDB · Time Series Database
0 likes · 15 min read
Data Thinking Notes
Oct 24, 2022 · Big Data

How to Diagnose and Fix Spark Data Skew: Practical Optimization Techniques

This article explains the causes of Spark data skew, how to locate skewed tasks using the Web UI, and presents six optimization methods—including increasing shuffle parallelism, filtering abnormal keys, two‑stage aggregation, map‑join, key sampling, and random‑prefix joins—plus a real‑world case study.

Big Data · Data Skew · JOIN
0 likes · 21 min read
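
Of the methods listed in the summary above, two-stage aggregation is the easiest to show in isolation. The sketch below simulates it in plain Python rather than Spark, assuming a simple count per key. The function name, `num_salts`, and the sample keys are hypothetical.

```python
import random
from collections import Counter

def two_stage_count(keys, num_salts=4, seed=0):
    """Count occurrences per key in two stages: salted partial counts, then a merge."""
    rng = random.Random(seed)
    # Stage 1: attach a random salt so one hot key is spread over up to num_salts groups.
    partial = Counter((rng.randrange(num_salts), key) for key in keys)
    # Stage 2: drop the salt and combine the partial counts per original key.
    final = Counter()
    for (_salt, key), count in partial.items():
        final[key] += count
    return partial, final

# Hypothetical skewed input: one key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, final = two_stage_count(keys)
```

In a real job the first stage would run as a shuffle on the salted key, so no single task receives all records of the hot key; the second stage then shuffles only the much smaller partial results.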
Selected Java Interview Questions
Oct 23, 2022 · Big Data

Building a Cost‑Effective Data Analysis Platform: ClickHouse vs Elasticsearch and Deployment Guide for Zookeeper, Kafka, Filebeat, and ClickHouse

This article compares Elasticsearch and ClickHouse for log analytics, presents cost‑benefit calculations, and provides a step‑by‑step deployment guide for Zookeeper, Kafka, Filebeat, and ClickHouse to build a scalable, low‑cost data analysis platform for SaaS services.

Big Data · ClickHouse · Deployment
0 likes · 12 min read
DataFunSummit
Oct 22, 2022 · Big Data

Tencent Music's Data Asset Management and Governance Practices

The article details Tencent Music's data governance journey, describing the background of rapid resource growth, challenges in cost management, a multi‑layered governance methodology—including metadata, tiered storage, and a Lego metadata platform—and the resulting improvements in resource utilization and data quality.

Big Data · Data Governance · Resource Optimization
0 likes · 14 min read
Architect's Guide
Oct 22, 2022 · Big Data

Meituan’s Kafka Optimizations: Reducing Read/Write Latency and Managing Large‑Scale Clusters

This article describes how Meituan’s data platform tackles the growing challenges of a 15,000‑plus‑node Kafka deployment by detailing current bottlenecks, latency‑reduction techniques across application and system layers, large‑scale cluster management strategies, and future directions for robustness and cloud‑native migration.

Big Data · Kafka · Large-Scale Clusters
0 likes · 21 min read
ITPUB
Oct 21, 2022 · Big Data

Hadoop Explained: Architecture, Core Components, and Real-World Applications

This article provides a comprehensive overview of Hadoop, covering its historical development, key characteristics, the HDFS storage framework, the MapReduce processing engine, YARN resource manager, and a wide range of real-world application scenarios, as well as the broader Hadoop ecosystem and its major components.

Big Data · Ecosystem · HDFS
0 likes · 20 min read
DataFunSummit
Oct 21, 2022 · Big Data

Exploring Real‑Time Data Lake Practices at Xiaohongshu Using Apache Iceberg

This article details Xiaohongshu's data platform architecture and three real‑time lake initiatives—log ingestion, CDC ingestion, and lake analysis—showcasing how Apache Iceberg, Flink, and custom shuffling algorithms solve small‑file and cross‑cloud challenges while enabling schema evolution and future multi‑cloud optimizations.

Apache Iceberg · Big Data · CDC
0 likes · 16 min read
Bilibili Tech
Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable Spark Thrift Server (STS) layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kubernetes · Kyuubi
0 likes · 20 min read
Hulu Beijing
Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from YARN to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

AWS · Big Data · Cloud Native
0 likes · 18 min read
ITPUB
Oct 20, 2022 · Big Data

Will HDFS Be Replaced? Analyzing Its Drawbacks and Future Alternatives

The article examines why the Hadoop Distributed File System (HDFS) may become obsolete by detailing its three main shortcomings (deployment complexity, metadata memory limits, and high replication overhead) and explores how newer architectures and erasure coding could address these issues.

Big Data · Distributed File System · HDFS
0 likes · 8 min read
Top Architect
Oct 19, 2022 · Big Data

Elasticsearch Architecture Overview and Core Concepts

This article provides a comprehensive overview of Elasticsearch, covering data types, Lucene fundamentals, cluster architecture, shard allocation, indexing mechanisms, storage strategies, refresh and translog processes, segment merging, performance tuning, and JVM optimization for building scalable, near‑real‑time search solutions.

Big Data · Cluster · Elasticsearch
0 likes · 37 min read
DataFunSummit
Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache Kyuubi · Big Data · Engine Architecture
0 likes · 13 min read
DataFunTalk
Oct 17, 2022 · Big Data

How Data Empowers the Fast‑Moving Consumer Goods Industry: Baicaowei’s End‑to‑End Data Platform Evolution

This article details Baicaowei’s journey from a Hadoop‑based data platform to a modern StarRocks‑driven architecture, illustrating how digitalization, evolving business needs, and streamlined data pipelines empower the fast‑moving consumer goods sector through efficient data collection, modeling, and analytics.

Big Data · Data Architecture · Digital Transformation
0 likes · 10 min read
ITPUB
Oct 15, 2022 · Big Data

Flink & Apache Hudi: Design, Practices, and Roadmap for Streaming Data Lakes

This talk introduces the evolution of data lakes, outlines Apache Hudi’s core features, details the Flink‑Hudi integration architecture—including write pipelines, small‑file handling, and read strategies—covers real‑world use cases such as near‑real‑time DB ingestion, OLAP, and ETL, and previews upcoming Hudi roadmap items.

Apache Hudi · Big Data · Data Lake
0 likes · 21 min read
Model Perspective
Oct 14, 2022 · Artificial Intelligence

How SimRank Leverages Graph Theory for Powerful Recommendations

SimRank, a graph‑theoretic recommendation algorithm, models users and items as a bipartite graph and computes similarity through iterative matrix operations, with extensions like SimRank++ incorporating edge weights and evidence, while scalable solutions use big‑data frameworks or Monte‑Carlo simulations.

Big Data · Matrix Computation · SimRank
0 likes · 8 min read
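
The iterative similarity computation described in the summary above can be sketched as follows. This is a naive, unweighted SimRank over an undirected user-item graph, not the SimRank++ or Monte‑Carlo variants. The graph, the decay factor of 0.8, and the iteration count are illustrative assumptions.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Naive SimRank: two nodes are similar if their neighbors are similar.
    neighbors maps each node to the list of nodes it is connected to."""
    nodes = list(neighbors)
    # Start from the identity: every node is fully similar to itself only.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[a][b] = 0.0
                else:
                    # Average pairwise similarity of the two neighbor sets, damped by decay.
                    total = sum(sim[i][j] for i in neighbors[a] for j in neighbors[b])
                    new[a][b] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim

# Hypothetical toy bipartite graph: two users and three items.
graph = {
    "u1": ["i1", "i2"],
    "u2": ["i2", "i3"],
    "i1": ["u1"],
    "i2": ["u1", "u2"],
    "i3": ["u2"],
}
scores = simrank(graph)
```

Each iteration costs on the order of the number of node pairs times the product of their degrees, which is why large graphs call for the distributed or sampling-based approaches the summary mentions.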
Shopee Tech Team
Oct 13, 2022 · Big Data

Improving Flink Unaligned Checkpoint: Problems, Principles, Optimizations, and Production Practices at Shopee

Shopee tackled frequent Flink checkpoint failures caused by back‑pressure by adopting and extending the community’s Unaligned Checkpoint (UC) mechanism. Its changes include adding overdraft buffers, improving legacy sources, introducing an aligned‑checkpoint timeout, enabling output‑buffer switching, merging small HDFS files, and fixing network‑buffer deadlocks. UC now runs stably across hundreds of jobs, and Shopee plans to enable it universally.

Big Data · Checkpoint Optimization · Flink
0 likes · 18 min read
DataFunSummit
Oct 12, 2022 · Big Data

Practical Application of Kyuubi in Xiaomi’s Big Data Platform

This article details how Xiaomi integrated the open‑source Kyuubi SQL gateway into its evolving big‑data platform, describing the challenges of multiple SQL services, the architectural redesign for a unified, high‑availability service, performance gains, new features such as engine pooling and Z‑ordering, and future roadmap plans.

Big Data · Data Platform · Kyuubi
0 likes · 15 min read
dbaplus Community
Oct 11, 2022 · Big Data

How We Replaced Elasticsearch with ClickHouse for Faster, Cheaper Log Storage

Facing growing log volumes and compliance needs, we evaluated ClickHouse’s hot‑cold‑archive storage to replace Elasticsearch, detailing configuration of storage policies, partitioning strategies, table creation, TTL handling, and cost‑effective OSS integration, ultimately achieving higher write performance and over 50% storage cost reduction.

Big Data · ClickHouse · Cold Hot Architecture
0 likes · 22 min read
DataFunSummit
Oct 11, 2022 · Big Data

Building Lakehouse Architecture with Delta Lake: Core Concepts, Technologies, Ecosystem, and Use Cases

This article explains how to construct a lakehouse architecture using Delta Lake by covering its basic concepts, version‑2 features, internal kernel and key technologies, ecosystem integrations, and classic data‑warehouse use cases such as G‑SCD and change‑data‑capture, providing practical guidance for modern big‑data engineering.

ACID Transactions · Big Data · Change Data Capture
0 likes · 27 min read
DataFunSummit
Oct 10, 2022 · Big Data

Stability Optimization Practices for Flink Jobs at Tencent

This article presents Tencent's practical experience in improving Flink job stability, covering the Oceanus platform, stability challenges, and concrete optimization techniques such as reducing failures, minimizing impact, accelerating recovery, and proactive issue detection, followed by a summary and future outlook.

Big Data · Flink · Real‑Time Computing
0 likes · 12 min read
MaGe Linux Operations
Oct 9, 2022 · Big Data

Master Flink on Kubernetes: Step‑by‑Step Deployment Guide

This guide walks you through deploying Apache Flink on Kubernetes, covering runtime modes, building Docker images, creating ConfigMaps and Services, launching session and application clusters, submitting jobs, monitoring the Web UI, and cleaning up resources, all with practical code snippets and commands.

Big Data · Docker · Flink
0 likes · 26 min read
DataFunTalk
Oct 9, 2022 · Big Data

Software Localization and the Future of Big Data Platforms in China

The article examines why software localization is essential for China’s data technology, outlines the challenges and current state of domestic operating systems, databases and big‑data platforms, discusses migration and upgrade strategies, and introduces NetEase DataFun’s self‑developed big‑data platform with its features and support.

Big Data · China · Platform Migration
0 likes · 11 min read

Solving Real‑World Data Quality Challenges with X‑Select’s DQC Platform

This article explains how X‑Select’s Data Quality Platform (DQC) addresses common data quality problems in large‑scale data development by defining six quality dimensions, leveraging open‑source solutions such as Apache Griffin and Qualitis, and implementing rule definition, execution, alerting, and workflow interruption within a Spark‑based architecture.

Big Data · Data Platform · Data Quality
0 likes · 15 min read
ITPUB
Oct 4, 2022 · Big Data

How Kafka Achieves Million‑TPS with Sequential I/O, MMAP, and Zero‑Copy

This article explains how Kafka attains million‑level transactions per second by leveraging sequential disk writes, memory‑mapped files, zero‑copy data transfer, and batch processing, detailing each technique's mechanics and performance impact.

Big Data · High Throughput · Sequential I/O
0 likes · 10 min read
DataFunTalk
Oct 3, 2022 · Artificial Intelligence

Building Real‑World Medical Knowledge Graphs and Clinical Event Graphs: Methods, Pipelines, and Applications

This article explains how YiduCore processes heterogeneous hospital data (EMR, HIS, LIS, RIS, literature) to construct real‑world medical knowledge graphs and clinical event graphs, detailing pipelines for entity extraction, normalization, graph cleaning, PSR scoring, graph embedding, and showcasing applications such as intelligent diagnosis, question answering, automated medical record generation, and clinical trial patient recruitment.

AI · Big Data · Medical Knowledge Graph
0 likes · 21 min read
DataFunTalk
Oct 2, 2022 · Big Data

Real-time Data Warehouse Architecture and Hologres Technology Overview

This article explains the evolving requirements of real‑time data warehouses, analyzes Alibaba's Hologres technology principles, presents recommended architectures for various latency scenarios, and discusses practical case studies, performance, security, and cost‑optimization strategies for modern big‑data platforms.

Big Data · Cloud Computing · Hologres
0 likes · 24 min read
Bilibili Tech
Sep 30, 2022 · Big Data

Bilibili's Efficient Lakehouse Platform Built on Trino and Iceberg

Bilibili’s new lake‑house platform, built on Trino and Iceberg, replaces Hive‑based pipelines by ingesting logs and DB data into Iceberg tables, applying advanced sorting, Z‑order/Hilbert clustering, bitmap and bloom indexes, virtual join columns and pre‑aggregation, enabling 70 000 daily queries on 2 PB with average scans of 2 GB and sub‑2‑second response times.

Big Data · Data Skipping · Iceberg
0 likes · 15 min read
Bilibili Tech
Sep 30, 2022 · Big Data

From BitMap to RoaringBitmap: Principles, Performance, and Big Data Applications

RoaringBitmap improves traditional BitMap by lazily allocating four container types, compressing sparse data, and dynamically switching between array, bitmap, and run containers, enabling fast exact set operations that power big‑data systems such as Kylin, ClickHouse, and Bilibili’s user‑visit and crowd‑package pipelines, dramatically reducing memory use and processing latency.

Big Data · Bitmap Compression · ClickHouse
0 likes · 16 min read
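
The lazy allocation and container switching described in the summary above can be sketched in a simplified form. This toy version models only two container kinds (a sorted array and a bitset) and omits run containers. The class name `TinyRoaring` and its methods are hypothetical; the 4096-element threshold follows the published Roaring design, where an array of 4096 16-bit values occupies the same 8 KB as a full bitmap.

```python
from bisect import bisect_left

ARRAY_LIMIT = 4096  # beyond this many values, a bitmap is no larger than the array

class TinyRoaring:
    """Toy 32-bit integer set: high 16 bits pick a container, low 16 bits are stored in it."""

    def __init__(self):
        self.containers = {}  # high bits -> sorted list (array) or int used as a bitset

    def add(self, value):
        high, low = value >> 16, value & 0xFFFF
        c = self.containers.get(high)
        if c is None:
            self.containers[high] = [low]  # containers are created only when first needed
        elif isinstance(c, list):
            i = bisect_left(c, low)
            if i == len(c) or c[i] != low:
                c.insert(i, low)
                if len(c) > ARRAY_LIMIT:
                    # Dense container: switch from the sorted array to a bitset.
                    bits = 0
                    for x in c:
                        bits |= 1 << x
                    self.containers[high] = bits
        else:
            self.containers[high] = c | (1 << low)

    def __contains__(self, value):
        high, low = value >> 16, value & 0xFFFF
        c = self.containers.get(high)
        if c is None:
            return False
        if isinstance(c, list):
            i = bisect_left(c, low)
            return i < len(c) and c[i] == low
        return (c >> low) & 1 == 1

    def kind(self, high):
        c = self.containers.get(high)
        if c is None:
            return None
        return "array" if isinstance(c, list) else "bitmap"

rb = TinyRoaring()
for v in range(5000):  # dense range, lands in container 0
    rb.add(v)
rb.add(70000)          # a single sparse value, lands in container 1
```

Here container 0 ends up as a bitmap while container 1 stays a one-element array, which is the memory saving on sparse data that the summary refers to.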
Youzan Coder
Sep 29, 2022 · Big Data

Implementing Spark Data Lineage with Spline: A Step‑by‑Step Guide

This article explains the growing importance of data lineage in large data warehouses, evaluates three Spark lineage extraction approaches, and provides a detailed, step‑by‑step guide to integrating the open‑source Spline agent—including codeless and programmatic initialization, configuration, dispatcher setup, post‑processing, and known limitations.

Apache Spark · Big Data · Data Governance
0 likes · 16 min read
Huolala Tech
Sep 29, 2022 · Big Data

How Huolala Cuts Big Data Costs with Hybrid Cloud Strategies

This article details Huolala's comprehensive big‑data cost‑control system—covering data‑asset measurement, budgeting, auxiliary governance, storage tiering, and elastic compute management—to dramatically reduce both storage and compute expenses while maintaining service quality across diverse workloads.

Big Data · elastic scaling · resource budgeting
0 likes · 21 min read
MaGe Linux Operations
Sep 28, 2022 · Big Data

Master TransBigData: Python Toolkit for Transportation Big Data

TransBigData is a Python library that streamlines the preprocessing, gridding, visualization, and OD extraction of transportation spatiotemporal datasets such as taxi GPS, bike sharing, and bus data, offering concise, efficient functions for data cleaning, rasterization, interactive mapping, and analytical workflows.

Big Data · Data visualization · GIS
0 likes · 13 min read
DataFunSummit
Sep 28, 2022 · Big Data

Elasticsearch Time Series Engine: Practices, Challenges, and Alibaba Cloud TimeStream

This article presents a comprehensive overview of using Elasticsearch as a time series engine, covering its motivations, challenges, key features, Alibaba Cloud TimeStream optimizations such as columnar storage, LSM structures, downsampling, and integration with Prometheus and Grafana, while also discussing performance and cost considerations.

Big Data · Downsampling · Elasticsearch
0 likes · 15 min read
DataFunSummit
Sep 25, 2022 · Big Data

Practical Optimizations and Resource Management of Hadoop YARN at Xiaomi

This article shares Xiaomi's internal practices of Hadoop YARN, covering scheduling and resource optimization, elastic scheduling, node overcommit handling, federation architecture, metadata warehouse construction, and future plans to improve cluster utilization and cost efficiency.

Big Data · Hadoop · YARN
0 likes · 20 min read
Aikesheng Open Source Community
Sep 24, 2022 · Databases

Weekly Database and Big Data Article Highlights

This weekly roundup presents a curated selection of high‑quality technical articles and resources on MySQL, database error‑log analysis, big‑data task optimization, SQL injection case studies, and upcoming SQLE development plans, offering readers up‑to‑date insights into database engineering and performance best practices.

Big Data · MySQL · SQL Auditing
0 likes · 4 min read
Xiaohongshu Tech REDtech
Sep 22, 2022 · Big Data

Graph Computing Algorithms for E‑commerce Anti‑Fraud and Reselling Bot Detection

The Xiaohongshu anti‑fraud team combats sophisticated same‑group and crowdsourced reselling bots by ingesting real‑time transaction streams into a Nebula Graph, using multi‑hop sub‑graph sampling, label propagation, and modularity‑based community detection to identify suspicious clusters, update risk pools, and enforce personalized purchase‑limit rules.

Big Data · anti-fraud · bot detection
0 likes · 9 min read
DataFunSummit
Sep 21, 2022 · Big Data

Practical Implementation of NetEase Yanxuan DMP Tag System: Architecture, Tag Production, Storage, and High‑Performance Query

This article details NetEase Yanxuan's DMP tag system, covering platform overview, tag definitions, production pipelines, multi‑layer storage architecture, high‑performance query techniques, and future roadmap, illustrating how data from various sources is transformed into actionable user tags for refined operations.

Apache Doris · Big Data · DMP
0 likes · 10 min read
Tencent Cloud Developer
Sep 20, 2022 · Information Security

Data Classification and Grading Architecture for Enterprise Data Security

The article details a practical, reusable enterprise architecture for data classification and grading that combines scanning tools, a rule‑engine with hot‑updates, a high‑performance identification service, and a security enforcement platform, addressing massive real‑time data volumes, diverse storage types, cross‑department isolation, and compliance with China’s data security laws.

Architecture · Big Data · Cloud Native
0 likes · 14 min read
DataFunSummit
Sep 15, 2022 · Big Data

Amazon Real-Time Data Warehouse Architecture and Services Overview

This article reviews the evolution of data warehouse architectures, explains Amazon's serverless real-time data lake design and its key services, and details Amazon Redshift's cloud-native real-time data warehouse features, streaming ingestion, and integrated machine learning capabilities.

AWS · Amazon Redshift · Big Data
0 likes · 10 min read
dbaplus Community
Sep 14, 2022 · Databases

How Apache Doris Enables Real‑Time Analysis of Hudi Data Lakes

This article explains the architecture of Apache Doris, introduces Apache Hudi as a data‑lake format, compares Lambda and Kappa approaches, and details the design, implementation steps, and future roadmap for querying Hudi tables directly from Doris.

Apache Doris · Apache Hudi · Big Data
0 likes · 10 min read
vivo Internet Technology
Sep 14, 2022 · Big Data

Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

Apache Pulsar · Big Data · Cluster Management
0 likes · 19 min read
HomeTech
Sep 13, 2022 · Big Data

Integrating Heterogeneous Data Sources with openLooKeng and Upgrading the Apache Kylin Connector at AutoHome

This article describes how AutoHome tackled the complexity of managing multiple relational, NoSQL, and Hive data stores by adopting openLooKeng for unified, cross‑source SQL queries, outlines its key features such as ANSI‑SQL support, diverse connectors, and query optimizations, and details the custom enhancements made to the Apache Kylin connector to better serve their commercial data analysis workloads.

Big Data · Connectors · Data Integration
0 likes · 13 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2022 · Big Data

From Hadoop to Cloud‑Native: The Evolution of Data Lakes and Modern Architecture

This article traces the history of data lakes from their 2010 inception with Hadoop through cloud‑native object storage, lakehouse formats like Delta Lake, and Alibaba Cloud's multi‑layer solution, outlining key architectural stages and practical construction challenges for enterprise‑grade implementations.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 9 min read
DataFunSummit
Sep 12, 2022 · Big Data

DataFun Summit 2022: Data Integration Platform – SeaTunnel V2 Architecture Evolution and DataOps Practices

The DataFun Summit 2022, held on September 17, gathered leading experts from Baiji Whale Open Source, NetEase, Tapdata, and Alibaba Cloud to share deep technical insights on SeaTunnel V2 architecture, DataOps implementations, and open‑source big‑data studio tools, offering attendees practical guidance for modern data platforms.

Apache · Big Data · Data Platform
0 likes · 8 min read
Tencent Cloud Developer
Sep 9, 2022 · Big Data

Data Lake, Data Warehouse, and Lakehouse: Concepts, Architectures, and Industry Practices

The article explains how data lakes excel at ingesting massive, varied data, data warehouses optimize storage and query performance, and lake‑house architectures combine both strengths—offering scalable, low‑cost storage with high‑speed analytics—highlighting industry solutions from Snowflake, Databricks, and major cloud providers.

Analytics · Big Data · Data Lake
0 likes · 8 min read
Selected Java Interview Questions
Sep 9, 2022 · Databases

Performance Testing and Optimization of ClickHouse and Elasticsearch for High-Concurrency Scenarios

This technical report details the requirement analysis, environment setup, monitoring tools, load‑test scripts, data design, execution results, and optimization recommendations for stress‑testing ClickHouse and Elasticsearch to ensure they can handle high‑concurrency business peaks.

Big Data · ClickHouse · Database Optimization
0 likes · 11 min read
Programmer DD
Sep 9, 2022 · Big Data

Why Kafka and Pulsar Lead the Distributed Streaming Landscape

This article introduces Apache Kafka and Apache Pulsar, compares their core features such as publish/subscribe messaging, storage, real‑time pipelines, and stream processing, outlines key characteristics like high throughput, scalability and fault tolerance, and explains fundamental concepts and architecture components unique to each platform.

Big Data · Distributed Streaming · Kafka
0 likes · 14 min read
JavaEdge
Sep 7, 2022 · Databases

Understanding HBase: Architecture, Data Model, and Read/Write Mechanics

This article provides a comprehensive overview of HBase, covering its column‑oriented design, core components such as HMaster, RegionServer and ZooKeeper, the data model with column families and row keys, and detailed step‑by‑step write and read processes for distributed storage.

Big Data · HBase · NoSQL
0 likes · 16 min read
DataFunSummit
Sep 7, 2022 · Big Data

Integrating Apache Doris with Hudi: Architecture, Design, and Implementation

This article explains the background, architecture, design choices, and step‑by‑step implementation for enabling Apache Doris to query Hudi data lake tables, covering Doris features, Hudi formats, Lambda/Kappa architectures, solution alternatives, and future roadmap for real‑time analytics.

Apache Doris · Big Data · Data Lake
0 likes · 10 min read
ShiZhen AI
Sep 7, 2022 · Big Data

Getting Started with DataHub: A One‑Stop Guide to Metadata Governance

This article walks you through the fundamentals of data governance, explains metadata management concepts, compares traditional tools with DataHub, and provides a step‑by‑step tutorial for installing Docker, Python, and DataHub 0.8.20 on CentOS 7, ingesting MySQL metadata, and exploring the UI.

Big Data · Data Governance · DataHub
0 likes · 19 min read
Huawei Cloud Developer Alliance
Sep 6, 2022 · Big Data

How China’s Universities Are Redesigning Big Data Education: Insights from the 2nd Virtual Research Meeting

The second virtual research meeting of China’s Data Science Curriculum Group gathered nearly a hundred educators and industry partners in Beijing to discuss new models for big‑data course design, curriculum construction, industry‑academia collaboration, and digital teaching platforms across multiple universities.

Big Data · Curriculum Design · Data Science
0 likes · 5 min read
DaTaobao Tech
Sep 6, 2022 · Big Data

SQL Optimization Techniques for ODPS (Open Data Processing Service)

The article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

Big Data · Data Skew · ODPS
0 likes · 23 min read
Bilibili Tech
Sep 6, 2022 · Big Data

Lancer: Evolution of Bilibili's Real-Time Streaming Architecture

Lancer, Bilibili’s real‑time streaming backbone, has evolved from a monolithic Flume pipeline to a log‑id‑isolated, Kubernetes‑native architecture where Go edge agents feed synchronous Kafka‑proxied gateways into per‑logid topics processed by dedicated Flink‑SQL jobs, delivering exactly‑once, back‑pressured, highly scalable data ingestion for billions of daily requests.

Architecture · Big Data · Flink
0 likes · 29 min read
DevOps
Sep 5, 2022 · Big Data

Why Informationization Is Not Equal to Digitalization: Insights for Enterprise Digital Transformation

The article explains the fundamental differences between informationization and digitalization, outlines how enterprises can bridge the gap through data‑driven strategies, and provides practical frameworks and case studies such as Netflix and Huawei to guide traditional manufacturers in successful digital transformation.

Big Data · Data-driven · Digital Transformation
0 likes · 13 min read
DataFunTalk
Sep 4, 2022 · Big Data

Design and Implementation of Bilibili's Offline Multi‑Datacenter Solution

This article describes Bilibili's offline multi‑datacenter architecture, explains why a scale‑out approach was chosen over scale‑up, and details the unit‑based design, job placement, data replication, routing, versioning, bandwidth throttling, and traffic analysis, along with operational results and future directions.

Big Data · HDFS · Job Scheduling
0 likes · 24 min read
DataFunSummit
Sep 2, 2022 · Big Data

ZhongAn Insurance Data Platform: Digital Transformation, 4633 Framework, and Real‑time Data Warehouse with StarRocks

This article details ZhongAn Insurance's digital transformation through its 4633 data‑centric framework, the architecture of its JiZhi data platform, the challenges of its original ClickHouse‑based real‑time warehouse, and how migrating to StarRocks improved performance, scalability, and operational efficiency across advertising and insurance use cases.

Big Data · Data Platform · Digital Transformation
0 likes · 13 min read
Shopee Tech Team
Sep 2, 2022 · Big Data

Shopee Data System Challenges and Apache Hudi Practices

Shopee tackled its data‑system bottlenecks by customizing Apache Hudi to provide unified stream‑batch integration, efficient state‑detail snapshots, and low‑latency wide‑table generation, using CDC‑based bootstrapping, COW/MOR tables, savepoints and partial updates, which cut latency to ten minutes, lowered resource use, and yielded several community‑backed enhancements.

Apache Hudi · Big Data · Data Integration
0 likes · 18 min read
Aikesheng Open Source Community
Aug 31, 2022 · Big Data

Tencent's Big Data Construction: Philosophy, Architecture Evolution, and Open‑Source Strategy

The article introduces Tencent's big‑data platform philosophy and overall architecture, detailing three generations of evolution from offline Hadoop‑based processing to real‑time Spark/Storm integration and finally AI‑driven machine‑learning platforms, while also highlighting the team, book publication, and a related giveaway event.

Architecture · Big Data · Cloud Native
0 likes · 12 min read
IT Architects Alliance
Aug 30, 2022 · Big Data

Understanding Kafka: Architecture, Topics, Partitions, Producers, Consumers, Offsets, Transactions, and Configuration

This article provides a comprehensive overview of Apache Kafka, explaining its distributed message‑queue architecture, the role of topics and partitions, producer and consumer workflows, leader election, offset management, consumer‑group rebalancing, delivery semantics, transaction processing, file organization, and key configuration settings.

Big Data · Distributed Messaging · Kafka
0 likes · 17 min read
DataFunSummit
Aug 30, 2022 · Operations

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big‑data computing services, detailing challenges such as heterogeneous data, sample imbalance, and real‑time constraints, and describing the multi‑stage data processing, hierarchical Bayesian modeling, and deployment results that reduce MTTR by 20%.

Big Data · Operations · Root Cause Analysis
0 likes · 16 min read

How to Build a Unified Big Data Security Platform with Ranger and Custom Authorization

This article explains the design and implementation of a unified data security control platform that protects user privacy and corporate data across multiple big‑data components (Hive, Hetu, GaussDB) by integrating Apache Ranger, custom authorization APIs, asynchronous processing, distributed locking, and SDK‑based authentication to achieve fine‑grained, one‑stop permission management.

Authorization · Big Data · Distributed Systems
0 likes · 17 min read
Architects' Tech Alliance
Aug 28, 2022 · Databases

Data Replication: Fundamentals, Technologies, and Industry Trends

The article explains data replication concepts, processes, and technologies across storage hardware, operating system, and database layers, outlines synchronous, asynchronous, and hybrid methods, discusses industry applications, trends such as hardware‑software decoupling, cloud replication, and big‑data real‑time copying, and highlights challenges and future directions.

Big Data · cloud · data replication
0 likes · 14 min read
Baidu Intelligent Cloud Tech Hub
Aug 26, 2022 · Cloud Computing

How Baidu Cloud Flow Log Boosts Network Visibility and Cuts Costs

Baidu Intelligent Cloud's Flow Log product provides real‑time, high‑throughput network flow collection, visualization, and analysis for VPC, dedicated line, and NAT gateways, enabling fault diagnosis, cost allocation, elephant‑flow management, and security inspection across ultra‑large scale cloud environments.

Big Data · Cloud Computing · Cost Management
0 likes · 10 min read
ByteDance Data Platform
Aug 24, 2022 · Big Data

How ByteDance Guarantees Real‑Time Data Point Quality with Scalable Validation

This article explains ByteDance's end‑to‑end validation system for event‑tracking data points, covering its technical challenges (usability, accuracy, real‑time visibility, stability, and extensibility) along with SDK integration, QR‑code workflow, JSON‑Schema verification, push‑service architecture, SLA metrics, and future automation plans.

Big Data · JSON Schema · Push Service
0 likes · 11 min read
Python Programming Learning Circle
Aug 22, 2022 · Big Data

20 Data Visualization Tools: From Entry‑Level to Expert Solutions

This article surveys twenty data‑visualization tools, covering entry‑level options like Excel, online JavaScript libraries such as D3 and Google Chart API, interactive GUI utilities, map frameworks, advanced desktop environments, and expert‑grade platforms like R, Weka and Gephi, and highlights their key features, supported formats, and typical use cases.

Big Data · JavaScript · Mapping
0 likes · 11 min read
DataFunSummit
Aug 21, 2022 · Big Data

Alluxio Stress Testing Methods and Practices

This article explains the purpose, sources, and manifestations of pressure in Alluxio, describes its built‑in stress testing framework, outlines how to run and configure stress tools, and provides guidance on result calculation, reporting, common issues, and debugging for effective performance evaluation.

Alluxio · Big Data · Performance Evaluation
0 likes · 11 min read