Tagged articles

3675 articles

Page 25 of 37

Aug 23, 2020 · Big Data

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Apache Hudi · Big Data · Data Lake

0 likes · 18 min read

Big Data Technology & Architecture

Aug 22, 2020 · Big Data

Integrating Kerberos with Spark on CDH: Configuration, Deployment, and Troubleshooting Guide

This guide explains how to prepare a CDH‑based Spark environment for Kerberos authentication, covering prerequisite knowledge, classpath adjustments, HBase configuration files, Spark‑Env settings, user permission grants, Spark‑Submit execution, and common troubleshooting steps.

Big Data · CDH · HBase

0 likes · 12 min read

Java Architect Essentials

Aug 21, 2020 · Big Data

Design and Integration of Flume, Kafka, Storm, Drools, and Redis for Real‑Time ETL Log Analysis

This article presents a modular architecture for real‑time ETL log analysis that combines Flume for log collection, Kafka as a buffering layer, Storm for stream processing, Drools for rule‑based data transformation, and Redis for fast storage, detailing installation, configuration, and code integration steps.

Big Data · Drools · Flume

0 likes · 23 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Practical Guide to Building an Advertising Project with Spark and Kudu

This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, Yarn execution, and crontab scheduling for daily automated runs.

Big Data · Hadoop · Kudu

0 likes · 11 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Data Statistics and Processing Guide

This article demonstrates how to implement an advertising business data statistics pipeline using Spark and Kudu, detailing metric requirements, Scala processing code, complex SQL aggregations, schema design, and data sinking for verification.

Big Data · Kudu · Scala

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Step-by-Step Implementation

This article walks through the complete implementation of an advertising statistics pipeline using Spark and Kudu, covering requirement analysis, Scala code development, SQL queries, schema definition, and data sinking, with full code snippets and execution results.

Big Data · Kudu · Scala

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Refactoring, Scala Traits, ETL Processor, and Project Entry

This article walks through a Spark and Kudu advertising project, explaining the refactoring approach, Scala trait usage, implementation of ETL and province‑city statistics processors, and shows the complete Spark application entry point with full code examples.

Big Data · ETL · Kudu

0 likes · 7 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Province‑City Statistics and Data Persistence

This tutorial walks through a Spark‑Kudu advertising project that computes province‑city distribution statistics using SQL, defines the necessary schema, and demonstrates how to write the aggregated results back to a Kudu table for persistent storage, complete with Scala code examples.

Big Data · Kudu · Scala

0 likes · 4 min read

Huawei Cloud Developer Alliance

Aug 21, 2020 · Big Data

How Big Data and IoT Are Transforming Vehicle Networks: Opportunities and Challenges

This article explains the concepts of the Internet of Things and big data, explores how massive sensor data fuels smart transportation and vehicle networking, outlines practical applications such as real‑time traffic control and autonomous driving, and analyzes the technical and managerial bottlenecks hindering future growth.

Big Data · IoT · Smart Transportation

0 likes · 13 min read

Liangxu Linux

Aug 19, 2020 · Operations

How to Quickly Analyze Beijing Residency Data with Shell Commands

This tutorial shows how to use standard Unix shell tools such as grep, cut, sort, uniq, awk, and join to extract insights—top companies, most common surnames, popular given names, age distribution, and hometown statistics—from a JSON dataset of over 6,000 Beijing residency applicants.

Big Data · JSON · Shell

0 likes · 13 min read

Big Data Technology & Architecture

Aug 19, 2020 · Big Data

Big Data ETL Project: Parsing Advertising JSON with Spark, IP Lookup, and Storing into Kudu

This tutorial describes how to place advertising JSON data on HDFS, use Spark for ETL and analysis, enrich logs with IP lookup, and persist the results into Kudu with daily scheduling, including code examples and schema definitions.

Big Data · ETL · IP lookup

0 likes · 17 min read

dbaplus Community

Aug 18, 2020 · Big Data

Designing a Scalable Financial Data Warehouse: Modeling, Layers, and Quality Control

This article outlines a comprehensive approach to building a financial data warehouse, covering background needs, modeling methodologies, a layered architecture (I, C, S, R), data quality monitoring, metadata management, and detailed naming and coding standards to ensure maintainable, high‑quality data pipelines.

Big Data · Data Quality · Metadata Management

0 likes · 14 min read

Suning Technology

Aug 18, 2020 · Backend Development

Boosting Mega‑Sale Stability: Suning’s Backend Data Components in Action

The article details how Suning’s transaction middle‑platform leverages custom TPS collection, advanced flow‑control, big‑data analytics, and AI‑driven forecasting to ensure system stability, capacity planning, and intelligent inventory distribution during the high‑traffic 818 promotional event.

AI · Backend · Big Data

0 likes · 17 min read

Big Data Technology & Architecture

Aug 18, 2020 · Big Data

End-to-End Real-Time Web Log Processing with Flume, Kafka, Spark Streaming, HBase, and Spring Boot

This tutorial demonstrates how to generate simulated web access logs in Python, schedule them with Crontab, collect them in real time using Flume, forward them to Kafka, process the streams with Spark Streaming, store results in HBase, and visualize the data via a Spring Boot application with ECharts.

Big Data · ECharts · Flume

0 likes · 36 min read

Beike Product & Technology

Aug 17, 2020 · Big Data

Bitmap-Based User Segmentation in a DMP Platform Using ClickHouse

This article describes how a data management platform (DMP) at Beike leverages ClickHouse bitmap structures and Spark pipelines to generate global numeric user IDs, design tag-specific bitmap rules for enum, continuous, and date attributes, handle boundary cases, and produce high‑performance bitmap SQL for real‑time user group estimation and complex segment logic.

Big Data · ClickHouse · DMP

0 likes · 17 min read

Big Data Technology & Architecture

Aug 17, 2020 · Big Data

Complex Event Processing (CEP) with Flink: Concepts, Pattern API, and a Scala Practical Example

This article introduces Complex Event Processing (CEP), explains its core concepts and features, details Flink's Pattern API with individual, combined, and group patterns, and provides a complete Scala example that detects three consecutive login failures within three seconds using Flink CEP.

Big Data · CEP · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Aug 16, 2020 · Big Data

Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data · CLI · HA

0 likes · 10 min read

Big Data Technology & Architecture

Aug 15, 2020 · Big Data

Step-by-Step Guide to Building an ELK Stack with Kafka, Zookeeper, Logstash, and Filebeat for Log Collection

This tutorial provides a comprehensive, step-by-step procedure for setting up a log‑collection pipeline using Filebeat, Kafka, Zookeeper, Logstash, Elasticsearch, and Kibana across multiple servers, covering hardware preparation, system tuning, software installation, configuration files, and verification commands.

Big Data · ELK · Filebeat

0 likes · 11 min read

Big Data Technology & Architecture

Aug 15, 2020 · Big Data

Understanding Data Lakes: Concepts, Architecture, Vendor Solutions, and Practical Use Cases

This comprehensive article explains what a data lake is, outlines its core characteristics and reference architecture, compares major cloud providers' data‑lake offerings, presents typical advertising and gaming use cases, and proposes a practical, agile process for building and operating a data lake.

Big Data · Cloud Native · Data Architecture

0 likes · 50 min read

Suning Technology

Aug 14, 2020 · Big Data

Building Suning’s Supply‑Chain Data Platform with DDD and Big‑Data Design

This article recounts Suning’s step‑by‑step journey of designing and implementing a supply‑chain data middle platform, outlining its business rationale, DDD‑based domain modeling, layered system architecture, and practical deployment insights that illustrate how a tailored big‑data solution can enhance data services and governance.

Big Data · DDD · Data Governance

0 likes · 11 min read

Huolala Tech

Aug 13, 2020 · Operations

How Huolala’s “Smart Brain” Uses AI and Optimization to Revolutionize Logistics

At the 2020 Global Logistics Technology Conference in Haikou, Huolala CTO Zhang Hao detailed the company’s self‑developed “Smart Brain” system, which leverages AI, big‑data analytics, IoT and custom optimization algorithms to achieve real‑time, intelligent dispatch, dynamic pricing and safer, more efficient logistics operations.

AI · Big Data · IoT

0 likes · 6 min read

Aikesheng Open Source Community

Aug 13, 2020 · Databases

Introduction to ClickHouse: Features, Installation, Performance Testing, and Comparison

This article introduces ClickHouse, an open‑source column‑oriented OLAP database, detailing its key features, appropriate use cases, installation steps, performance benchmark queries, and how it compares with other columnar storage solutions while highlighting its adoption by major internet companies.

Big Data · ClickHouse · Columnar Database

0 likes · 10 min read

Architecture Digest

Aug 13, 2020 · Big Data

Synchronizing Billion-Row MySQL Data to HBase: Three Practical Schemes and Implementation Guide

This comprehensive guide details three practical methods for syncing massive MySQL datasets to HBase—including Sqoop, Kafka‑Thrift, and Flink pipelines—covering environment setup, configuration, code examples, performance comparisons, and optimization tips for large‑scale data ingestion and querying.

Big Data · Flink · HBase

0 likes · 24 min read

Big Data Technology & Architecture

Aug 13, 2020 · Big Data

Configuring Kerberos‑Enabled HDFS Access with Maven in a Hadoop Cluster

This guide walks through setting up a Maven project, adding Hadoop dependencies, configuring Kerberos (krb5.conf and keytab), loading core‑site.xml, and providing Java utility classes to initialize the HDFS client and list files in an HA‑enabled Hadoop cluster.

Big Data · HDFS · Hadoop

0 likes · 5 min read

Big Data Technology Architecture

Aug 13, 2020 · Databases

Deep Dive into Apache Druid V1 Storage Format: Index Structures and Disk Layout

This article provides a detailed analysis of Apache Druid V1's column‑oriented storage format, covering dimension dictionaries, variable‑length encoded values, bitmap inverted indexes, array handling, and the physical metadata layout that enables sub‑second OLAP queries on massive datasets.

Apache Druid · Big Data · Bitmap Index

0 likes · 8 min read

Tencent Cloud Middleware

Aug 12, 2020 · Big Data

How Serverless Functions Can Replace Traditional Kafka Data Pipelines for Lower Cost and Easier Scaling

This article explains how Tencent Cloud CKafka works, describes the challenges of traditional open‑source data‑flow solutions, and demonstrates a Serverless Function approach—complete with architecture diagrams and code examples—to achieve low‑cost, auto‑scaling Kafka‑to‑Elasticsearch pipelines.

Big Data · CKafka · Elasticsearch

0 likes · 12 min read

IT Architects Alliance

Aug 12, 2020 · Big Data

Introduction to Confluent KSQL for Real-Time Stream Processing

This article introduces Confluent KSQL, a SQL‑based real‑time stream processing engine for Kafka, covering its architecture, stream vs table concepts, query lifecycle, Docker‑based setup, DDL commands, example joins, windowed aggregations, connectors, and its advantages and limitations.

Big Data · Docker · KSQL

0 likes · 9 min read

Big Data Technology & Architecture

Aug 12, 2020 · Big Data

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.

Big Data · Flume · Hadoop

0 likes · 9 min read

Architects' Tech Alliance

Aug 11, 2020 · Big Data

Comprehensive Overview of Data Middle Platform Architecture, Components, and Practices

This article provides an extensive summary of data middle platform concepts, covering data aggregation, collection tools, offline and real‑time development, data governance, service layers, warehouse construction, and operational practices, illustrating how enterprises build and manage a unified data ecosystem.

Big Data · Data Governance · Data Middle Platform

0 likes · 27 min read

Big Data Technology & Architecture

Aug 11, 2020 · Big Data

Consuming Kerberos‑Protected Kafka Data with Spark Streaming and Storing into Kudu

This guide demonstrates how to configure a Spark Streaming application running on YARN in cluster mode to securely consume Kerberos‑protected Kafka topics and write the processed data into Kudu tables, including necessary Java code, Kerberos keytab setup, Kafka client configuration, and spark‑submit commands.

Big Data · Kafka · Kerberos

0 likes · 11 min read

Big Data Technology & Architecture

Aug 10, 2020 · Big Data

Real-time Hot Item, PV, and UV Statistics Using Apache Flink, Kafka, and Bloom Filter

This article demonstrates how to implement real-time hot item ranking, page view counting, and unique visitor estimation using Apache Flink with Kafka sources, sliding windows, custom aggregation functions, and a Bloom filter backed by Redis, providing complete Scala code examples.

Big Data · Flink · Kafka

0 likes · 15 min read

Big Data Technology & Architecture

Aug 10, 2020 · Fundamentals

Understanding Bloom Filter: Concept, Principles, Implementation, and Applications

This article explains the concept, principles, implementation details, and practical applications of Bloom Filters, including formulas for optimal bit array size and hash count, Java code examples using Guava, and common use cases such as deduplication, web crawling, and spam filtering.

Big Data · Guava · bloom-filter

0 likes · 12 min read
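
As a quick illustration of the sizing formulas this summary refers to, the commonly cited ones are m = -n ln p / (ln 2)^2 bits for the bit array and k = (m / n) ln 2 hash functions, where n is the expected number of items and p the target false-positive rate. The Python sketch below only evaluates those formulas; the function name `bloom_parameters` is an illustrative assumption and is not taken from the article, whose own examples are in Java with Guava.

```python
import math


def bloom_parameters(expected_items: int, false_positive_rate: float) -> tuple[int, int]:
    """Return (bit_array_size, hash_count) for a Bloom filter.

    Uses the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.
    The name and signature are hypothetical, chosen for this sketch only.
    """
    if expected_items <= 0:
        raise ValueError("expected_items must be positive")
    if not 0.0 < false_positive_rate < 1.0:
        raise ValueError("false_positive_rate must be strictly between 0 and 1")

    n = expected_items
    p = false_positive_rate
    # Number of bits, rounded up so the target rate is not exceeded.
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    # Number of hash functions, rounded to the nearest integer (at least 1).
    k = max(1, round(m / n * math.log(2)))
    return m, k


if __name__ == "__main__":
    bits, hashes = bloom_parameters(1_000_000, 0.01)
    print(f"bits={bits} (~{bits / 8 / 1024 / 1024:.2f} MiB), hash functions={hashes}")
```

For one million items at a 1% false-positive rate this works out to roughly 9.6 bits per item and 7 hash functions.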

Python Crawling & Data Mining

Aug 8, 2020 · Big Data

How Python Data Mining Uncovers Why '30 Only' Became a Summer Hit

This article uses Python to scrape and analyze Douban ratings, user comments, and Tencent Video danmu (bullet‑screen comments) for the TV drama “30 Only”, revealing the show’s explosive popularity, the most discussed characters, and audience sentiment through statistical charts and word‑cloud visualizations.

Big Data · Python · TV Drama Analysis

0 likes · 11 min read

Big Data Technology & Architecture

Aug 8, 2020 · Big Data

Setting Up InfluxDB and Grafana for Flink Metrics Monitoring

This guide walks through installing InfluxDB and Grafana on CentOS, configuring InfluxDB for Flink metrics storage, creating databases and retention policies, integrating the Flink InfluxDB reporter, and building Grafana dashboards to visualize real‑time Flink job metrics.

Big Data · Flink · Grafana

0 likes · 8 min read

Ctrip Technology

Aug 6, 2020 · Big Data

Data Governance Practices and Model Design in Ctrip Vacation Data Warehouse

This article shares the practical experience and thinking behind Ctrip's vacation data governance project, covering team efficiency optimization, demand sorting, data domain definition, warehouse layering, unified dimension modeling, metric standardization, and the overall benefits of a centralized data governance framework.

Big Data · Ctrip · Data Governance

0 likes · 17 min read

Youku Technology

Aug 6, 2020 · Big Data

Alibaba Entertainment Data Platform: The Journey Ahead

The presentation outlines how Alibaba's entertainment data platform has evolved to meet the real‑time, low‑cost, and scalable analytics demands of campaigns such as Double 11 and 618, detailing its architecture, real‑time processing, pre‑computed data cubes, practical design choices, and lessons learned from implementation challenges.

Big Data · real-time analytics

0 likes · 1 min read

Big Data Technology & Architecture

Aug 6, 2020 · Big Data

Flink Configuration Parameters and Related Tuning for Kafka and Yarn

This article provides a comprehensive guide to configuring Apache Flink—including job manager and task manager settings, high‑availability via Zookeeper, metrics reporting, as well as Kafka producer tuning and Yarn resource adjustments—to help practitioners optimize big‑data streaming jobs.

Big Data · Flink · HA

0 likes · 8 min read

Big Data Technology & Architecture

Aug 5, 2020 · Big Data

An Introduction to Apache Kylin: Architecture, Core Concepts, Installation, and Enterprise Use Cases

This article provides a comprehensive overview of Apache Kylin, covering its background, core OLAP concepts, technical architecture, installation steps, cube-building methods, real‑world enterprise deployments, and resources for further learning, illustrating how it enables sub‑second query performance on massive datasets.

Apache Kylin · Big Data · Cube

0 likes · 20 min read

Fulu Network R&D Team

Aug 4, 2020 · Big Data

Practical Experience with State Management in Flink Real‑Time Stream Processing

This article shares practical experiences and insights on using different types of state in Apache Flink for real‑time stream processing, covering managed versus raw state, code examples in Scala and Java, handling late data, dimension table joins, distinct semantics, and best‑practice recommendations.

Big Data · Flink · Managed State

0 likes · 15 min read

Dada Group Technology

Aug 4, 2020 · Big Data

Design and Implementation of the Tianhe Data Tracking Management Platform at Dada Group

The article describes how Dada Group created the Tianhe platform to centralize, standardize, and automate massive data‑tracking (event instrumentation) requirements across multiple product lines, detailing its goals, architecture, core functions, current status, and future development directions.

Big Data · Data Quality · Data Tracking

0 likes · 10 min read

21CTO

Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data · ETL · Spark

0 likes · 22 min read

DataFunTalk

Aug 1, 2020 · Big Data

User Profiling Methodology and Engineering Solutions

This article explains the fundamentals of user profiling in the big data era, covering tag types, data architecture, development modules, a step‑by‑step implementation process, a practical e‑commerce case study, table design strategies, and both quantitative and qualitative profiling methods.

Big Data · ETL · machine learning

0 likes · 22 min read

Tianxing Digital Tech User Experience

Jul 31, 2020 · Big Data

How Pandemic Data Visualization Evolved: From John Snow’s Cholera Map to Modern COVID Dashboards

This article traces the history and development of pandemic data visualization—from 19th‑century cholera maps and early 2000s SARS charts to sophisticated COVID‑19 dashboards—while outlining five essential design principles that make such visualizations clear, engaging, and impactful.

Big Data · COVID-19 · design principles

0 likes · 13 min read

Programmer DD

Jul 31, 2020 · Big Data

How to Find Common URLs in 5 Billion‑Entry Files with Only 4 GB RAM

This article explains how to locate the intersecting URLs between two 5‑billion‑record files (≈320 GB total) using a hash‑based divide‑and‑conquer method that fits within a strict 4 GB memory limit.

Big Data · Memory Optimization · URL intersection

0 likes · 3 min read
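
The hash-based divide-and-conquer idea named in this summary can be sketched in a few lines of Python. Both inputs are split into bucket files using the same stable hash, so any URL present in both inputs lands in the same bucket index; each bucket pair is then small enough to intersect in memory. Everything here is an assumption made for illustration (the function names, the CRC32 hash, the bucket count, and keeping the result in memory); it is not the article's implementation, and a real run at that scale would size the bucket count to the memory limit and write matches to disk.

```python
import os
import tempfile
import zlib
from typing import Iterable, Iterator


def partition(urls: Iterable[str], out_dir: str, prefix: str, buckets: int) -> None:
    """Write each URL to the bucket file selected by a stable hash of the URL."""
    files = [
        open(os.path.join(out_dir, f"{prefix}_{i}.txt"), "w", encoding="utf-8")
        for i in range(buckets)
    ]
    try:
        for url in urls:
            # CRC32 is stable across runs, unlike Python's salted built-in hash().
            index = zlib.crc32(url.encode("utf-8")) % buckets
            files[index].write(url + "\n")
    finally:
        for handle in files:
            handle.close()


def read_bucket(path: str) -> Iterator[str]:
    """Stream URLs back from one bucket file, one per line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def common_urls(urls_a: Iterable[str], urls_b: Iterable[str], buckets: int = 4) -> set[str]:
    """Return URLs present in both inputs, intersecting one bucket pair at a time."""
    result: set[str] = set()
    with tempfile.TemporaryDirectory() as tmp:
        partition(urls_a, tmp, "a", buckets)
        partition(urls_b, tmp, "b", buckets)
        for i in range(buckets):
            # Only one bucket of A is held in memory at any moment.
            seen = set(read_bucket(os.path.join(tmp, f"a_{i}.txt")))
            for url in read_bucket(os.path.join(tmp, f"b_{i}.txt")):
                if url in seen:
                    result.add(url)
    return result


if __name__ == "__main__":
    a = ["http://example.com/1", "http://example.com/2", "http://example.com/3"]
    b = ["http://example.com/3", "http://example.com/4", "http://example.com/1"]
    print(sorted(common_urls(a, b)))
```

The key design point is that both files must be partitioned with the same hash function and the same bucket count; otherwise matching URLs could end up in different bucket pairs and be missed.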

Tencent Cloud Developer

Jul 30, 2020 · Big Data

Cost Governance Practices in Youzan's Data Middle Platform

Youzan's data middle platform faced cost growth outpacing business due to low utilization and storage inefficiencies; they applied utilization standards, containerization, COS storage migration, offline task optimization, and fine-grained cost-billing, achieving a 12% compute boost, 17% batch savings, 80% storage cost cut, and over 25% overall cost reduction.

Big Data · Cloud Computing · Containerization

0 likes · 24 min read

Big Data Technology & Architecture

Jul 30, 2020 · Big Data

Understanding Bucket Sampling Queries in Hive

This article explains Hive's bucket sampling syntax, demonstrates how to use the TABLESAMPLE clause with various bucket parameters, provides concrete SQL examples, and clarifies the underlying hash‑based mechanism that determines which rows are returned.

Big Data · Bucket Sampling · Tablesample

0 likes · 4 min read

Big Data Technology & Architecture

Jul 29, 2020 · Big Data

Sqoop Tutorial: Importing and Exporting Data between Relational Databases, HDFS, Hive, and HBase

This article provides a comprehensive guide to using Sqoop for importing data from relational databases into HDFS, Hive, and HBase, as well as exporting data back to databases, covering command syntax, options, and practical examples for big‑data workflows.

Big Data · HBase · HDFS

0 likes · 8 min read

Tencent Cloud Developer

Jul 29, 2020 · Big Data

Case Study: Optimizing Tencent Cloud Elasticsearch for High‑Volume Game Log Analytics

To handle a gaming company's million‑QPS log stream, the team built a hot‑cold Tencent Cloud Elasticsearch cluster with ILM‑driven tiering, scaled CPU/heap, reduced shard count via shrink and replica tweaks, tuned Logstash‑Kafka pipelines, and employed COS snapshots and searchable snapshots, achieving stable performance and lower cost.

Big Data · Elasticsearch · ILM

0 likes · 29 min read

Youzan Coder

Jul 29, 2020 · Big Data

How We Migrated a 200‑Node Hadoop Cluster Across Data Centers: Lessons and Strategies

This article presents a comprehensive case study of migrating a 200‑plus node Hadoop offline platform across data centers, covering background, architecture, solution evaluation, detailed implementation steps, consistency checks, operational safeguards, encountered issues, and future recommendations.

Big Data · DP Platform · Data Consistency

0 likes · 21 min read

Big Data Technology & Architecture

Jul 28, 2020 · Big Data

Enabling CGroup in Hadoop Yarn NodeManager to Limit Container CPU Resources

This article explains how to enable Linux CGroup support in Hadoop Yarn NodeManager to limit container CPU usage, detailing required configuration properties, hierarchy setup, CPU limit parameters, and a critical kernel version caveat.

Big Data · CPU · Hadoop

0 likes · 7 min read

MaGe Linux Operations

Jul 28, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Orders

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Didi, and 58.com deploy and evolve Elasticsearch clusters to handle massive order data, log analysis, real‑time monitoring, and security tasks, detailing architecture choices, shard strategies, multi‑cluster designs, and performance optimizations.

Big Data · Elasticsearch · Order Management

0 likes · 11 min read

Xianyu Technology

Jul 28, 2020 · Operations

ShenTan: Automated Fault Localization System for Online Services

ShenTan is an automated fault‑localization platform for online services that quickly (under five seconds) pinpoints server‑side issues with developer‑level accuracy by aggregating real‑time metrics, applying a decision‑tree model enriched by expert knowledge and dynamic thresholds, and presenting results through an integrated alert and visualization system, while planning broader endpoint coverage and multi‑tenant support.

Automation · Big Data · Fault Localization

0 likes · 12 min read

Big Data Technology & Architecture

Jul 27, 2020 · Big Data

How to View Hadoop/YARN Application Logs via History Server and Yarn Commands

This guide explains how to retrieve Hadoop/YARN application logs using the History Server UI, Yarn command‑line tools, and direct HDFS log access, including commands for listing applications, fetching specific logs, and locating the remote log directory.

Big Data · CLI · HDFS

0 likes · 4 min read

dbaplus Community

Jul 26, 2020 · Big Data

How Prometheus Powers Scalable Monitoring for Massive Big Data Clusters

Facing thousands of nodes in expanding big‑data clusters, the author evaluates legacy monitoring stacks, selects Prometheus + Alertmanager + Grafana, and details its architecture, custom exporters, real‑time alerts, self‑healing mechanisms, and visual dashboards that now support ten large clusters and dozens of services.

Alertmanager · Big Data · Grafana

0 likes · 11 min read

DataFunTalk

Jul 23, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Control, and Metadata Management

This article outlines the end‑to‑end design of a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future improvement directions.

Big Data · Data Quality · Metadata Management

0 likes · 11 min read

Big Data Technology & Architecture

Jul 23, 2020 · Big Data

Comprehensive Kafka FAQ: Uses, Architecture, Offsets, and Partition Management

This article provides an extensive overview of Apache Kafka, covering its use cases, key concepts such as ISR, AR, HW, LEO, and LW, message ordering, the roles of partitioners, serializers and interceptors, producer and consumer client architecture, offset handling, multithreaded consumption, and topic partition management.

Big Data · Kafka · Message Queue

0 likes · 16 min read

dbaplus Community

Jul 22, 2020 · Databases

How to Optimize Real‑Time Vector Tile Services for Millions of Features with PostgreSQL & PostGIS

This article explains how to efficiently browse and render millions of GIS features in real‑time vector tiles using PostgreSQL and PostGIS, covering background challenges, several thinning algorithms, their implementation steps, limitations, advantages, and a practical example with a 3‑million‑point dataset.

Big Data · Data Dilution · GIS

0 likes · 8 min read

Big Data Technology & Architecture

Jul 22, 2020 · Big Data

Kafka Architecture and Core Concepts: Producers, Brokers, and Consumers

This article explains Kafka's fundamental architecture, including the roles of producers, brokers, and consumers, key concepts such as topics, partitions, replicas, ISR, and controller, as well as detailed mechanisms of producer client structure, interceptors, serializers, partitioners, and consumer group rebalancing strategies.

Big Data · Distributed Systems · Kafka

0 likes · 22 min read

Alibaba Cloud Developer

Jul 22, 2020 · Big Data

Exploring the Apache Big Data Ecosystem: Hadoop, Spark, Flink, and More

This article surveys the rapidly evolving big data landscape by reviewing a wide range of Apache projects—including Hadoop, Spark, Flink, HBase, Kudu, Impala, Kafka, and others—detailing their core components, architectures, strengths, and typical use‑cases for building distributed data platforms.

Apache · Big Data · Distributed Systems

0 likes · 20 min read

Tencent Cloud Developer

Jul 21, 2020 · Big Data

Scaling Tencent Meeting Video Stream Quality Analysis with Tencent Cloud Elasticsearch

Facing explosive growth and massive video‑stream quality data, Tencent Meeting migrated its custom Lucene‑based analysis engine to Tencent Cloud Elasticsearch, which delivered over 1 million writes per second, automatic sharding, reduced latency from hours to seconds, and sustained 99.99% availability, proving a high‑performance, scalable solution for large‑scale video conferencing.

Big Data · Cloud Computing · Elasticsearch

0 likes · 16 min read

Big Data Technology & Architecture

Jul 20, 2020 · Big Data

Kafka Workflow and File Storage Mechanism: Topics, Partitions, Segments, Index and Log Files

This article explains Kafka’s workflow, detailing how topics, partitions, and segments are organized, the structure of index and log files, message composition, offset-based retrieval, and the overall data directory layout, providing a comprehensive overview of Kafka’s storage architecture.

Big Data · Kafka · OFFSET

0 likes · 8 min read

Big Data Technology & Architecture

Jul 19, 2020 · Big Data

An Overview of Hive, HBase Integration, Apache Phoenix, and Lealone in the Big Data Ecosystem

This article explains Hive's role as a Hadoop‑based data warehouse, its integration with HBase, the advantages and drawbacks of that combination, introduces Apache Phoenix as a high‑performance SQL layer on HBase, and describes the open‑source NewSQL database Lealone, providing practical usage scenarios and performance comparisons.

Big Data · HBase · Lealone

0 likes · 9 min read

Big Data Technology & Architecture

Jul 18, 2020 · Big Data

Common Spark SQL, Spark Core, PySpark, and Streaming Issues and Their Solutions

This article compiles frequent Spark SQL, Spark Core, PySpark, and Streaming problems—such as filesystem errors, configuration pitfalls, memory limits, shuffle failures, and version incompatibilities—along with concise explanations of their causes and step‑by‑step remediation methods for big‑data environments.

Big Data · PySpark · Spark

0 likes · 14 min read

Python Crawling & Data Mining

Jul 17, 2020 · Big Data

What Do Gaokao Numbers Reveal? Python-Powered Deep Dive into China’s College Admissions

This article uses Python to scrape and analyze over 2,900 Chinese university and major data points, revealing trends in Gaokao participation, provincial enrollment, university types, popularity rankings, and public curiosity about majors, all illustrated with charts and code examples.

Big Data · Gaokao · Python

0 likes · 12 min read

Beike Product & Technology

Jul 16, 2020 · Backend Development

Kafka Connect: Introduction and Concepts for Data Pipelines

This article introduces Kafka Connect, a framework for building scalable data pipelines between Kafka and other systems, covering its architecture, key concepts like connectors and tasks, and practical deployment examples.

Big Data · Distributed Systems · ETL

0 likes · 20 min read

Ctrip Technology

Jul 16, 2020 · Big Data

Design and Architecture of the User Profiling System at Ctrip Business Travel

This article describes the concept, tag taxonomy, data flow architecture, and Lambda‑based query service design of Ctrip Business Travel's user profiling system, highlighting how batch and real‑time processing with Spark, Flink, Hive, MongoDB and Redis enable precise marketing, risk control and personalized services.

Big Data · Ctrip · data pipeline

0 likes · 12 min read

Big Data Technology & Architecture

Jul 16, 2020 · Big Data

Spark Configuration Parameters and Performance Tuning Guidelines

This article explains the purpose, default values, and practical tuning recommendations for common Spark submit options such as executor counts, memory settings, shuffle behavior, speculation, and various Spark SQL configurations to help users optimize big‑data workloads.

Big Data · Executor · Performance Tuning

0 likes · 14 min read

Architect

Jul 15, 2020 · Big Data

Understanding Flink Task Slots, Resource Allocation, and Slot Sharing Mechanisms

This article explains how Flink uses task slots to partition TaskManager resources, the benefits of slot sharing, the interaction between Scheduler, SlotPool, and ResourceManager, and the internal classes such as LogicalSlot, PhysicalSlot, and SlotSharingManager that enable resource isolation and sharing in stream processing jobs.

Big Data · Flink · Resource Management

0 likes · 6 min read

Youzan Coder

Jul 15, 2020 · Big Data

Design and Implementation of Youzan ABTest System for Data‑Driven Growth

Youzan created an internal A/B testing platform—combining Java/Node SDKs, a real‑time data pipeline, and a metadata‑driven workflow—to enable data‑driven product iteration, granular traffic allocation, automated logging, statistical analysis, and scalable growth insights across its merchant services, while planning further automation and integration.

A/B testing · Big Data · Experiment Platform

0 likes · 19 min read

Huolala Tech

Jul 15, 2020 · Big Data

How to Build Smart, Scalable Data Tracking Solutions for Comprehensive Analytics

This article explores the fundamentals, common schemes, pain points, and a smart end‑to‑end solution for data tracking (埋点), offering practical guidelines, architectural diagrams, and a concrete example to help engineers implement comprehensive, controllable, and efficient event collection pipelines.

Analytics · Big Data · Data Tracking

0 likes · 9 min read

58 Tech

Jul 13, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Monitoring, and Metadata Management

This article presents a comprehensive design and implementation guide for a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future development directions.

Big Data · Data Quality · ETL

0 likes · 11 min read

Big Data Technology & Architecture

Jul 13, 2020 · Big Data

Understanding and Optimizing Flink Checkpoint Mechanism for Large-Scale State

This article explains Flink's checkpoint mechanism, outlines key performance metrics, discusses interval configuration, external state storage choices, resource allocation, and task-local recovery strategies to improve checkpoint speed and reliability in large‑scale state scenarios.

Big Data · Checkpoint · Flink

0 likes · 5 min read

Architects Research Society

Jul 12, 2020 · Databases

GraphTech Ecosystem Overview: Graph Database Landscape and Storage Options (2019)

This article surveys the 2019 GraphTech ecosystem, detailing the rapid growth of graph databases, market drivers, ecosystem layers, and the variety of native and multi‑model storage systems that support graph‑structured data.

Big Data · Database Ecosystem · Storage Systems

0 likes · 7 min read

Big Data Technology & Architecture

Jul 12, 2020 · Big Data

Design and Implementation of Ozone Data Exploration Service (Recon Server)

This article explains the design of a data exploration service for large‑scale distributed storage systems, detailing metadata synchronization, index reconstruction, aggregation tables, node‑level statistics, a user console, and the transition from checkpoint‑based snapshots to delta updates using RocksDB WAL in Hadoop Ozone Recon Server.

Big Data · Delta Updates · Ozone

0 likes · 9 min read

Big Data Technology & Architecture

Jul 10, 2020 · Big Data

Creating a Test Table in Phoenix/HBase and Implementing a Custom Bitmap Aggregation Function in Spark

This tutorial demonstrates how to create a VARBINARY test table in HBase using Phoenix, serialize its data with RoaringBitmap, implement a custom Spark aggregation function to merge bitmap values, and query the table via Spark SQL, showcasing a practical big-data processing workflow.

Big Data · HBase · Phoenix

0 likes · 6 min read

GrowingIO Tech Team

Jul 9, 2020 · Big Data

How BitMap Storage Boosts Event Analysis Performance in Big Data Platforms

This article explains GrowingIO's event analysis data model, the challenges of metric‑dimension calculations on massive datasets, and how a BitMap‑based vertical storage and dimension‑combination numbering dramatically improve query efficiency and scalability.

Big Data · bitmap · event analysis

0 likes · 11 min read

Big Data Technology & Architecture

Jul 9, 2020 · Big Data

How ZooKeeper Supports HBase: Coordination, Fault Tolerance, Log Splitting, META Table Management, and Replication

This article explains how ZooKeeper functions as a distributed coordination service for HBase, detailing its role in master and RegionServer fault tolerance, log splitting, META table location tracking, and replication management, illustrating the underlying ZNode structures and failover mechanisms.

Big Data · Distributed Coordination · HBase

0 likes · 7 min read

Sohu Tech Products

Jul 8, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction: A Layered Task‑Instance Approach

The article analyzes data‑warehouse workflow scenarios, explains core concepts such as OLAP, multidimensional modeling and layer architecture, reviews existing workflow engines like Azkaban, Oozie and Airflow, and proposes a task‑and‑instance layered optimization that simplifies dependency configuration, improves collaboration, and supports complex scheduling in modern big‑data environments.

Big Data · ETL · Workflow

0 likes · 21 min read

Big Data Technology & Architecture

Jul 8, 2020 · Big Data

Using Spark SQL User-Defined Functions, Aggregate Functions, and Window Functions

This article demonstrates how to create and register custom scalar UDFs, untyped and type‑safe aggregate functions (UDAF and Aggregator) in Spark SQL, and how to apply window functions such as ROW_NUMBER, providing complete Scala code examples and execution results.

Aggregator · Big Data · Scala

0 likes · 16 min read

dbaplus Community

Jul 7, 2020 · Big Data

How Flink + ClickHouse Power Real‑Time Analytics at Scale

This article explains how QuTouTiao builds a high‑performance real‑time analytics pipeline using Flink, Hive, and ClickHouse, covering business scenarios, hour‑level and second‑level Flink‑to‑Hive architectures, streaming file sink mechanics, multi‑user permissions, ClickHouse performance tricks, and a future roadmap for unified stream‑batch storage.

Big Data · ClickHouse · Flink

0 likes · 18 min read

Programmer DD

Jul 7, 2020 · Big Data

How to Choose a Worthwhile Technology: Depth, Ecosystem, and Evolution

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help engineers decide which big‑data or stream‑processing technology (such as Hadoop, Spark, or Flink) is worth investing time in, and provides practical tips like using Google Trends and GitHub awesome lists.

Big Data · Flink · Hadoop

0 likes · 12 min read

Big Data Technology & Architecture

Jul 7, 2020 · Big Data

Analysis of Apache Spark's Unified Memory Management Model (Spark 2.2.1)

This article analyzes Apache Spark's executor-side memory management model, focusing on the UnifiedMemoryManager in Spark 2.2.1, detailing on‑heap and off‑heap memory regions, dynamic execution/storage memory sharing, task memory allocation, and practical configuration examples.

Big Data · Executor · Memory Management

0 likes · 10 min read

dbaplus Community

Jul 5, 2020 · Big Data

How a Chinese Bank Built a Real‑Time Log Management Platform with Apollo and Elasticsearch

Facing massive, multi‑system log volumes, China Minsheng Bank’s big‑data team designed a real‑time intelligent log platform by integrating Ctrip’s open‑source Apollo configuration center with Elasticsearch, enabling centralized, versioned, hot‑reloading configuration, role‑based parameter management, and high‑availability deployment across thousands of servers.

Apollo · Big Data · DevOps

0 likes · 30 min read

Big Data Technology & Architecture

Jul 5, 2020 · Big Data

Understanding Spark Memory Management: On‑heap, Off‑heap, and Unified Memory

This article provides a comprehensive overview of Spark's memory management, covering executor memory architecture, the differences between on‑heap and off‑heap memory, static versus unified memory managers, storage and execution memory handling, and practical guidelines for optimizing Spark applications.

Big Data · Executor · Memory Management

0 likes · 21 min read

Architect

Jul 4, 2020 · Big Data

Kuaishou Flink Real‑Time Architecture and Spring Festival Gala Assurance Practices

This article details Kuaishou's Flink‑based real‑time computing architecture, its massive cluster scale, and the comprehensive strategies—including overload protection, system stability, pressure testing, and resource guarantees—implemented to ensure reliable streaming for the 2020 Spring Festival Gala and its real‑time dashboard.

Big Data · Flink · Kuaishou

0 likes · 12 min read

Youzan Coder

Jul 3, 2020 · Big Data

Data Cost Quantification, Billing, and Optimization in a Data Platform

The data‑platform team introduced a self‑sustaining cost‑reduction framework that quantifies CPU, memory, and disk expenses using price‑per‑resource formulas, applies time‑weighted billing, generates multi‑level reports, and drives optimization through six actionable “swords” and incentive‑based operations, achieving roughly 17% offline‑cluster savings within six months.

Big Data · Cost Optimization · Resource Quantification

0 likes · 15 min read

Big Data Technology & Architecture

Jul 2, 2020 · Big Data

KSQL Quick Start: Deploying and Querying Kafka Data with Streaming SQL

This article introduces KSQL as a lightweight streaming SQL engine for Apache Kafka, explains its architecture and core concepts of streams and tables, and provides step‑by‑step deployment instructions, command‑line examples for creating streams/tables, querying data, and managing persistent queries.

Apache Kafka · Big Data · KSQL

0 likes · 10 min read

Big Data Technology & Architecture

Jul 1, 2020 · Big Data

Overview of Spark SQL Adaptive Execution Optimization Engine

This article explains Spark SQL's Adaptive Execution engine, covering its background, dynamic plan adjustments, shuffle partition tuning, data skew mitigation techniques, and the key configuration parameters needed to enable and fine‑tune adaptive query execution for improved performance.

Adaptive Execution · Big Data · Data Skew

0 likes · 7 min read

Youzan Coder

Jul 1, 2020 · Big Data

Mastering HiveCube: Efficient Multi‑Dimensional Aggregation with Grouping Sets

This article explains how HiveCube can replace traditional development for multi‑dimensional aggregation in a data‑warehouse, covering background, theory of cube, with‑cube/rollup/grouping‑sets syntax, grouping_id handling, practical implementation tips, performance tuning, and a comparison with conventional methods.

Big Data · Cube · Grouping Sets

0 likes · 19 min read

Tencent Advertising Technology

Jun 29, 2020 · Artificial Intelligence

2020 Tencent Advertising Rhinoceros Bird Special Research Program Call for Proposals

The Tencent Advertising Rhinoceros Bird Special Research Program, launched in June 2020, invites global academia to collaborate on advertising technology challenges in AI, big data, and related fields, outlining the application process, evaluation criteria, and accompanying Wiztalk lecture series.

Big Data · Tencent Advertising · Wiztalk Lectures

0 likes · 4 min read

Big Data and Microservices

Jun 28, 2020 · Big Data

Data Warehouse vs Data Lake vs Data Platform vs Data Middle Platform: Which Fits Your Business?

This article compares data warehouse, data lake, data platform, and data middle platform, explaining their definitions, architectures, strengths, limitations, and use‑case differences, and provides tables that highlight how each solution handles structured and unstructured data, governance, flexibility, and business value.

Big Data · Data Architecture · Data Lake

0 likes · 12 min read

Full-Stack Internet Architecture

Jun 25, 2020 · Big Data

Step-by-Step Guide to Installing Elasticsearch 7.x, Elasticsearch‑head, and Sample 6.x Configuration

This article provides a comprehensive tutorial on installing a single‑node Elasticsearch 7.x cluster, configuring its key settings, setting up the Elasticsearch‑head web UI, and includes a reference 6.x configuration file for production environments.

Big Data · Elasticsearch · Installation

0 likes · 8 min read

Tencent Cloud Developer

Jun 24, 2020 · Industry Insights

How Industrial Internet Is Reshaping China's Light Manufacturing: Trends, Challenges, and Opportunities

The article analyzes the rapid shift from "Made in China" to "Intelligent Manufacturing" driven by industrial internet, 5G, AI and big data, highlighting policy evolution, case studies across light industry, liquor production and hazardous chemicals, and Tencent Cloud's strategic role in enabling digital transformation.

5G · AI · Big Data

0 likes · 33 min read

Beike Product & Technology

Jun 24, 2020 · Big Data

Beike DMP Platform: Architecture, Implementation Challenges, and Business Impact

The article details Beike's Data Management Platform (DMP) built since May 2018, covering its overall architecture, data collection, processing, real-time profiling, storage solutions, application scenarios, achieved performance metrics, and future development directions.

Beike · Big Data · DMP

0 likes · 9 min read

Big Data and Microservices

Jun 24, 2020 · Industry Insights

What Is a Data Middle Platform and How It Boosts Business Agility

The article explains what a data middle platform is, why it differs from a traditional big‑data platform, the efficiency, collaboration and talent challenges it addresses, its definition as a data‑driven innovation layer built on big data, cloud and AI, and outlines its logical architecture centered on data APIs.

Artificial Intelligence · Big Data · Cloud Computing

0 likes · 6 min read

dbaplus Community

Jun 20, 2020 · Big Data

What’s New in Apache Spark 3.0? Explore Dynamic Partition Pruning, AQE, and More

Apache Spark 3.0, released after a 21‑month development cycle, introduces dynamic partition pruning, adaptive query execution, accelerator‑aware scheduling, DataSource V2, enhanced pandas UDFs, new join hints, richer monitoring, ANSI‑SQL compatibility, SparkR vectorization, Kafka header support, and numerous platform upgrades, all backed by over 3,400 resolved issues.

Adaptive Query Execution · Apache Spark · Big Data

0 likes · 17 min read

Big Data Technology & Architecture

Jun 19, 2020 · Big Data

Comparison of Flink and Spark in Standalone and YARN Deployment Modes

This article compares Apache Flink and Apache Spark in both standalone and YARN deployment modes, detailing their architecture, job scheduling differences, and specific configurations such as Flink’s yarn‑cluster and yarn‑session modes versus Spark’s yarn‑client and yarn‑cluster modes.

Big Data · Comparison · Flink

0 likes · 4 min read

dbaplus Community

Jun 18, 2020 · Databases

How a Hybrid Data Warehouse Transformed Banking Data Services

This article details the 2015 hybrid data‑warehouse design implemented at Guangdong Huaxing Bank, explaining its real‑time, historical, and archival layers, the data‑bus concept, and how mixing in‑memory, relational, and Hadoop technologies addressed modern banking data‑volume, latency, and unstructured‑data challenges.

Banking · Big Data · Hadoop

0 likes · 20 min read

DataFunTalk

Jun 18, 2020 · Big Data

Real-time Data Processing at QuTouTiao: Flink + ClickHouse Architecture and Practices

QuTouTiao leverages Flink and ClickHouse to build a high‑performance real‑time analytics platform that supports hourly Hive pipelines and sub‑second ClickHouse queries, achieving sub‑second response for 80% of requests through streaming ingestion, exactly‑once semantics, multi‑cluster coordination, and optimized ClickHouse storage and connector designs.

Big Data · ClickHouse · Flink

0 likes · 16 min read

JD Retail Technology

Jun 17, 2020 · Operations

How JD’s Data Platforms Scaled for the 618 Mega‑Sale: Operations, Stress‑Testing, and Dual‑Stream Architecture

The article details JD’s data product teams’ systematic preparation for the 618 shopping festival, covering pressure estimation, capacity expansion, stress testing, emergency downgrade strategies, dual‑data‑center isolation, high‑fidelity end‑to‑end testing, and continuous monitoring to ensure stable, real‑time data services during massive traffic spikes.

Big Data · Data Platform · JD.com

0 likes · 10 min read
