Tagged articles

3675 articles


Mar 29, 2020 · Industry Insights

How Federated Learning Is Breaking Data Silos Across Clouds

This article examines the rise of federated learning as a solution to data islands, detailing regulatory pressures, technical foundations, industry implementations by WeBank, Tencent and VMware, and practical product workflows that enable secure, cross‑cloud AI collaboration.

Artificial Intelligence · Big Data · Data Collaboration

0 likes · 9 min read


DataFunTalk

Mar 28, 2020 · Big Data

Applying Flink State Management for Real-Time Recommendation Scenarios

This article explains how Apache Flink's flexible state management can be leveraged to solve data correlation challenges in real‑time recommendation platforms, compares Flink with Spark and Storm, describes the underlying broadcast and managed state mechanisms, and provides a step‑by‑step implementation using Kafka, Druid, and custom broadcast functions.

Big Data · Flink · Real-Time

0 likes · 14 min read


Programmer DD

Mar 27, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Queries

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Qunar, 58.com and Didi design, scale, and operate massive Elasticsearch clusters for search, real‑time analytics, and security, detailing architecture choices, shard strategies, data pipelines and performance optimizations.

Big Data · Distributed Systems · Elasticsearch

0 likes · 12 min read


Xianyu Technology

Mar 26, 2020 · Big Data

Scalable User Behavior Data Collection and Auto-Generated Datasets for Xianyu

Xianyu created a highly extensible user‑behavior collection framework that standardizes data into a common ODPS schema, uses JavaScript Proxy to intercept navigation and API calls, maps business metrics via JSON, and aggregates reports. It cuts dataset‑creation effort from days to minutes while avoiding the overhead of heavy full tracking.

Analytics · Big Data · JavaScript

0 likes · 9 min read


58 Tech

Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Risk Detection · Spark

0 likes · 8 min read


360 Quality & Efficiency

Mar 24, 2020 · Big Data

Understanding Granularity in Data Warehouse Design

This article explains the concept of granularity in data warehouse design, describing data models composed of structures, operations, and constraints, illustrating how granularity affects storage detail, query performance, and resource consumption, and recommending a dual‑granularity approach to balance efficiency and analytical depth.

Analytics · Big Data · data modeling

0 likes · 5 min read


Big Data Technology & Architecture

Mar 23, 2020 · Big Data

Best Practices for Designing HBase RowKey to Avoid Hotspots

The article explains how to design HBase RowKeys by dispersing keys, controlling their length, and ensuring uniqueness, providing concrete techniques such as salting, hashing, reversing values, and a practical example with table creation to improve scan performance and prevent region hotspot issues.

Big Data · HBase · HotSpot

0 likes · 6 min read
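
As a rough illustration of the three key-dispersal techniques this summary names (salting, hashing, reversing), here is a minimal Python sketch. It is not taken from the article: the function names, the 16-bucket count, the 8-character hash prefix, and the choice of a deterministic hash-derived salt instead of a random one are all assumptions made for the example.

```python
import hashlib


def salted_key(key: str, buckets: int = 16) -> str:
    """Prefix the key with a bucket number (assumed: 16 buckets).

    The bucket is derived from a hash of the key, so a reader can
    recompute the prefix for a point lookup. A truly random salt would
    spread writes equally well but would force reads to try every bucket.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}_{key}"


def hashed_key(key: str, prefix_len: int = 8) -> str:
    """Prefix the key with a short hex hash so adjacent keys scatter."""
    prefix = hashlib.md5(key.encode("utf-8")).hexdigest()[:prefix_len]
    return f"{prefix}_{key}"


def reversed_key(key: str) -> str:
    """Reverse the key so the fast-changing tail comes first."""
    return key[::-1]


if __name__ == "__main__":
    for raw in ("13800001234", "13800001235", "13800001236"):
        print(salted_key(raw), hashed_key(raw), reversed_key(raw))
```

All three variants trade scan locality for write distribution: sequential raw keys no longer sort next to each other, so range scans over the original order become more expensive.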


dbaplus Community

Mar 19, 2020 · Big Data

Inside Ctrip Flight Ticket Data Warehouse: Evolution, Architecture, and Real‑Time Challenges

This article details the evolution of Ctrip's flight ticket data warehouse, describing its historical tech stack, current architecture—including Hive, Presto, ClickHouse, CrateDB, and Flink—data synchronization methods, layer design, quality monitoring, and a real‑time price‑monitoring use case.

Big Data · Ctrip · Data Quality

0 likes · 19 min read


Qunar Tech Salon

Mar 19, 2020 · Big Data

Apache Kafka Overview: Architecture, Features, and Usage

This article provides a comprehensive introduction to Apache Kafka, covering its high‑throughput distributed architecture, core concepts such as topics, partitions, brokers, producers and consumers, design goals, performance characteristics, deployment steps, configuration, and example code for producers, consumers, and Spring Boot integration.

Big Data · Distributed Systems · Kafka

0 likes · 39 min read


Big Data Technology Architecture

Mar 19, 2020 · Big Data

Hive Optimization Modes: Local, Parallel, Strict, and Uber

This article explains Hive's four optimization modes—Local, Parallel, Strict, and Uber—detailing their purpose, performance impact on small MapReduce jobs, and the specific configuration parameters required to enable each mode effectively.

Big Data

0 likes · 8 min read


Youzan Coder

Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Governance

0 likes · 20 min read


Big Data Technology & Architecture

Mar 17, 2020 · Big Data

Quick Guide to Building a Canal‑Based Real‑Time Data Synchronization Platform on CentOS 7

This article walks through the end‑to‑end setup of a small‑scale data platform using Alibaba's Canal for MySQL binlog capture, covering the installation and configuration of MySQL, ZooKeeper, Kafka, and Canal itself, and demonstrates real‑time change capture with sample DML operations.

Big Data · Canal · CentOS

0 likes · 20 min read


58 Tech

Mar 16, 2020 · Fundamentals

Understanding Object Serialization: Principles, Frameworks, and Performance Optimizations

This article explains the concept of object serialization, compares generic formats like JSON/XML with binary approaches, discusses optimization principles, key performance metrics, and reviews major serialization frameworks such as Protobuf, Thrift, Hessian, Kryo, and Avro, while also covering TLV encoding, varint algorithms, and practical pitfalls.

Big Data · Binary · Microservices

0 likes · 16 min read


DevOps

Mar 16, 2020 · Operations

JD.com DevOps Case Study: Agile Transformation, Continuous Delivery, and Organizational Practices

This case study examines JD.com’s evolution into a technology‑driven enterprise, detailing its corporate culture, the “ABCDE” technology strategy, the implementation of DevOps and agile practices through the CALMS framework, and how unified continuous‑delivery platforms and operational metrics have driven growth, efficiency, and pandemic response.

Big Data · Continuous Delivery · DevOps

0 likes · 16 min read


Top Architect

Mar 13, 2020 · Big Data

Three Billion‑Scale MySQL‑to‑HBase Synchronization Solutions and Practical Implementation

This article presents a comprehensive guide for synchronizing massive MySQL datasets to HBase, covering environment preparation, fast MySQL data loading techniques, and three practical pipelines—Sqoop, Kafka‑Thrift, and Kafka‑Flink—along with performance comparisons and optimization tips for large‑scale data processing.

Big Data · Flink · HBase

0 likes · 24 min read


Meituan Technology Team

Mar 12, 2020 · Big Data

Data Governance Practices in Meituan Delivery: Architecture, Standards, and Security

Meituan Delivery’s data‑governance framework combines a four‑layer warehouse architecture with comprehensive business, technical, security, and resource‑management standards, continuous metadata and security controls, and tools such as Wherehows and QuickSight, delivering standardized, secure, and easily shareable data while guiding future optimization and emerging‑technology adoption.

Big Data · Data Architecture · Data Governance

0 likes · 27 min read


Open Source Linux

Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup

0 likes · 13 min read


Tencent Tech

Mar 11, 2020 · Big Data

Scaling the Health Code: Tencent Cloud Elasticsearch at Billion-User Scale

Leveraging Tencent Cloud Elasticsearch, the nationwide COVID‑19 health code platform handled over 1.6 billion scans for more than 900 million users, achieving millisecond‑level search, seamless horizontal scaling, multi‑zone high availability, and robust security, while simplifying development through RESTful APIs and rich UI tools.

Big Data · Distributed Systems · Elasticsearch

0 likes · 12 min read


Alibaba Cloud Developer

Mar 9, 2020 · Big Data

How Alibaba Digitally Managed 100,000 Employees’ Return to the Office

Alibaba leveraged a suite of digital solutions—including a big‑data entry‑control system, AI‑driven mask detection, smart‑robot meal scheduling, predictive parking, environment regulation, and contactless services—to orchestrate a safe, orderly return of over 100,000 staff across its global campuses.

AI · Big Data · Digital Transformation

0 likes · 9 min read


Big Data Technology & Architecture

Mar 8, 2020 · Big Data

Hive on Spark Tuning Parameters and Best Practices

This article explains how to tune Hive on Spark by adjusting driver, executor, and Hive configuration parameters—including CPU cores, memory allocations, dynamic allocation, and join thresholds—to achieve optimal performance when running on YARN.

Big Data · Performance Tuning · Spark

0 likes · 7 min read


Top Architect

Mar 6, 2020 · Big Data

Design and Integration of a Real-Time Log Analysis System Using Flume, Kafka, Storm, Drools, and Redis

This article details the design, installation, and modular integration of Flume, Kafka, Storm, Drools, and Redis to build a real‑time log analysis pipeline for ETL systems, discussing architecture, configuration, code examples, and practical considerations for scalability and fault tolerance.

Big Data · Drools · Flume

0 likes · 24 min read


iQIYI Technical Product Team

Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform that gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts, achieving 80 % faster incident investigation, detecting 90 % of member complaints early, and generating more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics

0 likes · 11 min read


Tencent Cloud Middleware

Mar 6, 2020 · Operations

Choosing the Right Disk Strategy for High‑Throughput Kafka Clusters

This article examines how to select and configure disk solutions—single‑disk, multi‑directory, RAID, and LVM—for Apache Kafka deployments, comparing performance, cost, scalability, and reliability to help operators build stable, high‑throughput messaging infrastructures.

Big Data · Cloud Computing · Disk Design

0 likes · 16 min read


Suning Technology

Mar 5, 2020 · Artificial Intelligence

Will Retail + Internet Healthcare Survive Post‑COVID? Key Insights

After the pandemic, Suning’s Retail Technology Research Institute examines how the convergence of retail and internet medical services can address rising healthcare demand, resource shortages, and infection risks, leveraging big data, AI, and e‑commerce logistics to create integrated, non‑contact medical solutions and new business models.

AI · Big Data · Healthcare

0 likes · 13 min read


Ctrip Technology

Mar 5, 2020 · Big Data

Design and Optimization of Ctrip's Hotel Data Intelligence Platform Using ClickHouse

This article describes how Ctrip built a unified hotel data intelligence platform, evaluated various database solutions, selected ClickHouse as the primary engine, and implemented performance, high‑availability, and monitoring strategies to handle billions of records and thousands of concurrent queries.

Big Data · ClickHouse · Ctrip

0 likes · 13 min read


dbaplus Community

Mar 3, 2020 · Big Data

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

Big Data · Kafka · Resource Isolation

0 likes · 17 min read


ITPUB

Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · HBase

0 likes · 23 min read


Beike Product & Technology

Feb 27, 2020 · Big Data

Real‑Time Computing with Apache Flink at Beike Zhaofang: Hermes Platform Overview and Future Plans

This article presents the evolution, architecture, and operational metrics of Beike Zhaofang's Hermes real‑time computing platform built on Apache Flink, detailing its business scale, SQL editors, task growth, monitoring, use cases, and future development directions.

Apache Flink · Big Data · Real-time Streaming

0 likes · 10 min read


Alibaba Cloud Developer

Feb 27, 2020 · Databases

How Cloud‑Native Distributed Databases Are Shaping the Future of Enterprise Data

This article reviews the evolution, market trends, core components, architectural challenges, and emerging technologies of cloud‑native distributed database systems, highlighting Alibaba Cloud's solutions such as POLARDB, AnalyticDB, and AI‑driven management platforms that enable elastic, high‑availability, and intelligent data services for modern enterprises.

Alibaba Cloud · Big Data · HTAP

0 likes · 26 min read


Suning Technology

Feb 25, 2020 · Operations

How Post-Pandemic Retail Is Reinvented: Trends, Tech, and Opportunities

The Suning Retail Technology Research Institute analyzes post‑COVID retail trends, highlighting shifts in consumer behavior, the rise of product traceability, smart masks, AI‑enabled smart homes, remote work, online healthcare, and community group buying, while outlining the technologies driving these changes.

AI · Big Data · post-pandemic

0 likes · 8 min read


Big Data Technology & Architecture

Feb 24, 2020 · Big Data

Apache Ozone: Architecture, Design Principles, and Deployment Guide

This article introduces Apache Ozone, a scalable distributed object storage system for Hadoop, covering its background, core components, design principles, architecture, deployment steps, configuration examples, and basic command‑line operations for managing volumes, buckets, and keys.

Big Data · CLI · Deployment

0 likes · 18 min read


Suning Technology

Feb 22, 2020 · Big Data

How SuNing’s Big Data Engine Powers Health‑Code Pandemic Management

During the COVID‑19 pandemic, SuNing launched a public travel information registration system that leverages massive big‑data processing, high‑concurrency architecture, Kafka streaming, and real‑time analytics to create a city‑wide health‑code network, enabling precise epidemic control, mobility tracking, and robust data privacy safeguards.

Big Data · Health Code · data privacy

0 likes · 5 min read


Qunar Tech Salon

Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Model Training

0 likes · 9 min read


21CTO

Feb 19, 2020 · Big Data

Building an Open-Source Big Data Analytics Stack: Challenges & Benefits

The article explains why modern companies rely on data‑driven decisions, outlines the two main challenges of tracking data and connecting it to BI, describes the three‑step analytics stack (integration, warehouse, analysis), and highlights the cost, flexibility, and security advantages of open‑source tools.

Big Data · Data Analytics · Data Integration

0 likes · 5 min read


DataFunTalk

Feb 19, 2020 · Big Data

Design and Integration of Flink Batch Processing with Hive: Architecture, Features, and Performance Evaluation

This article presents the design of Flink's batch processing architecture, its integration with Hive through a unified Catalog API, details the enhancements in Flink 1.10, outlines future work, and reports a performance test showing roughly seven‑fold speedup over Hive on MapReduce.

Batch Processing · Big Data · Catalog API

0 likes · 9 min read


Big Data Technology Architecture

Feb 17, 2020 · Big Data

Evolution of Apache Kafka Versions and Their Key Features

This article reviews the historical evolution of Apache Kafka versions, explains the versioning scheme, highlights major features introduced in each release from 0.7.x to 2.x, and provides practical recommendations for selecting an appropriate Kafka version.

Big Data · Producer Consumer · Versioning

0 likes · 9 min read


MaGe Linux Operations

Feb 17, 2020 · Operations

How to Efficiently Split and Merge Large Log Files on Linux

When log files grow massive, traditional tools like vim, cat, grep, and awk become slow and memory‑hungry, but Linux’s split command lets you divide a huge file by line count or size, process the pieces individually, and later recombine them, dramatically improving analysis efficiency.

Big Data · Shell scripting · file-handling

0 likes · 8 min read
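
The split-then-recombine workflow this summary describes corresponds on the command line to `split -l` (by line count) or `split -b` (by size), followed by `cat` to merge. The same idea can be sketched in Python as below. This is an illustration and not code from the article; the function names and the `.partNNNN` chunk naming scheme are assumptions.

```python
from itertools import islice
from pathlib import Path
from typing import List


def split_by_lines(src: Path, out_dir: Path, lines_per_chunk: int) -> List[Path]:
    """Write src into numbered chunk files of at most lines_per_chunk lines."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    with src.open("rb") as handle:
        index = 0
        while True:
            # Read the next batch of lines without loading the whole file.
            block = list(islice(handle, lines_per_chunk))
            if not block:
                break
            chunk = out_dir / f"{src.name}.part{index:04d}"
            chunk.write_bytes(b"".join(block))
            chunks.append(chunk)
            index += 1
    return chunks


def merge_chunks(chunks: List[Path], dest: Path) -> None:
    """Concatenate the chunk files, in the given order, into dest."""
    with dest.open("wb") as out:
        for chunk in chunks:
            out.write(chunk.read_bytes())
```

Only one chunk's worth of lines is held in memory at a time, which is the property that makes splitting attractive for files too large to open in an editor.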


DataFunTalk

Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing

0 likes · 10 min read


Big Data Technology & Architecture

Feb 16, 2020 · Big Data

Implementing User Purchase Behavior Tracking with Flink Broadcast State

This article explains how to use Flink's Broadcast State to track user purchase paths in real time, detailing the design, required Kafka streams, Java APIs, state management, dynamic configuration, code implementation, deployment steps, and example results for a big‑data streaming application.

Big Data · Broadcast State · Flink

0 likes · 19 min read


Big Data Technology & Architecture

Feb 16, 2020 · Big Data

Implementing MySQL Binlog Synchronization to HDFS Using Canal

This article is a step‑by‑step guide to deploying Canal to capture MySQL binlog events, configuring HA with ZooKeeper, designing a client that parses binlog into JSON and asynchronously acknowledges messages, archiving data to local files for batch upload to HDFS, and monitoring latency for alerts.

Big Data · Binlog · Canal

0 likes · 10 min read


Suning Technology

Feb 15, 2020 · Artificial Intelligence

How AI and Unmanned Tech Are Redefining Retail in the Post‑Pandemic Era

The COVID‑19 pandemic accelerated instant consumption and O2O integration, prompting retailers to adopt AI‑driven unmanned stores, big‑data traceability, smart‑home solutions, and innovative mask and health‑product strategies, reshaping supply chains, operations, and consumer experiences.

AI · Big Data · COVID-19

0 likes · 12 min read


Big Data Technology & Architecture

Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop

0 likes · 11 min read


Tencent Cloud Developer

Feb 13, 2020 · Big Data

Data Middle Platform: Vision, Architecture, and Business Value

The Data Middle Platform, described by Shi Kai, is a service‑oriented architecture that transforms raw enterprise data into reusable, real‑time APIs for business applications, bridging the gap between traditional warehouses and front‑end systems, accelerating digital transformation through unified governance, rapid development, and direct business value.

Big Data · Data Architecture · Data Middle Platform

0 likes · 26 min read


Big Data Technology & Architecture

Feb 10, 2020 · Big Data

Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell

This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.

Big Data · Binlog · Canal

0 likes · 17 min read


58 Tech

Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data · Data Quality · ETL

0 likes · 11 min read


Big Data Technology & Architecture

Feb 9, 2020 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer to store serialized key/value pairs and their metadata, detailing its structure, initialization, write path, spill logic, and the background thread that sorts and writes data to disk.

Big Data · Hadoop · MapReduce

0 likes · 24 min read


Big Data Technology & Architecture

Feb 8, 2020 · Big Data

A Practical Guide to Reading Apache Spark Source Code and Understanding Its Core Design

This article explains why Spark is a mature big‑data framework, recommends which Spark versions to study, lists essential research papers, describes how to set up the development environment, and outlines the key components of Spark’s core architecture for effective source‑code exploration.

Apache Spark · Big Data · RDD

0 likes · 6 min read


Big Data Technology & Architecture

Feb 6, 2020 · Big Data

Comparison of Hudi, Iceberg, and Delta Lake Table Formats

This article compares the design goals of three data‑lake table formats (Hudi, Iceberg, and Delta Lake), highlighting their common reliance on meta files and their distinct strengths for upserts, analytics, and unified streaming‑batch processing in modern big‑data environments.

Big Data · Data Lake · Delta Lake

0 likes · 10 min read


HomeTech

Feb 6, 2020 · Product Management

AutoBI One‑Stop Data Visualization Platform: Architecture, Technical Highlights, and Use Cases

The document outlines AutoBI, a company‑wide one‑stop data visualization platform, detailing its background, overall architecture, key technical components such as real‑time/offline data switching and query processing, integration capabilities, and practical case studies, highlighting efficiency gains and future development plans.

Backend · Big Data · Dashboard

0 likes · 8 min read


Big Data Technology & Architecture

Feb 5, 2020 · Big Data

Resolving Oozie Shell Scheduling Issues for Flink Jobs on CDH 6.3 with Kerberos Authentication

The article describes how to troubleshoot and fix Oozie shell‑action failures when submitting Flink jobs on a CDH 6.3 cluster with Kerberos, detailing environment‑variable conflicts, error messages, and the final solution using a clean environment and custom FLINK_CONF_DIR settings.

Big Data · CDH · Flink

0 likes · 7 min read


360 Quality & Efficiency

Feb 5, 2020 · Artificial Intelligence

Key Takeaways from AICon: AI Fundamentals, Applications, and Future Directions

The article shares notes from the AICon global AI and machine learning conference, outlining AI’s three core elements—computing power, big data, and algorithms—its problem domains, current applications across industries, and future directions such as AI‑IoT‑5G integration.

AI Conference · Artificial Intelligence · Big Data

0 likes · 6 min read


Youzan Coder

Feb 5, 2020 · Backend Development

Configurable Data Reconciliation Platform at Youzan: Design, Architecture, and Implementation

Youzan built a configurable data reconciliation platform that integrates new scenarios, processes massive real‑time and batch data, offers visual monitoring, automated correction, and flexible Groovy‑based logic across four DDD layers, achieving 99.99% stability while simplifying detection and resolution of cross‑system inconsistencies.

Big Data · Data Reconciliation · Distributed Systems

0 likes · 15 min read


Big Data Technology Architecture

Feb 1, 2020 · Big Data

Beike's Hermes Real‑Time Computing Platform: Architecture, Scale, and Future Roadmap

The article presents a comprehensive case study of Beike's Hermes real‑time computing platform, detailing its business evolution, Hermes architecture, SQL V1/V2 editors built on Spark and Flink, large‑scale deployment statistics, monitoring, diverse business use cases, and planned future enhancements.

Apache Flink · Beike · Big Data

0 likes · 11 min read


Big Data Technology & Architecture

Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Resource Tuning

0 likes · 67 min read


Big Data Technology & Architecture

Jan 25, 2020 · Big Data

Spark Scala Example: Find the Most Frequent Visitor ID in a 500‑Million‑Record Dataset

This article demonstrates how to generate 500 million visitor IDs with Spark, use map‑reduce operations to count occurrences, and identify the ID with the highest visit count, while discussing performance considerations such as memory spilling and cluster resources.

Big Data · RDD · Scala

0 likes · 11 min read
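
The counting pattern this summary outlines (map each ID to a pair, reduce by key, pick the maximum) can be shown in a single-process Python sketch. This is not the article's Spark Scala code: the function names are hypothetical, the sample size is far below 500 million, and nothing here addresses the distribution or memory-spilling concerns the article discusses.

```python
import random
from collections import defaultdict


def map_phase(ids):
    """Map step: emit an (id, 1) pair for every visit."""
    return ((visitor_id, 1) for visitor_id in ids)


def reduce_by_key(pairs):
    """Reduce step: sum the values that share the same key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts


def most_frequent(ids):
    """Return (visitor_id, visit_count) for the most frequent ID.

    Raises ValueError when ids is empty.
    """
    counts = reduce_by_key(map_phase(ids))
    return max(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed, assumed for repeatability
    sample = [rng.randint(1, 1000) for _ in range(100_000)]
    print(most_frequent(sample))
```

In a cluster setting the reduce step runs per partition and the partial counts are merged afterwards; the sketch collapses that into one dictionary.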


Big Data Technology & Architecture

Jan 20, 2020 · Big Data

Understanding Data Middle Platform: Architecture, Components, and Operational Practices

The article explains the concept, architecture, and key components of a data middle platform—including data aggregation, development, asset management, service systems, and operational and security mechanisms—while also promoting related books and a giveaway.

Big Data · Data Architecture · Data Governance

0 likes · 7 min read

Alibaba Cloud Developer

Jan 20, 2020 · Big Data

Alibaba’s Secrets to High‑Throughput Full‑Load and Low‑Latency Search Processing

This article details how Alibaba migrated its massive Taobao‑Tmall search workload to the search offline platform, tackling challenges of massive data volume, one‑to‑many joins, and hotspot sellers through a series of performance optimizations—including local joins, salt‑based data sharding, dynamic aggregation jobs, and asynchronous processing—to achieve high‑throughput full loads and low‑latency incremental updates.

AlibabaBig DataFlink

0 likes · 15 min read

Big Data Technology & Architecture

Jan 19, 2020 · Big Data

Tencent's Elasticsearch Practices: Application Scenarios, Challenges, Optimizations, and Future Directions

This article details how Tencent leverages Elasticsearch for log analysis, search services, and time‑series data, outlines the specific challenges faced in high‑availability and cost‑efficiency, and presents the comprehensive optimization techniques and future open‑source contributions that improve performance, scalability, and reliability.

Big DataCost OptimizationElasticsearch

0 likes · 16 min read

Tencent Cloud Developer

Jan 19, 2020 · Backend Development

Tencent Kona JDK: OpenJDK Foundations, Technical Trends, and Big Data Practices

The talk reviews OpenJDK’s evolution, contrasts it with Oracle JDK, introduces Tencent’s Kona JDK as a free, long‑term, production‑hardened fork optimized for massive micro‑service and big‑data workloads, and discusses emerging trends shaping JVM performance: Java‑on‑Java, value types, Project Panama, Project Loom, and the SIMD Vector API.

Big DataCloud ComputingJVM

0 likes · 15 min read

Big Data Technology & Architecture

Jan 16, 2020 · Big Data

Kafka Interview Guide: Core Concepts, Architecture, and Practical Tips

This article compiles essential Kafka interview material, covering its role as a message queue, usage scenarios, architectural components, storage mechanisms, consumer group rebalancing, high‑availability features, replication details, ordering guarantees, producer/consumer client design, topic management, log retention, performance optimizations, and key monitoring metrics.

Big DataDistributed SystemsKafka

0 likes · 16 min read

360 Tech Engineering

Jan 16, 2020 · Big Data

Real-Time and Offline Integrated Solution for Channel Analysis Data Processing

This article presents a comprehensive real‑time and offline integrated solution for a channel analysis system, detailing its challenges, architecture, and implementation with Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and demonstrating minute‑level latency and high accuracy through performance evaluations.

Big DataElasticsearchFlink

0 likes · 10 min read

Architects Research Society

Jan 16, 2020 · Big Data

Elasticsearch vs Solr: Choosing the Right Open‑Source Search Engine

This article compares Elasticsearch and Solr, examining their history, community, licensing, core technologies, APIs, scalability, vendor support, ecosystem, performance, management tools, and visualization options to help organizations decide which open‑source search engine best fits their big‑data and search requirements.

Big DataElasticsearchSolr

0 likes · 12 min read

Big Data Technology & Architecture

Jan 13, 2020 · Big Data

Understanding ORC File Format in Hive: Structure, Storage, Indexes, Compression, and Configuration

This article explains the ORC (Optimized Record Columnar) file format used in Hive, covering its architecture, stripe and column storage, handling of complex data types, indexing mechanisms, compression streams, memory management, and key configuration parameters.

Big DataFile FormatORC

0 likes · 14 min read

Big Data Technology & Architecture

Jan 10, 2020 · Big Data

Async I/O for Dimension Table Joins in Apache Flink

This article explains how to handle dimension table joins in Apache Flink streaming by leveraging Async I/O to perform non‑blocking external lookups, provides detailed code examples for both synchronous and asynchronous functions, discusses configuration parameters, and outlines best practices and pitfalls.

Big DataDimension Table JoinFlink

0 likes · 16 min read

ITPUB

Jan 10, 2020 · Big Data

How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.

Big DataData ReplayKafka

0 likes · 16 min read

Architects' Tech Alliance

Jan 9, 2020 · Big Data

Building a Data Middle Platform: Practices and Architecture at NetEase Yanxuan

The article explains why companies are building data middle platforms, defines what a data middle platform is, and details NetEase Yanxuan’s architecture, including its data warehouse, data services, and BI platform, illustrating how these components enable data‑driven transformation and fine‑grained operations.

BIBig DataData Middle Platform

0 likes · 11 min read

DataFunTalk

Jan 9, 2020 · Databases

Exploring Spatiotemporal Data Management with Cassandra, GeoMesa, and GeoTrellis

This article presents a comprehensive overview of handling spatiotemporal data using Cassandra, covering data types, space‑filling curves, GeoHash encoding, the GeoMesa and GeoTrellis ecosystems, Cassandra storage schemas, and practical Spark integration for large‑scale geospatial analytics.

Big DataGeoMesaGeoTrellis

0 likes · 8 min read

iQIYI Technical Product Team

Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache DruidBig DataFlink

0 likes · 13 min read

Big Data Technology & Architecture

Jan 8, 2020 · Big Data

Real-Time Data Warehouse Architecture and Challenges Using Flink, Kafka, and HBase

This article examines the design of a real-time data warehouse built on Flink, Kafka, and HBase, compares it with traditional offline warehouses, and discusses key challenges such as data accuracy, latency, and the complexity of maintaining real-time dimension tables.

Big DataFlinkHBase

0 likes · 10 min read

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Real-time Data Processing with Kafka, Spark Streaming, and HBase: Implementation Guide

This article presents a step‑by‑step guide for building a real‑time data pipeline that uses Kafka as a message buffer, Spark Streaming's direct approach for processing, and HBase for storage, including code examples, Maven configuration, local cluster setup, and troubleshooting tips.

Big DataHBaseKafka

0 likes · 12 min read

Python Programming Learning Circle

Jan 7, 2020 · Fundamentals

Which Tech Skills Will Make You Irreplaceable in Today’s Job Market?

In a fiercely competitive internet era, technical professionals must continuously learn across fields such as information security, Python, cloud computing, big data, AI, software testing, IoT, and internet marketing to become the highly sought‑after talent that companies urgently need.

Artificial IntelligenceBig DataCloud Computing

0 likes · 7 min read

Tongcheng Travel Technology Center

Jan 7, 2020 · Big Data

Design and Implementation of XFlink: A Flink‑Based Data Migration System on Yarn

The article describes the evolution from the legacy XDATA tool to the new XFlink system, detailing its architecture, core plugins, parser and deployment modules, resource management with Yarn, monitoring via Prometheus and Grafana, and planned enhancements such as Flink SQL configuration and modular plugins.

Big DataData MigrationDistributed Systems

0 likes · 10 min read

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Using HyperLogLog for High-Performance Pre-Aggregation in Big Data with Spark-Alchemy

The article explains how pre‑aggregation combined with the HyperLogLog algorithm and Spark‑Alchemy's native HLL functions can dramatically accelerate distinct‑count calculations in big‑data workloads while maintaining low error rates and cross‑system compatibility.

Approximate Distinct CountBig DataHyperLogLog

0 likes · 7 min read

Top Architect

Jan 7, 2020 · Big Data

Technical Architecture Overview of Toutiao: Data Processing, User Modeling, and Recommendation System

This article provides a comprehensive overview of Toutiao's rapid growth and technical architecture, detailing its massive user base, data collection pipelines, user modeling, recommendation engines, storage solutions, message push mechanisms, micro‑service design, and virtualization PaaS platform.

ArchitectureBig DataMicroservices

0 likes · 8 min read

dbaplus Community

Jan 6, 2020 · Big Data

How 58.com Built a Scalable Flink‑Based Real‑Time Data Platform (Wstream)

The article details how 58.com designed and evolved its one‑stop real‑time computation platform Wstream, migrating from Storm and Spark Streaming to Apache Flink, and describes the architecture, task isolation, stream‑SQL features, monitoring, and ongoing optimizations that enable processing of over 600 billion records daily.

Big DataFlinkReal-time Streaming

0 likes · 12 min read

Tencent Cloud Developer

Jan 6, 2020 · Big Data

Overview of TubeMQ: Principles, Architecture, Performance, and Open‑Source Strategy for Big‑Data Message Queues

TubeMQ is a Java‑based distributed message‑queue middleware built for trillion‑level data ingestion, offering 140 k TPS with sub‑5 ms latency, high reliability, low cost, and horizontal scalability; it is being open‑sourced and contributed to the Apache Software Foundation to foster community collaboration and future expansion beyond traditional MQ functions.

Big DataDistributed SystemsMessage Queue

0 likes · 15 min read

58 Tech

Jan 6, 2020 · Big Data

Design and Architecture of the 58DP Big Data Platform Task Scheduling System

The article presents a comprehensive overview of the 58DP big data platform's task scheduling system, detailing its background, architecture, high‑availability design, slot‑based resource management, scheduling models, task lifecycle, priority rules, dependency handling, failure recovery, and future enhancements.

Big DataResource Managementdistributed system

0 likes · 14 min read

Didi Tech

Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. It resolved EditLog, Fsimage, StringTable, and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode, NameNode, DataNode procedure, was validated with rehearsals and a custom trash‑management tool, and achieved uninterrupted service along with improved stability, performance, and cost efficiency.

Big DataCluster MigrationHDFS

0 likes · 11 min read

Big Data Technology & Architecture

Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big DataSparkStreaming

0 likes · 42 min read

Mafengwo Technology

Jan 2, 2020 · Big Data

How We Scaled Kafka for Real‑Time Big Data at Mafengwo: Lessons and Practices

This article details Mafengwo's practical experience using Kafka within its big‑data platform, covering application scenarios, evolution through version upgrades, resource isolation, security and monitoring enhancements, and future plans for data duplication handling and consumer throttling.

Big DataData StreamingKafka

0 likes · 16 min read

DataFunTalk

Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big DataByteDanceFederation

0 likes · 18 min read

dbaplus Community

Jan 1, 2020 · Big Data

How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline

Facebook migrated a massive, multi‑stage Hive‑based entity ranking pipeline to a single Spark job, detailing the challenges of scaling to 20 TB inputs, the reliability fixes, performance optimizations, and the resulting 4‑6× CPU speedup and reduced latency.

Big DataReliabilitySpark

0 likes · 16 min read

Tongcheng Travel Technology Center

Dec 31, 2019 · Big Data

Apache Kylin Overview and Model Optimization Practices for Trajectory Analytics

This article introduces Apache Kylin, details its deployment at Tongcheng Yilong, explains the design of a large‑scale trajectory model, and provides step‑by‑step optimization techniques (cube dimension reduction, HBase rowkey tuning, build parameter tweaks, high‑cardinality handling, and disabling query compression) to achieve sub‑second OLAP queries on multi‑terabyte data.

Apache KylinBig DataCube

0 likes · 17 min read

Cloud Native Technology Community

Dec 30, 2019 · Big Data

Kafka 2.4.0 Release Summary: New Features, Improvements, and Bug Fixes

The article provides a comprehensive overview of Apache Kafka 2.4.0, detailing its major new capabilities such as consumer replica fetching, progressive cooperative rebalancing, MirrorMaker 2.0, new Java authentication APIs, and extensive bug fixes, along with upgrade considerations and related resources.

Apache KafkaBig DataRelease Notes

0 likes · 26 min read

DataFunTalk

Dec 30, 2019 · Databases

Cassandra: Past, Present, and Future – History, Architecture, Features, and Use Cases

This article summarizes a Cassandra meetup presentation that traces the database's origins in BigTable and Dynamo, outlines its key milestones, explains its peer‑to‑peer and LSM architecture, highlights current features, real‑world deployments, and performance advantages, and previews the upcoming 4.0 release and community projects.

Big DataGossip ProtocolLSM

0 likes · 14 min read

dbaplus Community

Dec 29, 2019 · Databases

What New Database Versions and Trends Shaped 2019? A Comprehensive Review

The 2019 dbaplus Newsletter compiles a detailed overview of major RDBMS, NoSQL, NewSQL, big‑data, Chinese and cloud database releases, highlighting key features, performance improvements, security enhancements, and future road‑maps for each product.

Big DataCloud ComputingNewSQL

0 likes · 40 min read

Java High-Performance Architecture

Dec 29, 2019 · Fundamentals

Which Technologies Will Dominate Software Development in 2020? A Trend Forecast

This article forecasts the 2020 software development landscape, highlighting the rise of cloud adoption, Kubernetes, micro‑services, Python, Java, emerging languages like Rust and Kotlin, JavaScript frameworks, API standards, SQL dominance, big‑data engines Spark and Flink, and the growing impact of WebAssembly.

Big DataCloud ComputingMicroservices

0 likes · 9 min read

Efficient Ops

Dec 28, 2019 · Operations

What the 2019 IT Operations Whitepaper Reveals About Enterprise Ops Trends

The 2019 Enterprise IT Operations Whitepaper, released at the national Operations Conference, systematically examines the definition, value, key capabilities, industry applications, challenges, and future trends of IT operations across telecom, finance, Internet, and manufacturing sectors.

Artificial IntelligenceBig DataIT Operations

0 likes · 6 min read

360 Tech Engineering

Dec 27, 2019 · Big Data

Introduction to ElasticSearch: Core Concepts, Architecture, and Common Operations

This article provides a comprehensive overview of Elasticsearch, covering its distributed architecture, fundamental components such as nodes, shards, and indices, as well as practical guidance on index design, mapping, bulk operations, query processing, scroll searches, alias management, and performance tuning.

Big DataClusterMapping

0 likes · 11 min read

ITPUB

Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big DataReliabilitySpark

0 likes · 16 min read

Huawei Cloud Developer Alliance

Dec 27, 2019 · Big Data

How to Compile and Install CDH Hadoop on Kunpeng Cloud: Step‑by‑Step Guide

This article walks through the full‑stack process of migrating and compiling the CDH Hadoop distribution on Kunpeng cloud servers, covering environment setup, dependency installation, source code adjustments, common build errors, and final packaging for a production‑ready big‑data platform.

Big DataCDHCompilation

0 likes · 14 min read

21CTO

Dec 26, 2019 · Artificial Intelligence

Will AI and Machine Learning Redefine Software Testing in 2020?

The article outlines five major 2020 software testing trends—including the surge of AI/ML, digital transformation, cloud and IoT adoption, the shift from performance testing to performance engineering, and the growing importance of big‑data testing—highlighting their impact on quality assurance practices.

AIBig DataCloud Computing

0 likes · 7 min read

Big Data Technology & Architecture

Dec 25, 2019 · Big Data

Understanding Flink StreamPartitioner and Its Implementations

Flink’s StreamPartitioner abstracts data routing in the DataStream API and has eight built‑in implementations (Global, Shuffle, Rebalance, KeyGroup, Broadcast, Rescale, Forward, and Custom), each with distinct channel‑selection logic; the article illustrates them with source code snippets and explains their runtime behavior.

Big DataDataStreamFlink

0 likes · 8 min read

Tongcheng Travel Technology Center

Dec 25, 2019 · Big Data

Recap of Tongcheng Elong 5th Big Data Technology and Application Salon (2019)

The article reviews the 2019 Tongcheng Elong Big Data Technology and Application Salon, summarizing six expert talks on data middle platforms, intelligent marketing, real‑time recommendation, Apache Pulsar, Chinese entity recognition, and hotel ranking models, plus event highlights and future plans.

Apache PulsarBig DataData Platform

0 likes · 5 min read

DataFunTalk

Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

ArrowBig DataPySpark

0 likes · 13 min read

dbaplus Community

Dec 23, 2019 · Databases

How to Deploy, Scale, and Monitor ClickHouse for High‑Performance Big Data Analytics

This article explains ClickHouse's deployment architecture, read‑write separation, shard expansion steps, write‑batch strategies, a three‑layer monitoring model, and its practical application in Tencent's game analytics platform, offering concrete guidance for building a stable, high‑throughput analytics service.

Big DataDeploymentGame Analytics

0 likes · 21 min read

DataFunTalk

Dec 23, 2019 · Databases

Cassandra Deployment and Optimization at 360 Cloud Storage

This article details how 360 adopted Cassandra for its cloud drive, describing Cassandra’s decentralized architecture, the reasons for choosing it over HBase, large‑scale deployment challenges, performance optimizations, reliability improvements, disk utilization techniques, and the evolution of the system from 2010 to the present.

Big DataCloud StorageData Reliability

0 likes · 15 min read

Big Data Technology & Architecture

Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big DataDynamic Resource AllocationExecutor Management

0 likes · 11 min read

Big Data Technology & Architecture

Dec 22, 2019 · Big Data

Implementing Multi‑threaded Kafka Consumer and Producer with Partition Management

This article explains how to build a multi‑threaded Kafka consumer and producer in Java, covering partition concepts, consumer group offsets, thread‑pool configuration, and code examples that demonstrate proper use of Kafka streams, partition keys, and batch message sending for improved throughput.

Big DataConsumerKafka

0 likes · 15 min read

Big Data Technology & Architecture

Dec 21, 2019 · Big Data

Kafka Offset Management and Replication Mechanisms Explained

This article provides a comprehensive technical overview of Kafka's offset handling, covering the request entry point, in‑memory offset sources, offset commit and fetch implementations, file storage layout, and the leader‑follower synchronization process that ensures data replication and high‑watermark updates.

Big DataDistributed SystemsHigh Watermark

0 likes · 16 min read
