Tagged articles
3675 articles
Tencent Cloud Developer
Mar 29, 2020 · Industry Insights

How Federated Learning Is Breaking Data Silos Across Clouds

This article examines the rise of federated learning as a solution to data islands, detailing regulatory pressures, technical foundations, industry implementations by WeBank, Tencent and VMware, and practical product workflows that enable secure, cross‑cloud AI collaboration.

Artificial Intelligence · Big Data · Data Collaboration
0 likes · 9 min read
DataFunTalk
Mar 28, 2020 · Big Data

Applying Flink State Management for Real-Time Recommendation Scenarios

This article explains how Apache Flink's flexible state management can be leveraged to solve data correlation challenges in real‑time recommendation platforms, compares Flink with Spark and Storm, describes the underlying broadcast and managed state mechanisms, and provides a step‑by‑step implementation using Kafka, Druid, and custom broadcast functions.

Big Data · Flink · Real-Time
0 likes · 14 min read
Programmer DD
Mar 27, 2020 · Big Data

How Leading Chinese Companies Scale Elasticsearch for Billions of Queries

This article surveys how major Chinese tech firms such as JD.com, Ctrip, Qunar, 58.com and Didi design, scale, and operate massive Elasticsearch clusters for search, real‑time analytics, and security, detailing architecture choices, shard strategies, data pipelines and performance optimizations.

Big Data · Distributed Systems · Elasticsearch
0 likes · 12 min read
Xianyu Technology
Mar 26, 2020 · Big Data

Scalable User Behavior Data Collection and Auto-Generated Datasets for Xianyu

Xianyu created a highly extensible user‑behavior collection framework that standardizes data into a common ODPS schema, uses JavaScript Proxy to intercept navigation and API calls, maps business metrics via JSON, and aggregates reports, cutting dataset‑creation effort from days to minutes while avoiding the overhead of heavy full tracking.

Analytics · Big Data · JavaScript
0 likes · 9 min read
58 Tech
Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Risk Detection · Spark
0 likes · 8 min read
360 Quality & Efficiency
Mar 24, 2020 · Big Data

Understanding Granularity in Data Warehouse Design

This article explains the concept of granularity in data warehouse design, describing data models composed of structures, operations, and constraints, illustrating how granularity affects storage detail, query performance, and resource consumption, and recommending a dual‑granularity approach to balance efficiency and analytical depth.

Analytics · Big Data · data modeling
0 likes · 5 min read
Qunar Tech Salon
Mar 19, 2020 · Big Data

Apache Kafka Overview: Architecture, Features, and Usage

This article provides a comprehensive introduction to Apache Kafka, covering its high‑throughput distributed architecture, core concepts such as topics, partitions, brokers, producers and consumers, design goals, performance characteristics, deployment steps, configuration, and example code for producers, consumers, and Spring Boot integration.

Big Data · Distributed Systems · Kafka
0 likes · 39 min read
Youzan Coder
Mar 18, 2020 · Big Data

The Evolution of Youzan’s Data Warehouse in a Big Data Environment

The article traces Youzan’s data warehouse from its chaotic early days lacking structure, through a 2016 Airflow‑driven construction phase that introduced layered ODS/DW/Data Mart architecture and naming standards, to a mature stage focused on efficiency, security, SparkSQL, dimensional modeling, metadata, and ongoing real‑time and governance challenges.

Airflow · Big Data · Data Governance
0 likes · 20 min read
58 Tech
Mar 16, 2020 · Fundamentals

Understanding Object Serialization: Principles, Frameworks, and Performance Optimizations

This article explains the concept of object serialization, compares generic formats like JSON/XML with binary approaches, discusses optimization principles, key performance metrics, and reviews major serialization frameworks such as Protobuf, Thrift, Hessian, Kryo, and Avro, while also covering TLV encoding, varint algorithms, and practical pitfalls.

Big Data · Binary · Microservices
0 likes · 16 min read
DevOps
Mar 16, 2020 · Operations

JD.com DevOps Case Study: Agile Transformation, Continuous Delivery, and Organizational Practices

This case study examines JD.com’s evolution into a technology‑driven enterprise, detailing its corporate culture, the “ABCDE” technology strategy, the implementation of DevOps and agile practices through the CALMS framework, and how unified continuous‑delivery platforms and operational metrics have driven growth, efficiency, and pandemic response.

Big Data · Continuous Delivery · DevOps
0 likes · 16 min read
Top Architect
Mar 13, 2020 · Big Data

Three Billion‑Scale MySQL‑to‑HBase Synchronization Solutions and Practical Implementation

This article presents a comprehensive guide for synchronizing massive MySQL datasets to HBase, covering environment preparation, fast MySQL data loading techniques, and three practical pipelines—Sqoop, Kafka‑Thrift, and Kafka‑Flink—along with performance comparisons and optimization tips for large‑scale data processing.

Big Data · Flink · HBase
0 likes · 24 min read
Meituan Technology Team
Mar 12, 2020 · Big Data

Data Governance Practices in Meituan Delivery: Architecture, Standards, and Security

Meituan Delivery’s data‑governance framework combines a four‑layer warehouse architecture with comprehensive business, technical, security, and resource‑management standards, continuous metadata and security controls, and tools such as Wherehows and QuickSight, delivering standardized, secure, and easily shareable data while guiding future optimization and emerging‑technology adoption.

Big Data · Data Architecture · Data Governance
0 likes · 27 min read
Open Source Linux
Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup
0 likes · 13 min read
Tencent Tech
Mar 11, 2020 · Big Data

Scaling the Health Code: Tencent Cloud Elasticsearch at Billion-User Scale

Leveraging Tencent Cloud Elasticsearch, the nationwide COVID‑19 health code platform handled over 1.6 billion scans for more than 900 million users, achieving millisecond‑level search, seamless horizontal scaling, multi‑zone high availability, and robust security, while simplifying development through RESTful APIs and rich UI tools.

Big Data · Distributed Systems · Elasticsearch
0 likes · 12 min read
Alibaba Cloud Developer
Mar 9, 2020 · Big Data

How Alibaba Digitally Managed 100,000 Employees’ Return to the Office

Alibaba leveraged a suite of digital solutions—including a big‑data entry‑control system, AI‑driven mask detection, smart‑robot meal scheduling, predictive parking, environment regulation, and contactless services—to orchestrate a safe, orderly return of over 100,000 staff across its global campuses.

AI · Big Data · Digital Transformation
0 likes · 9 min read
iQIYI Technical Product Team
Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform. The platform gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts. It made incident investigation 80% faster, detected 90% of member complaints early, and generated more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics
0 likes · 11 min read
Suning Technology
Mar 5, 2020 · Artificial Intelligence

Will Retail + Internet Healthcare Survive Post‑COVID? Key Insights

After the pandemic, Suning’s Retail Technology Research Institute examines how the convergence of retail and internet medical services can address rising healthcare demand, resource shortages, and infection risks, leveraging big data, AI, and e‑commerce logistics to create integrated, non‑contact medical solutions and new business models.

AI · Big Data · Healthcare
0 likes · 13 min read
dbaplus Community
Mar 3, 2020 · Big Data

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

Big Data · Kafka · Resource Isolation
0 likes · 17 min read
ITPUB
Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · HBase
0 likes · 23 min read
Alibaba Cloud Developer
Feb 27, 2020 · Databases

How Cloud‑Native Distributed Databases Are Shaping the Future of Enterprise Data

This article reviews the evolution, market trends, core components, architectural challenges, and emerging technologies of cloud‑native distributed database systems, highlighting Alibaba Cloud's solutions such as POLARDB, AnalyticDB, and AI‑driven management platforms that enable elastic, high‑availability, and intelligent data services for modern enterprises.

Alibaba Cloud · Big Data · HTAP
0 likes · 26 min read
Suning Technology
Feb 25, 2020 · Operations

How Post-Pandemic Retail Is Reinvented: Trends, Tech, and Opportunities

The Suning Retail Technology Research Institute analyzes post‑COVID retail trends, highlighting shifts in consumer behavior, the rise of product traceability, smart masks, AI‑enabled smart homes, remote work, online healthcare, and community group buying, while outlining the technologies driving these changes.

AI · Big Data · post-pandemic
0 likes · 8 min read
Suning Technology
Feb 22, 2020 · Big Data

How SuNing’s Big Data Engine Powers Health‑Code Pandemic Management

During the COVID‑19 pandemic, SuNing launched a public travel information registration system that leverages massive big‑data processing, high‑concurrency architecture, Kafka streaming, and real‑time analytics to create a city‑wide health‑code network, enabling precise epidemic control, mobility tracking, and robust data privacy safeguards.

Big Data · Health Code · data privacy
0 likes · 5 min read
Qunar Tech Salon
Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Model Training
0 likes · 9 min read
21CTO
Feb 19, 2020 · Big Data

Building an Open-Source Big Data Analytics Stack: Challenges & Benefits

The article explains why modern companies rely on data‑driven decisions, outlines the two main challenges of tracking data and connecting it to BI, describes the three‑step analytics stack (integration, warehouse, analysis), and highlights the cost, flexibility, and security advantages of open‑source tools.

Big Data · Data Analytics · Data Integration
0 likes · 5 min read
MaGe Linux Operations
Feb 17, 2020 · Operations

How to Efficiently Split and Merge Large Log Files on Linux

When log files grow massive, traditional tools like vim, cat, grep, and awk become slow and memory‑hungry, but Linux’s split command lets you divide a huge file by line count or size, process the pieces individually, and later recombine them, dramatically improving analysis efficiency.

Big Data · Shell scripting · file-handling
0 likes · 8 min read
DataFunTalk
Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing
0 likes · 10 min read
Suning Technology
Feb 15, 2020 · Artificial Intelligence

How AI and Unmanned Tech Are Redefining Retail in the Post‑Pandemic Era

The COVID‑19 pandemic accelerated instant consumption and O2O integration, prompting retailers to adopt AI‑driven unmanned stores, big‑data traceability, smart‑home solutions, and innovative mask and health‑product strategies, reshaping supply chains, operations, and consumer experiences.

AI · Big Data · COVID-19
0 likes · 12 min read
Big Data Technology & Architecture
Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop
0 likes · 11 min read
Tencent Cloud Developer
Feb 13, 2020 · Big Data

Data Middle Platform: Vision, Architecture, and Business Value

The Data Middle Platform, described by Shi Kai, is a service‑oriented architecture that transforms raw enterprise data into reusable, real‑time APIs for business applications, bridging the gap between traditional warehouses and front‑end systems, accelerating digital transformation through unified governance, rapid development, and direct business value.

Big Data · Data Architecture · Data Middle Platform
0 likes · 26 min read
Big Data Technology & Architecture
Feb 10, 2020 · Big Data

Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell

This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.

Big Data · Binlog · Canal
0 likes · 17 min read
58 Tech
Feb 10, 2020 · Big Data

Construction and Practice of a Site-wide User Behavior Data Warehouse at 58.com

This article systematically describes the challenges, design principles, modeling methods, layered architecture, implementation steps, and standards used in building a comprehensive user behavior data warehouse for 58.com, highlighting practical experiences and future improvement directions.

Big Data · Data Quality · ETL
0 likes · 11 min read
HomeTech
Feb 6, 2020 · Product Management

AutoBI One‑Stop Data Visualization Platform: Architecture, Technical Highlights, and Use Cases

The document outlines AutoBI, a company‑wide one‑stop data visualization platform, detailing its background, overall architecture, key technical components such as real‑time/offline data switching and query processing, integration capabilities, and practical case studies, highlighting efficiency gains and future development plans.

Backend · Big Data · Dashboard
0 likes · 8 min read
Youzan Coder
Feb 5, 2020 · Backend Development

Configurable Data Reconciliation Platform at Youzan: Design, Architecture, and Implementation

Youzan built a configurable data reconciliation platform that integrates new scenarios, processes massive real‑time and batch data, offers visual monitoring, automated correction, and flexible Groovy‑based logic across four DDD layers, achieving 99.99% stability while simplifying detection and resolution of cross‑system inconsistencies.

Big Data · Data Reconciliation · Distributed Systems
0 likes · 15 min read
Big Data Technology & Architecture
Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Resource Tuning
0 likes · 67 min read
Alibaba Cloud Developer
Jan 20, 2020 · Big Data

Alibaba’s Secrets to High‑Throughput Full‑Load and Low‑Latency Search Processing

This article details how Alibaba migrated its massive Taobao‑Tmall search workload to the search offline platform, tackling challenges of massive data volume, one‑to‑many joins, and hotspot sellers through a series of performance optimizations—including local joins, salt‑based data sharding, dynamic aggregation jobs, and asynchronous processing—to achieve high‑throughput full loads and low‑latency incremental updates.

Alibaba · Big Data · Flink
0 likes · 15 min read
Big Data Technology & Architecture
Jan 19, 2020 · Big Data

Tencent's Elasticsearch Practices: Application Scenarios, Challenges, Optimizations, and Future Directions

This article details how Tencent leverages Elasticsearch for log analysis, search services, and time‑series data, outlines the specific challenges faced in high‑availability and cost‑efficiency, and presents the comprehensive optimization techniques and future open‑source contributions that improve performance, scalability, and reliability.

Big Data · Cost Optimization · Elasticsearch
0 likes · 16 min read
Tencent Cloud Developer
Jan 19, 2020 · Backend Development

Tencent Kona JDK: OpenJDK Foundations, Technical Trends, and Big Data Practices

The talk reviews OpenJDK’s evolution, contrasts it with Oracle JDK, introduces Tencent’s Kona JDK as a free, long‑term, production‑hardened fork optimized for massive micro‑service and big‑data workloads, and discusses emerging Java‑on‑Java, value‑type, Project Panama/Loom, and SIMD Vector API trends shaping JVM performance.

Big Data · Cloud Computing · JVM
0 likes · 15 min read
Big Data Technology & Architecture
Jan 16, 2020 · Big Data

Kafka Interview Guide: Core Concepts, Architecture, and Practical Tips

This article compiles essential Kafka interview material, covering its role as a message queue, usage scenarios, architectural components, storage mechanisms, consumer group rebalancing, high‑availability features, replication details, ordering guarantees, producer/consumer client design, topic management, log retention, performance optimizations, and key monitoring metrics.

Big Data · Distributed Systems · Kafka
0 likes · 16 min read
Architects Research Society
Jan 16, 2020 · Big Data

Elasticsearch vs Solr: Choosing the Right Open‑Source Search Engine

This article compares Elasticsearch and Solr, examining their history, community, licensing, core technologies, APIs, scalability, vendor support, ecosystem, performance, management tools, and visualization options to help organizations decide which open‑source search engine best fits their big‑data and search requirements.

Big Data · Elasticsearch · Solr
0 likes · 12 min read
Big Data Technology & Architecture
Jan 10, 2020 · Big Data

Async I/O for Dimension Table Joins in Apache Flink

This article explains how to handle dimension table joins in Apache Flink streaming by leveraging Async I/O to perform non‑blocking external lookups, provides detailed code examples for both synchronous and asynchronous functions, discusses configuration parameters, and outlines best practices and pitfalls.

Big Data · Dimension Table Join · Flink
0 likes · 16 min read
ITPUB
Jan 10, 2020 · Big Data

How MaFengWo Scales Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo’s practical experience using Kafka across three core scenarios—real‑time storage, analytical data source, and business data subscription—while describing a four‑stage evolution that includes version upgrades, resource isolation, security and monitoring enhancements, and a comprehensive subscription platform, followed by future improvement plans.

Big Data · Data Replay · Kafka
0 likes · 16 min read
DataFunTalk
Jan 9, 2020 · Databases

Exploring Spatiotemporal Data Management with Cassandra, GeoMesa, and GeoTrellis

This article presents a comprehensive overview of handling spatiotemporal data using Cassandra, covering data types, space‑filling curves, GeoHash encoding, the GeoMesa and GeoTrellis ecosystems, Cassandra storage schemas, and practical Spark integration for large‑scale geospatial analytics.

Big Data · GeoMesa · GeoTrellis
0 likes · 8 min read
iQIYI Technical Product Team
Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache Druid · Big Data · Flink
0 likes · 13 min read
Tongcheng Travel Technology Center
Jan 7, 2020 · Big Data

Design and Implementation of XFlink: A Flink‑Based Data Migration System on Yarn

The article describes the evolution from the legacy XDATA tool to the new XFlink system, detailing its architecture, core plugins, parser and deployment modules, resource management with Yarn, monitoring via Prometheus and Grafana, and planned enhancements such as Flink SQL configuration and modular plugins.

Big Data · Data Migration · Distributed Systems
0 likes · 10 min read
dbaplus Community
Jan 6, 2020 · Big Data

How 58.com Built a Scalable Flink‑Based Real‑Time Data Platform (Wstream)

The article details how 58.com designed and evolved its one‑stop real‑time computation platform Wstream, migrating from Storm and Spark Streaming to Apache Flink, and describes the architecture, task isolation, stream‑SQL features, monitoring, and ongoing optimizations that enable processing of over 600 billion records daily.

Big Data · Flink · Real-time Streaming
0 likes · 12 min read
Tencent Cloud Developer
Jan 6, 2020 · Big Data

Overview of TubeMQ: Principles, Architecture, Performance, and Open‑Source Strategy for Big‑Data Message Queues

TubeMQ is a Java‑based distributed message‑queue middleware designed for trillion‑scale data ingestion. It offers 140k TPS with sub‑5 ms latency, high reliability, low cost, and horizontal scalability, and is being donated to the Apache Software Foundation to foster community collaboration and future expansion beyond traditional MQ functions.

Big Data · Distributed Systems · Message Queue
0 likes · 15 min read
58 Tech
Jan 6, 2020 · Big Data

Design and Architecture of the 58DP Big Data Platform Task Scheduling System

The article presents a comprehensive overview of the 58DP big data platform's task scheduling system, detailing its background, architecture, high‑availability design, slot‑based resource management, scheduling models, task lifecycle, priority rules, dependency handling, failure recovery, and future enhancements.

Big Data · Resource Management · distributed system
0 likes · 14 min read
Didi Tech
Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. It resolved EditLog, Fsimage, StringTable and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode, NameNode, DataNode procedure and was validated with rehearsals and a custom trash‑management tool, keeping service uninterrupted while improving stability, performance and cost efficiency.

Big Data · Cluster Migration · HDFS
0 likes · 11 min read
Big Data Technology & Architecture
Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data · Spark · Streaming
0 likes · 42 min read
DataFunTalk
Jan 2, 2020 · Big Data

ByteDance’s HDFS Architecture and Evolution: Design, Challenges, and Optimizations

This article presents an in‑depth overview of ByteDance’s large‑scale HDFS deployment, describing its unique access layer, metadata and data layers, the evolution through multiple growth stages, and the key architectural improvements such as NNProxy, DanceNN, lock redesign, startup acceleration, and slow‑node mitigation techniques.

Big Data · ByteDance · Federation
0 likes · 18 min read
Tongcheng Travel Technology Center
Dec 31, 2019 · Big Data

Apache Kylin Overview and Model Optimization Practices for Trajectory Analytics

This article introduces Apache Kylin, details its deployment at Tongcheng Yilong, explains the design of a large‑scale trajectory model, and provides step‑by‑step optimization techniques (cube dimension reduction, HBase rowkey tuning, build parameter tweaks, high‑cardinality handling, and disabling query compression) to achieve sub‑second OLAP queries on multi‑terabyte data.

Apache Kylin · Big Data · Cube
0 likes · 17 min read
DataFunTalk
Dec 30, 2019 · Databases

Cassandra: Past, Present, and Future – History, Architecture, Features, and Use Cases

This article summarizes a Cassandra meetup presentation that traces the database's origins from BigTable and Dynamo, outlines its key milestones, explains its peer‑to‑peer and LSM architecture, highlights current features, real‑world deployments, performance advantages, and previews upcoming 4.0 releases and community projects.

Big DataGossip ProtocolLSM
0 likes · 14 min read
Cassandra: Past, Present, and Future – History, Architecture, Features, and Use Cases
Java High-Performance Architecture
Dec 29, 2019 · Fundamentals

Which Technologies Will Dominate Software Development in 2020? A Trend Forecast

This article forecasts the 2020 software development landscape, highlighting the rise of cloud adoption, Kubernetes, micro‑services, Python, Java, emerging languages like Rust and Kotlin, JavaScript frameworks, API standards, SQL dominance, big‑data engines Spark and Flink, and the growing impact of WebAssembly.

Big Data · Cloud Computing · Microservices
0 likes · 9 min read
Efficient Ops
Dec 28, 2019 · Operations

What the 2019 IT Operations Whitepaper Reveals About Enterprise Ops Trends

The 2019 Enterprise IT Operations Whitepaper, released at the national Operations Conference, systematically examines the definition, value, key capabilities, industry applications, challenges, and future trends of IT operations across telecom, finance, Internet, and manufacturing sectors.

Artificial Intelligence · Big Data · IT Operations
0 likes · 6 min read
ITPUB
Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Reliability · Spark
0 likes · 16 min read
21CTO
Dec 26, 2019 · Artificial Intelligence

Will AI and Machine Learning Redefine Software Testing in 2020?

The article outlines five major 2020 software testing trends—including the surge of AI/ML, digital transformation, cloud and IoT adoption, the shift from performance testing to performance engineering, and the growing importance of big‑data testing—highlighting their impact on quality assurance practices.

AI · Big Data · Cloud Computing
0 likes · 7 min read
Big Data Technology & Architecture
Dec 25, 2019 · Big Data

Understanding Flink StreamPartitioner and Its Implementations

Flink’s StreamPartitioner abstracts data routing in the DataStream API and has eight built‑in implementations (Global, Shuffle, Rebalance, KeyGroup, Broadcast, Rescale, Forward, and Custom), each with its own channel‑selection logic; the article illustrates them with source code snippets and explains their runtime behavior.

Big Data · DataStream · Flink
0 likes · 8 min read
DataFunTalk
Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · PySpark
0 likes · 13 min read
dbaplus Community
Dec 23, 2019 · Databases

How to Deploy, Scale, and Monitor ClickHouse for High‑Performance Big Data Analytics

This article explains ClickHouse's deployment architecture, read‑write separation, shard expansion steps, write‑batch strategies, a three‑layer monitoring model, and its practical application in Tencent's game analytics platform, offering concrete guidance for building a stable, high‑throughput analytics service.

Big Data · Deployment · Game Analytics
0 likes · 21 min read
DataFunTalk
Dec 23, 2019 · Databases

Cassandra Deployment and Optimization at 360 Cloud Storage

This article details how 360 adopted Cassandra for its cloud drive, describing Cassandra’s decentralized architecture, the reasons it was chosen over HBase, large‑scale deployment challenges, performance optimizations, reliability improvements, disk utilization techniques, and the evolution of the system from 2010 to the present.

Big Data · Cloud Storage · Data Reliability
0 likes · 15 min read
Big Data Technology & Architecture
Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management
0 likes · 11 min read
Big Data Technology & Architecture
Dec 21, 2019 · Big Data

Kafka Offset Management and Replication Mechanisms Explained

This article provides a comprehensive technical overview of Kafka's offset handling, covering the request entry point, in‑memory offset sources, offset commit and fetch implementations, file storage layout, and the leader‑follower synchronization process that ensures data replication and high‑watermark updates.

Big Data · Distributed Systems · High Watermark
0 likes · 16 min read