Tagged articles
3675 articles
Big Data Technology & Architecture
Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data, HDFS, Hadoop
0 likes · 15 min read
Big Data Technology & Architecture
Mar 29, 2019 · Big Data

Weekly Knowledge Digest: Apache Flink Deep Dives on JOIN LATERAL, TimeInterval, Temporal Table, and State Management

This week's digest shares a personal anecdote and a series of technical deep‑dives into Apache Flink, covering JOIN LATERAL, TimeInterval JOIN, Temporal Table JOIN, state management, and related code examples, while also previewing upcoming work schedules and recommended Flink reference articles.

Apache Flink, Big Data, SQL Join
0 likes · 5 min read
dbaplus Community
Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data, Data Skew, GC tuning
0 likes · 12 min read
Tencent Cloud Developer
Mar 27, 2019 · Industry Insights

How AI and Big Data Drive New Engineering Education: Insights from the 2019 IT Alliance Conference

The 2019 Information Technology New Engineering Alliance conference in Beijing gathered academia, research institutes, and industry leaders to discuss AI, big data, and curriculum innovation, highlighting Tencent's contributions to digital education, cloud certification, and the broader push for industry‑university collaboration in shaping future IT talent.

AI, Big Data, Cloud Computing
0 likes · 6 min read
NetEase Game Operations Platform
Mar 27, 2019 · Big Data

Embedding Python in Java with Jython for Real‑Time Big Data Jobs

This article explains why and how to embed Python code in Java using Jython for real‑time big‑data processing, covering performance benefits, memory‑leak pitfalls, singleton interpreter patterns, function factories, Java‑object conversion, and importing external PyPI packages with practical code examples.

Big Data, Dynamic Language, Embedding
0 likes · 11 min read
Big Data Technology & Architecture
Mar 22, 2019 · Big Data

Weekly Knowledge Points: Apache Flink Continuous Queries, Kafka Connectors, SQL Overview, JOIN Operator, and Table API

This weekly briefing introduces Apache Flink's continuous query mechanism, demonstrates how to integrate Kafka as a DataStream connector, provides an overview of Flink SQL features, explains the implementation and optimization of dual‑stream JOIN operators, and showcases the Table API with end‑to‑end examples.

Apache Flink, Big Data, Table API
0 likes · 3 min read
Big Data Technology & Architecture
Mar 21, 2019 · Big Data

Apache Flink Table API Tutorial and End‑to‑End Examples

This article provides a comprehensive tutorial on Apache Flink's Table API, explaining its concepts, core features, and a wide range of operators such as SELECT, WHERE, GROUP BY, UNION, JOIN, and various window functions, while offering complete Scala code examples, custom sources, sinks, and an end‑to‑end job that computes page‑view counts per region using event‑time tumbling windows.

Big Data, Flink, Scala
0 likes · 36 min read
Architects' Tech Alliance
Mar 21, 2019 · Cloud Computing

Understanding the Chinese Enterprise IT Landscape: Market Structure, Demand Drivers, and Technology Trends

This article analyzes China's massive enterprise ecosystem, the composition of its IT market, the human and political factors shaping demand, and how cloud computing, big data, and artificial intelligence are driving a new wave of digital transformation across state‑owned, internet, and other enterprises.

Artificial Intelligence, Big Data, China
0 likes · 14 min read
Tencent Cloud Developer
Mar 20, 2019 · Big Data

TVP Training Camp: Exploring Big Data Technologies and Trends

The inaugural TVP Training Camp, held on March 16, 2019 in Beijing, gathered Tencent Cloud’s TVP members and leading big‑data experts to discuss emerging technologies such as Greenplum, PMEM‑driven infrastructure, data‑operation optimization, and next‑generation cloud databases. A round‑table session addressed practical challenges and affirmed Tencent’s commitment to ongoing expert collaboration.

Big Data, Cloud Computing, Data Analytics
0 likes · 11 min read
Youzan Coder
Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Big Data, Flink, Spark Streaming
0 likes · 14 min read
Big Data Technology & Architecture
Mar 19, 2019 · Big Data

Comprehensive Overview of SQL and Apache Flink SQL Features with Practical Code Examples

This article provides an in-depth introduction to SQL, its history and ANSI standards, then details Apache Flink's SQL capabilities—including SELECT, WHERE, GROUP BY, UNION, JOIN, window functions, and user-defined functions—accompanied by extensive code examples and a complete end‑to‑end Flink job implementation.

Apache Flink, Big Data, Streaming
0 likes · 34 min read
Big Data Technology & Architecture
Mar 17, 2019 · Big Data

Understanding Continuous Queries in Apache Flink: From Static Queries to Dynamic Tables and Trigger Simulations

This article explains how Apache Flink implements continuous queries for unbounded stream processing, compares static and continuous query semantics, demonstrates how MySQL triggers can simulate continuous queries in append‑only and update scenarios, and discusses Flink's connector, source, sink, and retraction mechanisms for correct incremental computation.

Apache Flink, Big Data, Continuous Query
0 likes · 18 min read
Big Data Technology & Architecture
Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache Flink, Big Data, Exactly-Once
0 likes · 15 min read
JD Tech
Mar 13, 2019 · Operations

Evolution of JD Digital Technology’s Host Monitoring System “DiTing”: From V1 to V3

The article chronicles the design, evolution, and lessons learned of JD Digital Technology’s self‑built host monitoring platform “DiTing”, detailing its initial requirements, V1 architecture, subsequent V2 and V3 redesigns, encountered challenges, and future directions toward intelligent operations.

Big Data, Operations, System Architecture
0 likes · 12 min read
dbaplus Community
Mar 12, 2019 · Databases

Mastering HBase Cross‑Datacenter Migration: Snapshots, Architecture, and Real‑World Tips

This article provides a comprehensive technical guide on HBase, covering its core concepts, advantages and drawbacks, architecture layers, practical use cases, and a detailed step‑by‑step process for large‑scale cross‑datacenter migration using snapshot‑based strategies, with commands, diagrams, and lessons learned.

Big Data, Data Migration, Database Architecture
0 likes · 19 min read
DataFunTalk
Mar 11, 2019 · Artificial Intelligence

Practical Implementation of Personalized Recommendation Systems: Overview, Algorithms, Challenges, and Architecture

This article presents a comprehensive overview of personalized recommendation systems, covering their purpose, common algorithms, development challenges, the multi‑layer architecture used at DataGrand, optimization techniques, and the range of services offered to enterprise customers.

Big Data, collaborative filtering, machine learning
0 likes · 18 min read
DataFunTalk
Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data, Druid, Flink
0 likes · 9 min read
58 Tech
Mar 7, 2019 · Big Data

In-Memory Inverted Index Compression Algorithms: Overview and MILC Optimization for High‑Performance Search

This article reviews major in‑memory inverted index compression techniques such as PForDelta, PEF, and MILC, explains their principles and trade‑offs, and details practical optimizations applied at 58.com to achieve query performance comparable to uncompressed indexes while reducing memory usage by about 35 percent.

Big Data, MILC, algorithm
0 likes · 17 min read
AntTech
Mar 6, 2019 · Databases

How Ant Financial Scaled the 2019 Alipay New Year Red Envelope Event with GeaBase Graph Database and Real‑Time Data Intelligence

The 2019 Alipay New Year "Five Blessings" red‑envelope campaign, serving 450 million users, leveraged Ant Financial's GeaBase distributed graph database, a real‑time data‑intelligence platform, and OceanBase elastic resources to achieve millisecond‑level ranking, seconds‑level transaction audit, and seamless high‑concurrency performance.

Alipay, Backend, Big Data
0 likes · 10 min read
HomeTech
Feb 28, 2019 · Artificial Intelligence

How to Systematically Test and Monitor AI Models in Large‑Scale Production

This article presents a comprehensive approach to testing, automating, and monitoring AI prediction models in a high‑traffic environment, covering background, challenges, evaluation metrics, data sampling methods, automated test scripts, and online monitoring to ensure model accuracy, performance, and reliability.

AI testing, Big Data, Metrics
0 likes · 13 min read
Xianyu Technology
Feb 28, 2019 · Big Data

NVID Recommendation System Architecture and Technical Solutions

The NVID recommendation system for Taobao is built on a four‑layer architecture—activity material, configuration, business process, and application—and solves environment isolation, performance, audience management, and A/B testing challenges through optimized data schemas, ID mapping, multi‑level caching with database fallback, and real‑time user targeting, while future work aims at personalized audiences and automated ad optimization.

A/B testing, Big Data, System Architecture
0 likes · 11 min read
AntTech
Feb 27, 2019 · Big Data

Ant Financial Data Governance: Practices and Challenges in Data Quality Management

The article details Ant Financial’s comprehensive data quality governance framework, covering its architecture, challenges, implementation strategies, and real‑world case studies, illustrating how the company integrates data monitoring, AI‑driven self‑healing, and rigorous release controls to ensure high‑quality data across its platform.

Ant Financial, Big Data, Data Governance
0 likes · 17 min read
Qunar Tech Salon
Feb 27, 2019 · Databases

Evolution of Meituan’s Database Platform: From Manual Operations to Intelligent Automation

This article outlines Meituan’s transition of its database platform from manual, script‑based operations through tool‑ and product‑centric stages to a private‑cloud and automation era, discusses current challenges such as root‑cause analysis and staffing, and shares insights on moving toward fully intelligent, data‑driven database operations.

Big Data, Cloud Computing, Intelligent Operations
0 likes · 13 min read
Big Data Technology & Architecture
Feb 26, 2019 · Big Data

Deploying Apache Flink Clusters: Standalone and YARN Modes

This guide explains how to set up an Apache Flink cluster on CentOS 7 using three deployment methods—Local, Standalone, and Flink on YARN/Kubernetes—including host configuration, SSH setup, package distribution, configuration file editing, cluster start/stop commands, YARN resource manager concepts, session commands, job submission, fault‑tolerance settings, and log inspection.

Big Data, Cluster Deployment, Flink
0 likes · 11 min read
Big Data Technology & Architecture
Feb 25, 2019 · Big Data

Understanding Flink DataSetAPI and DataStreamAPI

This article introduces Apache Flink's DataSetAPI and DataStreamAPI, explains their source, transformation, and sink concepts, highlights the key differences in transformation handling, and notes the series' goal of publishing over 500 big‑data tutorials for learners from beginner to expert.

Big Data, DataSetAPI, DataStreamAPI
0 likes · 2 min read
Vipshop Quality Engineering
Feb 22, 2019 · Artificial Intelligence

How Vipshop Built an AI‑Powered Sentiment Analysis System for Real‑Time Customer Feedback

Vipshop's in‑house sentiment monitoring platform integrates web‑scraped reviews, WeChat comments and internal service messages, applying lexical sentiment scoring, dictionary‑based Chinese word segmentation, TF‑IDF keyword ranking and lightweight classification to deliver real‑time insights, alerts and actionable reports for thousands of daily user comments.

Big Data, NLP, Sentiment Analysis
0 likes · 17 min read
Beike Product & Technology
Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data, Data Integration, Hudi
0 likes · 13 min read
Big Data Technology & Architecture
Feb 20, 2019 · Big Data

Zookeeper: The Core Coordination Service in Big Data Systems

Zookeeper, originally a side‑project of Hadoop, is a Yahoo‑developed distributed coordination framework that provides high‑availability services such as configuration management, distributed locks, and failure handling, and has become a foundational component for many big‑data systems like Hadoop, Kafka, and Dubbo.

Big Data, Configuration Management, Coordination Service
0 likes · 3 min read
Sohu Tech Products
Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data, Shuffle, Shuffle Writer
0 likes · 13 min read
Ctrip Technology
Feb 13, 2019 · R&D Management

Ctrip’s Technology Evolution: From Call‑Center Era to Big Data and AI

The article outlines Ctrip’s three‑phase technology evolution—from a simple call‑center architecture to layered internet and mobile platforms, and finally to a cloud‑based big‑data and AI‑driven ecosystem—highlighting architectural changes, operational challenges, and strategic lessons for fast‑growing internet companies.

Big Data, Ctrip, R&D management
0 likes · 13 min read
Youzan Coder
Feb 1, 2019 · Big Data

Design and Implementation of Log Parsing for a Big Data Offline Task Platform

The article describes a log‑parsing feature for Youzan’s big‑data offline platform that captures runtime logs from Hive, Spark, DataX, MapReduce and HBase jobs, categorizes scheduling types, extracts metrics such as read/write bytes, shuffle volume and GC time, and processes them in real time via a Filebeat‑Logstash‑Kafka‑Spark‑Streaming pipeline storing results in Redis for monitoring, optimization and resource‑usage ranking.

Big Data, Resource Monitoring, YARN
0 likes · 7 min read
Didi Tech
Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data, HDFS, Hadoop
0 likes · 11 min read
DataFunTalk
Jan 30, 2019 · Artificial Intelligence

Real‑Time Metrics Processing Technology for Financial Risk Control and Anti‑Fraud

This article outlines the challenges of financial risk control in the internet era and presents a comprehensive real‑time metrics processing system, covering data leakage, fraud, big‑data opportunities, AI model deployment, and the technical architecture of the Bangsheng real‑time indicator platform.

AI, Big Data, anti‑fraud
0 likes · 17 min read
ITFLY8 Architecture Home
Jan 29, 2019 · Operations

How to Optimize Large-Scale Log Systems for Real-Time Monitoring and Scalability

This article examines the design, deployment, and optimization of massive log systems, comparing architectures, discussing real‑time versus near‑real‑time requirements, and presenting practical improvements such as memory, CPU, network tuning, data partitioning, storage reduction, and component upgrades using ELK, Kafka, Fluentd, and HBase.

Big Data, ELK, Fluentd
0 likes · 18 min read
21CTO
Jan 26, 2019 · Big Data

Data Lake vs Data Warehouse: Which One Powers Your Business?

This article explains the core differences between data lakes and data warehouses, their respective strengths, and how they complement each other to support both exploratory analytics and routine business reporting.

Analytics, Big Data, Data Lake
0 likes · 5 min read
NetEase Game Operations Platform
Jan 25, 2019 · Big Data

Understanding Exactly-Once Semantics in Apache Flink: Challenges and Implementation

This article analyzes the difficulties of achieving exactly-once delivery in Apache Flink, explains the distinction between state and end‑to‑end semantics, and details how idempotent and transactional sinks—illustrated with the Bucketing File Sink—realize exactly‑once guarantees through checkpoint‑based two‑phase commit.

Big Data, Exactly-Once, Flink
0 likes · 13 min read
dbaplus Community
Jan 23, 2019 · Big Data

How Zhihu Built a Scalable Data‑Sync Platform with Sqoop and DataX

This article explains Zhihu's journey from ad‑hoc MySQL‑Hive sync using Oozie + Sqoop to a unified, platform‑based data synchronization service that now handles thousands of tables, over 10 TB daily, with load‑aware scheduling, incremental pulls, schema change handling, and tight integration with their offline job scheduler.

Big Data, DataX, ETL
0 likes · 14 min read
21CTO
Jan 23, 2019 · Big Data

Can 1.4 Billion Users Fit Into One WeChat Group? A Technical Feasibility Study

This article analyzes whether the entire Chinese population could be added to a single WeChat group, examining user statistics, message volume, required bandwidth, CPU processing limits, Moore's law projections, supercomputer alternatives, hardware costs, storage demands, and practical challenges, concluding that it is theoretically possible but practically infeasible.

Big Data, Performance, Server
0 likes · 10 min read
MaGe Linux Operations
Jan 23, 2019 · Big Data

How Bloom Filters Power Fast Big Data Searches with Python

This tutorial walks through building a simple Python search engine for big data, covering Bloom filter basics, tokenization with major and minor segmentation, inverted index creation, and implementing both simple and complex (AND/OR) queries, complete with code examples and visual illustrations.

AND/OR queries, Big Data, Python
0 likes · 15 min read
Tencent Cloud Developer
Jan 17, 2019 · Artificial Intelligence

Deep Learning for Big Data Recommendation Systems: Tencent's Industrial Practice

Tencent’s industrial practice shows how a large‑scale offline‑nearline‑online “Shield” recommendation architecture, powered by the DeepR framework built on RCaffe, uses deep semantic embeddings, massive neural networks and reinforcement‑learning decisions to handle billions of daily requests, demonstrating that data richness and engineering capability, not model depth alone, drive performance in big‑data recommendation systems.

Big Data, Deep Learning, Neural Network
0 likes · 13 min read
JD Tech
Jan 17, 2019 · Operations

Technical Overview of JD's Archimedes Resource Scheduling System

The article presents a detailed technical analysis of JD's Archimedes project, describing its evolution from JDOS 2.0 to a large‑scale container scheduling platform that dramatically improves resource utilization, deployment speed, and cost efficiency across JD’s data centers.

AI, Big Data, JD
0 likes · 6 min read
Youzan Coder
Jan 16, 2019 · Big Data

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Big Data, Flink, Real-time Streaming
0 likes · 19 min read
StarRing Big Data Open Lab
Jan 16, 2019 · Big Data

What’s New in Transwarp TDH 5.2.3? Key Performance and Stability Enhancements

TDH 5.2.3 introduces a series of stability and performance upgrades—including transaction and compaction optimizations, enhanced error handling, SQL length protection, improved Oracle‑compatible UDFs, default resource pool support, Guardian caching, TxSQL monitoring, and workflow and OLAP engine fixes—aimed at delivering a more reliable big‑data platform.

Big Data, Performance, database
0 likes · 10 min read
dbaplus Community
Jan 13, 2019 · Databases

January 2019 DB-Engines Newsletter: Latest Database Releases & Key Features

The January 2019 DB-Engines newsletter compiles the newest releases, feature highlights, and performance improvements across RDBMS, NoSQL, NewSQL, time‑series, big‑data, domestic, and cloud database families, while also explaining the ranking methodology and providing download links for the full issue.

Big Data, Cloud Computing, NewSQL
0 likes · 41 min read
Youzan Coder
Jan 9, 2019 · Big Data

How Youzan Scaled 5,000 Daily SparkSQL Jobs: Migration Lessons from Hive

This article details Youzan's transition from Hive to SparkSQL, covering platform architecture, usability and performance enhancements, migration strategies, automated engine selection, and future plans that together reduced resource consumption by up to 67% while handling thousands of daily jobs.

Availability, Big Data, Data Platform
0 likes · 13 min read
dbaplus Community
Jan 3, 2019 · Backend Development

Supercharging Elasticsearch for Billion-Row Queries: Practical Tips

This guide details how to optimize Elasticsearch for handling billions of daily records, covering core Lucene concepts, index and shard configuration, performance‑tuning parameters, and practical testing methods to achieve sub‑second query responses and long‑term data retention.

Big Data, Elasticsearch, Performance Optimization
0 likes · 13 min read
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Optimizing Spark Direct Kafka Consumption: Subpartition Concurrency and Repartition Strategies

To address the long processing time caused by uneven Spark partitions when reading Kafka via the Direct approach, this article explains the SPARK‑22056 solution that modifies KafkaRDD.getPartitions to support a configurable 'topic.partition.subconcurrency' parameter, discusses its trade‑offs, and presents alternative repartition and multithreading techniques.

Big Data, Partitioning, Scala
0 likes · 6 min read
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays, replaces manual rate limits, and details its architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data, Rate Control, Spark
0 likes · 6 min read
Big Data Technology & Architecture
Jan 1, 2019 · Big Data

Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview

The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.

Big Data, Spark, Structured Streaming
0 likes · 8 min read
Architects Research Society
Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Apache, Big Data, Flink
0 likes · 22 min read
Tencent Cloud Developer
Dec 28, 2018 · Big Data

Intelligent Operations for Tencent Cloud Big Data Platform: Challenges, Practices, and Future Directions

Tencent Cloud’s big‑data platform tackles massive, multi‑component clusters by deploying an AIOps framework that aggregates logs and metrics, applies statistical and machine‑learning anomaly detection, uses regression and reinforcement learning for job‑parameter optimization, and integrates offline‑online pipelines, achieving over 88% precision. Planned work includes automated root‑cause analysis, productized tools, platform‑level algorithm integration, and cross‑domain model reuse.

Big Data, Cloud Computing, Intelligent Operations
0 likes · 20 min read
Meituan Technology Team
Dec 27, 2018 · Artificial Intelligence

Meituan’s AI Initiatives: Large‑Scale Scheduling, Unmanned Delivery, and the Meituan Brain Knowledge Graph

Meituan’s AI division, now over 1,000 engineers with a 2 billion‑CNY quarterly budget, powers massive real‑time scheduling for 20 million daily orders, unmanned delivery pilots, and the “Meituan Brain” knowledge graph of billions of entities, delivering AI‑driven services across its entire platform.

AI · Big Data · Large-Scale Scheduling
0 likes · 16 min read
Xianyu Technology
Dec 27, 2018 · Big Data

Device Fingerprinting and User Growth Architecture in Alibaba's Xianyu Platform

Alibaba’s Xianyu platform uses a multi‑signal device fingerprinting system, UMID, to uniquely identify users across Android and iOS devices, storing the data in sharded MySQL, HiStore OLAP, and Tair caches, enabling precise ad bidding, conversion tracking, and scalable user‑growth strategies.

Big Data · Information Security · System Architecture
0 likes · 9 min read
Didi Tech
Dec 26, 2018 · Industry Insights

How Didi Implements Full‑Chain Data Tiered Protection for Reliable Operations

Facing growing data‑driven pressures, Didi designed a full‑link data tiered protection framework that defines classification standards, integrates data levels across the entire pipeline, and applies concrete safeguards and tooling to improve resource allocation, backup reliability, and overall data reliability.

Big Data · Data Governance · Didi
0 likes · 7 min read
Alibaba Cloud Developer
Dec 20, 2018 · Big Data

Unlocking Alibaba’s Massive Cluster Data V2018: A Treasure Trove for Big‑Data Research

Alibaba has released the comprehensive Cluster Data V2018 dataset, detailing eight days of operation for 4,000 servers and their mixed online and offline workloads, including DAG information, enabling researchers to study large‑scale data‑center performance, resource utilization, and scheduling algorithms, and to derive new insights.

Big Data · DAG · Dataset
0 likes · 7 min read
Didi Tech
Dec 18, 2018 · Big Data

Evolution and Architecture of Didi's Real-Time Computing Platform

From early self‑built Storm and Spark Streaming clusters to a unified YARN‑based Spark platform and finally a low‑latency Flink system with extended CEP and StreamSQL capabilities, Didi’s real‑time computing platform evolved through three stages, delivering multi‑tenant isolation, rich SQL processing, and dramatically reduced development costs.

Big Data · CEP · Flink
0 likes · 9 min read
Qunar Tech Salon
Dec 18, 2018 · Big Data

Practical Insights on Deploying and Operating Elasticsearch at Scale

This article shares extensive practical experience from Qunar's large‑scale Elasticsearch deployment, covering suitable use cases, index‑type design, document ID strategies, scaling considerations for index and data volume, hardware sizing, and storage architecture recommendations to help newcomers avoid common pitfalls.

Big Data · Elasticsearch · indexing
0 likes · 10 min read
JD Tech
Dec 17, 2018 · Operations

Improving JD Intelligent Supply Chain Efficiency and System Stability for Major Sales Events

The article details JD's intelligent supply chain enhancements—including machine‑learning demand forecasting, a new "explosive product warehouse" model, non‑stock fulfillment visualization, blockchain‑based product traceability, and comprehensive system‑stability measures such as data‑consistency checkpoints, throughput buffering, and 24/7 incident response—to boost efficiency and reliability during large‑scale promotions.

Big Data · Blockchain · Operations
0 likes · 7 min read
Youzan Coder
Dec 14, 2018 · Operations

Youzan Full‑Link Load Testing Architecture and Implementation

Youzan’s full‑link load‑testing architecture combines a traffic generator, a data‑factory pipeline, and the Maxim platform to replay realistic e‑commerce user actions, tag and isolate test traffic via unified headers, route reads/writes to shadow storage, and integrate Gatling for capacity planning, degradation, alarm, disaster‑recovery and throttling drills.

Big Data · Data Isolation · Distributed Systems
0 likes · 13 min read
JD Retail Technology
Dec 12, 2018 · Big Data

Construction and Architecture of JD Overseas Data Analysis Platform (Columbus Platform)

JD.com’s overseas data analysis platform, dubbed the Columbus platform, combines a lightweight data warehouse deployment with standardized, customizable BI tools to provide real‑time and offline analytics, visualization, KPI management, and future self‑service reporting and predictive capabilities for its global e‑commerce operations.

Analytics · BI · Big Data
0 likes · 9 min read
Manbang Technology Team
Dec 12, 2018 · Big Data

Kafka Overview: Core Concepts, Architecture, Configuration, and Usage in Real-Time Computing

This article provides a comprehensive technical overview of Kafka, covering its core concepts, producer and consumer models, architecture, configuration parameters, replication mechanisms, performance optimizations, operational monitoring, tooling scripts, and related product implementations for real-time data processing.

Architecture · Big Data · Kafka
0 likes · 18 min read
JD Tech
Dec 11, 2018 · Big Data

Introduction to Graph Computing and the JoyGraph System

This article introduces graph computing, compares it with graph databases, surveys notable graph processing systems, and details the architecture, NUMA‑aware design, execution model, push/pull dual mode, and load‑balancing strategies of the JoyGraph framework while outlining its future development directions.

Big Data · JoyGraph · NUMA
0 likes · 9 min read
NetEase Game Operations Platform
Dec 5, 2018 · Big Data

Presto + Alluxio Architecture for Interactive Ad‑hoc Queries in NetEase Game Data Warehouse

This article describes how NetEase Games built a Presto‑based interactive ad‑hoc query platform backed by Alluxio caching to achieve sub‑10‑second query latency, outlines the architectural design, performance comparisons with other Hadoop‑based solutions, encountered issues, and future improvement plans.

Alluxio · Big Data · Performance
0 likes · 10 min read
AntTech
Dec 4, 2018 · Artificial Intelligence

Highlights from the 7th China Small‑and‑Medium Bank Development Summit on FinTech and Risk Management (Nov 29‑30 2018, Guangzhou)

The 7th China Small‑and‑Medium Bank Development Summit held in Guangzhou on November 29‑30 2018 gathered over 200 banking and fintech leaders to discuss the latest trends, challenges, and strategies in financial technology, digital transformation, risk control, and emerging technologies such as AI, big data, cloud and blockchain.

Artificial Intelligence · Big Data · Cloud Computing
0 likes · 14 min read
DataFunTalk
Dec 4, 2018 · Artificial Intelligence

Application and Exploration of Financial Knowledge Graphs

This article presents a comprehensive overview of financial knowledge graphs, covering their historical evolution, theoretical foundations, technical stack, implementation steps, and real‑world case studies in banking, regulatory technology, and securities, while highlighting community resources for AI and big‑data practitioners.

AI · Big Data · Financial AI
0 likes · 14 min read
JD Tech
Nov 28, 2018 · Operations

Technical Systems Behind JD Logistics for the 11.11 Global Shopping Festival

The article details how JD Logistics’ extensive warehouse, routing, distribution, and fulfillment systems—leveraging big data, AI, GIS, IoT, and distributed architectures—were engineered and optimized to handle the massive order surge during the 11.11 Global Shopping Festival with high throughput, low latency, and zero incidents.

AI · Big Data · GIS
0 likes · 8 min read
DataFunTalk
Nov 24, 2018 · Big Data

The Evolution of iQIYI's Big Data Analytics Platform

This article chronicles iQIYI’s journey from a simple Hive‑based data pipeline to the sophisticated, multi‑engine “Tongtian Tower” platform, detailing the development of the Magic Mirror system, the Gear workflow manager, BabelBD, the Monet visual analytics tool, and the integrated BI ecosystem that now supports billions of daily users.

BI · Big Data · data engineering
0 likes · 18 min read
Tencent Cloud Developer
Nov 23, 2018 · Big Data

20 Free and Open-Source Data Visualization Tools

These 20 free and open‑source data visualization tools—from JavaScript libraries like D3.js and Chartist.js to user‑friendly platforms such as Datawrapper, Google Data Studio, and Tableau Public—enable businesses and analysts to transform raw data into interactive charts, maps, timelines, and dashboards, improving insight, decision‑making, and profitability.

Big Data · Data visualization · JavaScript libraries
0 likes · 12 min read