Tagged articles

3675 articles


Jun 16, 2020 · Databases

How Youku Scales Billions of Video Nodes with Real‑Time Graph Databases

Facing billions of video entities and edges, Youku’s engineering team replaced traditional relational stores with a graph‑based knowledge platform, leveraging Alibaba’s Blink streaming engine and Lindorm to enable real‑time, incremental updates, unified UDF logic, and scalable feature computation for search and recommendation.

Big Data · Graph Database · Knowledge graph

0 likes · 10 min read


Big Data Technology & Architecture

Jun 16, 2020 · Big Data

Hot‑Warm Architecture in Elasticsearch 5.x: Node Types, Index Allocation and Curator Automation

The article explains how to design a time‑based Elasticsearch cluster using a hot‑warm architecture with dedicated master, hot, and warm nodes, shows how to configure node attributes, allocate indices via settings or Curator, and discusses best‑practice compression and rollover strategies for large‑scale log data.

Big Data · Elasticsearch · Hot‑Warm Architecture

0 likes · 8 min read


Java Backend Technology

Jun 16, 2020 · Big Data

How Kafka’s Architecture and Memory Pool Reduce JVM GC for High Throughput

This article explains how Kafka’s design—its broker architecture, use of sequential disk I/O, PageCache, Sendfile, and a custom memory buffer pool—optimizes JVM garbage collection and achieves massive throughput in big‑data messaging scenarios.

Big Data · GC optimization · High Throughput

0 likes · 21 min read


Big Data Technology & Architecture

Jun 15, 2020 · Big Data

Hive Optimization Techniques and Best Practices for Big Data Processing

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, tuning MapReduce jobs, handling data skew in joins, and selecting appropriate storage formats for large‑scale data warehouses.

Big Data · Data Skew · HiveQL

0 likes · 19 min read


JD Retail Technology

Jun 15, 2020 · Industry Insights

How JD.com’s Smart Supply Chain Powered the 618 Mega‑Sale: Strategies & Algorithms

The article details JD.com’s Y Business Management Department’s data‑driven, algorithmic approaches to inventory forecasting, replenishment, allocation, and fulfillment during the 618 promotion, describing how big‑data predictions, dynamic programming, ADMM column generation, and cross‑department collaboration optimized costs, reduced stockouts, and enhanced customer experience amid pandemic challenges.

Algorithmic Forecasting · Big Data · E‑commerce Operations

0 likes · 21 min read


DataFunTalk

Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data · Cost Optimization · Spark

0 likes · 19 min read


DataFunTalk

Jun 14, 2020 · Big Data

Practical Experience and Optimization of Apache Druid for Real‑Time OLAP at iQIYI

This article describes how iQIYI evaluated various OLAP engines and selected Apache Druid for real‑time analytics, outlines Druid's architecture, and explains how the team identified performance bottlenecks in the Coordinator, Overlord, and indexing, applied configuration and resource‑allocation optimizations, and built a user‑friendly RAP platform to democratize real‑time data analysis.

Apache Druid · Big Data · Data Platform

0 likes · 15 min read


Big Data Technology & Architecture

Jun 13, 2020 · Big Data

Achieving Exactly-Once Semantics in Kafka and Spark Streaming

This article explains the three message delivery semantics in distributed stream processing, compares Kafka‑Spark Streaming integration methods (receiver vs direct stream), and details how to achieve exactly‑once guarantees through idempotent or transactional writes, including code examples.

Big Data · Exactly-Once · Kafka

0 likes · 8 min read


Beike Product & Technology

Jun 12, 2020 · Big Data

Design and Implementation of SQL on Streaming (SQL 1.0 → SQL 2.0) in a Real‑Time Computing Platform

This article describes the evolution of a real‑time computing platform from SQL 1.0 built on Spark Structured Streaming to SQL 2.0 powered by Flink‑SQL, covering dynamic tables, continuous queries, dimension‑table joins, cache optimization, DDL extensions, platformization, operational challenges and future roadmap.

Big Data · Dimension Table · Flink

0 likes · 19 min read


Alibaba Cloud Developer

Jun 12, 2020 · Operations

How VTrace Automates Cloud‑Scale Packet‑Loss Diagnosis

VTrace is an automated diagnostic system that leverages big‑data processing to instantly reconstruct traffic paths and pinpoint the root causes of persistent packet loss in cloud‑scale overlay networks, dramatically simplifying network operations and cutting troubleshooting time from hours to minutes.

Big Data · Packet Loss · SIGCOMM

0 likes · 12 min read


Architect

Jun 11, 2020 · Big Data

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

This article explains Apache Flink's distributed system architecture—including JobManager, ResourceManager, TaskManager, and Dispatcher—covers session and job deployment modes, data transfer mechanisms, event‑time handling with watermarks, various state types and backends, scaling strategies, and the checkpoint/savepoint recovery process.

Apache Flink · Big Data · Event Time

0 likes · 15 min read


Big Data Technology Architecture

Jun 11, 2020 · Big Data

Optimizing Workflow in Data Warehouse Construction

This article analyzes workflow scenarios in data warehouse construction, proposes an optimization scheme that abstracts workflow nodes into task and instance layers, and demonstrates how task attributes and generation rules can improve configurability, dependency management, and collaborative development for large‑scale data warehouse projects.

Big Data · ETL · Workflow

0 likes · 19 min read


DataFunTalk

Jun 11, 2020 · Big Data

Real-time Multi-dimensional Analytics and SlimBase State Backend at Kuaishou: Flink Applications and Optimizations

This article presents Kuaishou's extensive use of Apache Flink for real-time multi-dimensional analytics, detailing the platform's architecture, cluster scale, data processing pipelines, the design of a shared state storage engine called SlimBase, and performance improvements achieved through replacing RocksDB with a customized HBase‑based solution.

Big Data · Flink · Kuaishou

0 likes · 15 min read


Alibaba Cloud Developer

Jun 11, 2020 · Artificial Intelligence

How to Maximize Video Views with a Multi‑Objective Exposure Optimization Model

This article presents a data‑driven approach for allocating limited video exposure resources by building a PV‑click‑CTR (P2C) sensitivity model and a multi‑objective optimization framework that balances overall view volume and fairness across scenes, validated through offline metrics and online bucket tests.

Big Data · algorithm · exposure optimization

0 likes · 9 min read


58 Tech

Jun 10, 2020 · Big Data

Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

Big Data · Data Quality · Flink

0 likes · 9 min read


Big Data Technology & Architecture

Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big Data · Dstream · Scala

0 likes · 17 min read


Big Data Technology & Architecture

Jun 7, 2020 · Big Data

A Unified View of SQL‑on‑Hadoop Systems: Architecture, Execution Plans, Optimizations, and Storage Formats

The article provides a comprehensive overview of SQL‑on‑Hadoop query engines such as Hive, Impala, Presto and Spark SQL, comparing their runtime frameworks, core components, compilation steps, optimizer strategies, CPU/IO efficiency techniques, storage formats like ORC and Parquet, and resource management in a unified perspective.

Big Data · Query Engine · SQL on Hadoop

0 likes · 24 min read


Big Data Technology & Architecture

Jun 4, 2020 · Big Data

Kafka for Data Ingestion and Event Distribution: Production‑Consumer and Publish‑Subscribe Patterns

This article explains how Kafka can be used for data ingestion and event distribution by illustrating production‑consumer and publish‑subscribe models, describing core concepts such as topics, partitions and consumer groups, and offering practical design options for handling different event scenarios.

Big Data · Event Distribution · Kafka

0 likes · 9 min read

Big Data Technology Architecture

Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop

0 likes · 23 min read


Tencent Cloud Developer

Jun 4, 2020 · Industry Insights

How Enterprise Architecture Drives Banking Digital Transformation

The article reviews a Tencent Cloud closed‑door forum where industry experts dissect the banking sector’s digital transformation, outlining enterprise‑architecture‑driven strategies, key enabling technologies such as cloud, AI, big data and blockchain, and practical insights from Q&A sessions and ThoughtWorks’ models.

Artificial Intelligence · Banking · Big Data

0 likes · 19 min read


Top Architect

Jun 4, 2020 · Big Data

Elasticsearch Deployment and Use Cases in Major Chinese Companies

This article reviews how leading Chinese internet companies such as JD.com, Ctrip, Qunar, 58.com, and Didi have adopted Elasticsearch for large‑scale order search, log analysis, real‑time monitoring, and security, describing the evolution of cluster architectures, shard strategies, multi‑cluster pipelines, and performance optimizations.

Big Data · Elasticsearch · Scalability

0 likes · 12 min read


Big Data Technology & Architecture

Jun 4, 2020 · Big Data

Understanding Flink StreamingFileSink: File States, Rolling Policies, and Example Code

This article explains Flink's StreamingFileSink in version 1.10.0, covering how files transition through In‑progress, Pending, and Finished states, the bucket assignment and rolling policies, and provides a complete Java example for writing string data to files.

Big Data · File Rolling · Flink

0 likes · 6 min read


Huawei Cloud Developer Alliance

Jun 3, 2020 · Big Data

How to Connect Python to Presto on Huawei MRS: Step-by-Step Guide & Common Pitfalls

Learn how to set up a Python environment on an Ubuntu ECS, install the presto‑python‑client and PyHive libraries, configure Kerberos and SSL credentials, run sample queries against a Presto coordinator, and avoid typical errors such as NTP, SSL and authentication issues.

Big Data · Kerberos · Presto

0 likes · 6 min read


dbaplus Community

Jun 2, 2020 · Big Data

How Cainiao Built a Scalable Real‑Time Data Warehouse with Flink

Facing growing order volumes and strict timeliness demands, Cainiao’s tech team overhauled its real‑time data warehouse by redesigning data models, adopting Flink for streaming computation, upgrading data services, and exploring innovative tools, sharing practical lessons and future directions for large‑scale logistics analytics.

Big Data · Flink · Logistics

0 likes · 18 min read


Tencent Cloud Developer

Jun 2, 2020 · Big Data

Real‑time OLAP Analytics for QQ Music Using ClickHouse and Tencent Cloud EMR

QQ Music’s new real‑time OLAP platform, built on ClickHouse, Superset and Tencent Cloud EMR, ingests petabyte‑scale streaming and batch data with SSD‑backed ZooKeeper, load‑balanced writes, optimized partitions and read/write separation, delivering second‑level query responses that are several times faster than Hive, Presto or SparkSQL and enabling self‑service BI for thousands of users.

Big Data · ClickHouse · OLAP

0 likes · 12 min read


Big Data Technology & Architecture

May 31, 2020 · Big Data

Zookeeper Architecture, Roles, and Core Mechanisms

This article provides a comprehensive overview of Apache Zookeeper, detailing its purpose as a distributed coordination service, its key uses such as cluster management, configuration management, naming, distributed locking, and queue management, as well as its architecture, message types, Znode structures, read/write processes, Zab and Fast Paxos protocols, server states, and watcher mechanism.

Big Data · Configuration Management · Distributed Coordination

0 likes · 14 min read

Architect

May 30, 2020 · Big Data

Understanding Flink’s Unified Programming API for Batch and Streaming Jobs

This article examines Apache Flink’s programming model, comparing its batch DataSet API with the streaming DataStream API, detailing class hierarchies, key code examples such as groupBy and job submission, and explaining how both paradigms are unified into a common JobGraph representation.

Batch Processing · Big Data · Flink

0 likes · 9 min read


MaGe Linux Operations

May 30, 2020 · Big Data

What Is Kafka? A Deep Dive into Distributed Streaming and Messaging

Kafka, originally developed at LinkedIn and now an Apache project, is a distributed streaming platform that provides high‑throughput, durable publish‑subscribe messaging. This article explains its core concepts, message system classifications, architecture components, APIs, replication, consumer groups, and guarantees, and compares it with other messaging solutions.

Big Data · Distributed Streaming · Kafka

0 likes · 17 min read


Architects Research Society

May 29, 2020 · Big Data

Outcome‑Driven Enterprise Data Strategy: The Importance of Tools, Technology, and Automation

The article explains how a well‑defined technology roadmap, encompassing architecture, data governance, storage, analytics, and automation, is essential for aligning tools and techniques with business goals to achieve a successful, outcome‑driven enterprise data strategy.

AI · Automation · Big Data

0 likes · 10 min read


iQIYI Technical Product Team

May 29, 2020 · Big Data

iQiyi's Full-Link Automated Monitoring Platform: Design and Implementation

iQiyi’s full‑link automated monitoring platform unifies tracing, metric, and log collection with deep offline and real‑time analysis, delivering a DAG‑based call graph, near‑real‑time ingestion of tens of millions of logs, multi‑dimensional alerts, and rapid root‑cause diagnosis. It has cut error‑lookup time by over 50% and now serves as a core component of the company’s microservice reference architecture.

Architecture · Big Data · Metrics

0 likes · 12 min read


Big Data Technology & Architecture

May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce

0 likes · 11 min read


Big Data Technology & Architecture

May 26, 2020 · Information Security

Step-by-Step Guide to Integrating Kerberos Authentication with the Cloudera Platform

This article provides a comprehensive tutorial on Kerberos fundamentals, its authentication workflow, and detailed procedures for installing, configuring, and enabling Kerberos security on a Cloudera (Hadoop) cluster running on CentOS, including code snippets, configuration files, and post‑deployment testing steps.

Authentication · Big Data · Cloudera

0 likes · 17 min read


StarRing Big Data Open Lab

May 26, 2020 · Cloud Computing

How TCOS 2.0 Empowers Big Data, AI, and Cloud Workloads with Enhanced Compatibility

TCOS 2.0, the container operating system from Transwarp, expands compatibility to Windows, ARM, MIPS, and domestic platforms, adds GPU heterogeneous scheduling, HPA autoscaling, enhanced local storage management, and improved monitoring, providing a robust foundation for big data, AI, and cloud-native applications.

Big Data · Container · GPU scheduling

0 likes · 11 min read


dbaplus Community

May 24, 2020 · Big Data

Why Cross-Index Queries Matter in Elasticsearch and How to Implement Them

This article explains why Elasticsearch cross-index queries are essential, outlines their technical principles, showcases classic use cases such as business analytics, big‑data pipelines and log management, and provides practical methods, code examples, and performance considerations for effective implementation.

Big Data · Cross-Index Query · Elasticsearch

0 likes · 10 min read


Big Data Technology & Architecture

May 22, 2020 · Big Data

Understanding Kafka's ZooKeeper Paths and Their Stored Metadata

This article explains how ZooKeeper stores Kafka's coordination data by detailing the predefined ZK paths, the JSON structures for broker, topic, partition, controller, and consumer information, and the auxiliary nodes used for replica election and partition reassignment.

Big Data · Broker metadata · Kafka

0 likes · 8 min read


Architect

May 21, 2020 · Big Data

Parallel Execution of Multiple Spark Jobs to Optimize Resource Utilization and Reduce Parquet File Count

This article examines how to run several Spark jobs concurrently on a shared SparkContext, balancing full CPU‑vcore utilization with the need to generate fewer Parquet files, and presents practical experiments, scheduling strategies, and performance results.

Big Data · Job Scheduling · Parallelism

0 likes · 12 min read


macrozheng

May 21, 2020 · Big Data

Mastering Kafka: Core Concepts, Architecture, and Reliability Guarantees

This comprehensive guide covers Kafka's definition, publish/subscribe model, key components, storage mechanisms, producer and consumer strategies, and reliability features such as ACK levels, ISR, and exactly‑once semantics, providing a solid foundation for real‑time big‑data processing.

Big Data · Distributed Systems · Kafka

0 likes · 16 min read


Big Data Technology Architecture

May 21, 2020 · Big Data

Near Real-Time Ingestion, Analysis, Incremental Pipelines, and Data Distribution with Apache Hudi

The article explains how Apache Hudi enables near‑real‑time data ingestion from various sources, supports low‑latency analytics, provides incremental processing pipelines, and simplifies data distribution on Hadoop, improving efficiency and reducing operational complexity.

Apache Hudi · Big Data · Hadoop

0 likes · 6 min read


Suning Technology

May 20, 2020 · Artificial Intelligence

How AI and Big Data Can Transform Smart Communities into One‑Hour Life Zones

The article examines how smart community initiatives, driven by AI, big data and cloud technologies, can overcome regional imbalances and information silos to create integrated, efficient, and resident‑friendly services that form a one‑hour lifestyle ecosystem.

AI · Big Data · Cloud Computing

0 likes · 6 min read


Suning Technology

May 19, 2020 · Big Data

Unlocking Big Data Value: Strategies for Public Data Sharing and Governance

The article examines proposals from Suning chairman Zhang Jindong for advancing public data sharing in China, including a data governance committee, a unified sharing platform, security frameworks, and education initiatives, with the aim of breaking down information silos and accelerating the digital economy.

Big Data · Data Governance · Digital Economy

0 likes · 5 min read


Huawei Cloud Developer Alliance

May 19, 2020 · Cloud Computing

Huawei Cloud May Newsletter: AI, Big Data, IoT, Security & Hands‑On Guides

The May 2020 Huawei Cloud Community newsletter curates expert articles and tutorials on AI dark data handling, Apache CarbonData advances, agile retrospectives, IoT Greengrass on Raspberry Pi, Kubernetes networking, WAF anti‑scraping, openEuler contributions, Kunpeng cloud case studies, and ModelArts practical projects.

AI · Big Data · IoT

0 likes · 6 min read

DataFunTalk

May 18, 2020 · Artificial Intelligence

Intelligent Investment Research and Financial Sentiment Monitoring with NLP and Big Data

This article describes how advanced natural‑language‑processing, big‑data, and deep‑learning techniques are integrated into an end‑to‑end platform for financial asset management, covering large‑scale bid‑tender text analysis, few‑shot sentiment monitoring, model architectures, data‑enhancement methods, and practical deployment results.

Big Data · Financial AI · NLP

0 likes · 28 min read


Alibaba Cloud Developer

May 17, 2020 · Big Data

Inside Alibaba’s Fuxi DAG 2.0: Boosting Big Data Workloads with Dynamic Scheduling

Alibaba’s Fuxi DAG 2.0 redesign separates logical and physical graphs, introduces dynamic scheduling, unified offline and near‑real‑time execution, and a flexible bubble mode, enabling massive big‑data jobs to run up to five times faster while dramatically reducing resource waste.

Big Data · DAG · Distributed Systems

0 likes · 38 min read


Selected Java Interview Questions

May 16, 2020 · Big Data

How Reddit Counts Page Views at Scale Using HyperLogLog and Kafka

The article explains Reddit's large‑scale page‑view counting system, detailing its real‑time requirements, the challenges of naive hash‑set storage, and how a hybrid approach that combines linear probabilistic counting with HyperLogLog, together with Kafka, Redis, and Cassandra, achieves accurate, low‑memory, near‑real‑time analytics.

Big Data · HyperLogLog · Kafka

0 likes · 7 min read


Big Data Technology & Architecture

May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a comprehensive step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, verification steps, and detailed solutions to common errors such as Zookeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop

0 likes · 26 min read


Didi Tech

May 15, 2020 · Artificial Intelligence

Key Factors for Effective Data Product Development and Algorithm Engineer Evaluation

Effective data product development hinges on deep business understanding, clear metric decomposition, rigorous model evaluation, and translating technical performance into business impact, while algorithm engineers are best assessed by publication quality, problem significance, algorithmic contribution, and practical interview questions on model tuning and improvement.

Big Data · Data Product · algorithm evaluation

0 likes · 10 min read


DataFunTalk

May 14, 2020 · Big Data

Building a Real-Time Data Warehouse at Cainiao: Architecture, Model Upgrades, Engine Enhancements, and Service Innovations

This article shares Cainiao's practical experience in constructing a real-time data warehouse, covering the shortcomings of the previous architecture, the evolution of data models, the migration to Flink with advanced features like retraction and timer services, and the modernization of data services and tooling to support high‑throughput logistics scenarios.

Big Data · Data Service · Flink

0 likes · 16 min read


Big Data Technology & Architecture

May 14, 2020 · Big Data

Understanding Flink 1.10 TaskManager Memory Model and Configuration Parameters

This article explains the new unified TaskManager memory model introduced in Flink 1.10, detailing each memory component, its configuration parameters, how they map to JVM settings, and practical guidance for both standalone and containerized deployments, including a concrete YARN example.

Batch · Big Data · Flink

0 likes · 10 min read


Top Architect

May 14, 2020 · Big Data

Kafka Overview, Architecture, Installation, and Operational Guide

This article provides a comprehensive introduction to Kafka, covering its definition, message queue concepts, architecture components, installation steps, configuration details, startup procedures, operational commands, producer and consumer mechanisms, reliability guarantees, partition strategies, offset management, and performance optimizations.

Big Data · Consumer · Installation

0 likes · 22 min read


Big Data Technology & Architecture

May 13, 2020 · Big Data

Analysis of Hadoop HDFS Data Read and Write Process

This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with NameNode and DataNodes, the role of FsDataInputStream and FsDataOutputStream, block location retrieval, pipeline replication, and file closure steps.

Big Data · Data Read · Data Write

0 likes · 8 min read


dbaplus Community

May 12, 2020 · Cloud Native

Migrating Massive Big‑Data Services to Kubernetes: Lessons from Tongcheng‑eLong

This article details how Tongcheng‑eLong transitioned from Docker‑Host deployments to a Kubernetes‑based platform for hundreds of storage and compute services, covering network integration, IP management, service synchronization, storage strategies, operator development, monitoring, logging, and the challenges and future plans they encountered.

Big Data · Cloud Native · Docker

0 likes · 17 min read


Architect

May 12, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read


Tencent Tech

May 11, 2020 · Big Data

How Tencent Scaled Elasticsearch to Thousands of Nodes: Core Kernel Optimizations Revealed

This article details Tencent's large‑scale Elasticsearch deployment, covering its massive usage scenarios, the availability, performance, cost and scalability challenges faced, and the comprehensive kernel‑level optimizations—including memory‑based throttling, storage‑model merging, off‑heap caching, rollup and metadata improvements—that enable PB‑level clusters with high reliability and low expense.

Big Data · Distributed Systems · Elasticsearch

0 likes · 27 min read


Big Data Technology & Architecture

May 10, 2020 · Big Data

Apache Beam Overview: Architecture, Programming Model, PCollection, Pipeline and Transform

This article provides a comprehensive introduction to Apache Beam, covering its unified batch‑and‑stream processing architecture, programming model, workflow patterns, Lambda and Kappa architectures, the characteristics of PCollection, pipeline construction, core transforms, I/O handling, and includes practical code examples.

Apache Beam · Big Data · Lambda architecture

0 likes · 14 min read


Java Captain

May 8, 2020 · Big Data

Elasticsearch Adoption and Architecture Cases in Major Chinese Companies

The article surveys how leading Chinese tech firms such as JD Daojia, Ctrip, Qunar, 58.com, and Didi have adopted Elasticsearch for large‑scale search, real‑time analytics, and security, detailing their evolving cluster architectures, shard strategies, data volumes, and supporting services.

Architecture · Big Data · Distributed Systems

0 likes · 11 min read

Architecture Digest

May 8, 2020 · Big Data

Elasticsearch Adoption Cases in Chinese Companies: JD.com, Ctrip, Qunar, 58.com, Didi and More

This article surveys how major Chinese internet companies such as JD.com, Ctrip, Qunar, 58.com and Didi have adopted Elasticsearch and the Elastic Stack for high‑volume order queries, log analysis, real‑time monitoring, security analytics, and large‑scale distributed search, describing their architecture evolution, shard strategies, and operational practices.

Big Data · Log Analytics · Search Architecture

0 likes · 16 min read

HomeTech

May 7, 2020 · Big Data

Construction and Evaluation of User Profiles: Identification, Tagging, Storage, and Quality Assessment

This article explains how to build user profiles by distinguishing persona from profile, describing the evolution of ID‑mapping techniques, designing a multi‑layer tag system, implementing statistical, interest, and model tags, storing the data in Hive, HBase, Codis and Elasticsearch, and finally evaluating profile timeliness, coverage and accuracy.

Big Data · data storage · data tagging

0 likes · 11 min read

Big Data Technology & Architecture

May 6, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring a Hadoop Cluster on Three Virtual Machines

This article provides a comprehensive, hands‑on tutorial for preparing three VMs, installing JDK and Hadoop, configuring core‑site.xml, hdfs‑site.xml, mapred‑site.xml, yarn‑site.xml, setting environment variables, distributing the package, starting HDFS and YARN, and verifying the cluster via web UI and jps commands.

Big Data · Cluster Setup · HDFS

0 likes · 14 min read

Architecture Digest

May 4, 2020 · Databases

HBase Overview, Architecture, Installation, and Basic Shell Operations

This article provides a comprehensive introduction to HBase, covering its origins, key characteristics, architecture components, installation steps, basic shell commands for table management, data structures, read/write processes, and high‑availability configuration within the Hadoop ecosystem.

Big Data · HBase · Hadoop

0 likes · 14 min read

Architecture Digest

May 3, 2020 · Big Data

Kafka Concept Overview

This article provides a comprehensive introduction to Kafka, covering its definition, message‑queue models, architecture components, installation steps, configuration details, producer and consumer mechanisms, reliability guarantees, partition assignment strategies, offset management, and high‑performance read/write techniques.

Big Data · Consumer · Kafka

0 likes · 20 min read

Open Source Linux

May 2, 2020 · Big Data

Mastering Apache Zookeeper: Core Concepts and Real-World Big Data Applications

Apache Zookeeper is an open‑source coordination service that provides reliable distributed synchronization, configuration management, and naming for big‑data components such as Hadoop, HBase, and Kafka, offering features like hierarchical znode structures, watches, master election, and distributed locks to maintain cluster health.

Apache Zookeeper · Big Data · Cluster Management

0 likes · 17 min read

21CTO

Apr 30, 2020 · Big Data

How to Choose a Worthwhile Technology: A Big Data Engineer’s 3‑Step Framework

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help professionals evaluate whether a technology is worth investing time in, illustrated with real‑world examples from Hadoop, Spark, and Flink.

Big Data · Flink · Hadoop

0 likes · 10 min read

Didi Tech

Apr 30, 2020 · Big Data

Didi’s Real‑Time Computing Practices with Apache Flink and StreamSQL

Didi has unified its real‑time computing on Apache Flink, creating an enhanced StreamSQL service with extended DDL, built‑in parsers and UDX, supporting thousands of nodes, millions of jobs, and trillions of daily records, while addressing state management, high availability, multi‑language UDFs, and pursuing real‑time ML and data‑warehouse integration.

Apache Flink · Big Data · Didi

0 likes · 13 min read

Zhengtong Technical Team

Apr 30, 2020 · Big Data

Design and Performance Optimization of an Intelligent Search System for City Operations Big Data Center

This article describes the background, requirement‑driven prototype design, Elasticsearch‑based query‑DSL selection, and extensive performance tuning—including hardware configuration, indexing parameters, JVM and garbage‑collector adjustments—that enabled real‑time ingestion of hundreds of thousands of records and sub‑second search responses for a city‑wide data platform.

Big Data · Cluster Tuning · Elasticsearch

0 likes · 12 min read

Suning Technology

Apr 30, 2020 · Big Data

How AI‑Powered ‘Fast‑Pick’ Warehouses Boost Carrefour’s Delivery Speed

Carrefour’s new AI‑driven fast‑pick warehouse system, built by Suning, cuts picking time by half and enables over 95% of more than 1,000 daily orders to be delivered within an hour, illustrating the power of big‑data logistics in modern retail.

AI · Big Data · Digital Transformation

0 likes · 6 min read

Alibaba Cloud Developer

Apr 28, 2020 · Big Data

How Alibaba Tests Big Data AI Applications: Six Challenges and Solutions

This article explains how Alibaba's search, recommendation, and advertising platforms handle the unique quality challenges of big‑data AI applications, detailing six major testing problems and the comprehensive strategies—including functional, real‑time, performance, and stability testing—used to ensure reliable online services.

AI testing · Big Data · DevOps

0 likes · 27 min read

Big Data Technology & Architecture

Apr 28, 2020 · Big Data

Big Data Practice Exercises: Spark, Kafka, and MySQL Integration with Scala and Java

This article presents a series of hands‑on big‑data exercises, including Spark Scala data analysis, Kafka topic creation and custom partitioning, and MySQL table design with Scala‑based streaming calculations, providing complete source code and step‑by‑step solutions for each task.

Big Data · Kafka · MySQL

0 likes · 25 min read

Python Programming Learning Circle

Apr 28, 2020 · Big Data

Multiple Ways to Create New Columns in PySpark DataFrames

This tutorial explains several techniques for adding new columns to PySpark DataFrames—including native Spark functions, user‑defined functions, RDD transformations, Pandas UDFs, and SQL queries—while demonstrating data loading, schema handling, and code examples for each method.

Big Data · Column Creation · PySpark

0 likes · 9 min read

Suning Technology

Apr 27, 2020 · Artificial Intelligence

How 20 Retail Tech Companies Navigate Pandemic Challenges and Spot New Opportunities

A comprehensive survey by Suning Retail Technology Institute and the Asia‑Pacific Smart Retail Industry Alliance reveals how twenty retail technology firms faced COVID‑19 disruptions, optimized operations, accelerated digital transformation, and identified emerging growth points in AI, big data, and omnichannel retail.

Artificial Intelligence · Big Data · COVID-19 Impact

0 likes · 17 min read

dbaplus Community

Apr 26, 2020 · Big Data

Evolving from Data Warehouses to Data Middle Platforms: Architecture & Practices

This talk reviews China's big‑data evolution from early enterprise data warehouses to modern data middle platforms, outlines core architectural components, technology selections, data development practices, lifecycle and quality management, and shares practical Q&A insights for building scalable, cost‑effective data infrastructures.

Big Data · Data Architecture · Data Governance

0 likes · 28 min read

Suning Technology

Apr 26, 2020 · Big Data

How Data Is Redefining Retail: From Online‑Offline Fusion to C2M Growth

Amid the rapid rise of the digital economy, China now treats data as a new production factor, driving innovations such as 5G, AI, and big-data-enabled retail, where companies like Suning integrate online and offline channels, leverage C2M models, and boost efficiency through data-driven operations.

Big Data · C2M · China

0 likes · 7 min read

Big Data Technology & Architecture

Apr 25, 2020 · Big Data

Integrating SparkSQL with Hive: Configuration, MetaStore Setup, and Example Scala Code

This article explains the differences between Spark on Hive and Hive on Spark, then provides step‑by‑step instructions for configuring Hive MetaStore, setting up SparkSQL to use Hive, and demonstrates a complete Scala program that creates a Hive table, loads data, and queries it.

Big Data · Data Integration · MetaStore

0 likes · 7 min read

360 Tech Engineering

Apr 24, 2020 · Operations

Design and Implementation of Instance Configuration Management in xManager

This article explains how xManager implements comprehensive instance configuration management for big‑data services, covering configuration groups, versioning, node‑level differentiation, database schema, installation workflow, group creation, configuration changes, version switching, and deployment scripts.

Big Data · Configuration Management · Deployment

0 likes · 11 min read

Qunar Tech Salon

Apr 24, 2020 · Databases

Applying Apache Doris in Meituan Food Delivery Data Warehouse: Dual Engine Architecture and Performance Optimizations

The article details Meituan's food‑delivery data warehouse transformation from a MOLAP‑centric design to a dual‑engine (MOLAP + ROLAP) architecture powered by Apache Doris, describing the challenges of massive, mutable data, the technical trade‑offs, and the performance gains achieved through MPP, predicate push‑down, multi‑instance concurrency, colocate joins, and bitmap aggregation.

Apache Doris · Big Data · MOLAP

0 likes · 16 min read

Suning Technology

Apr 23, 2020 · Big Data

How Data Fusion Drives Retail Revival: Lessons from Suning’s Digital Transformation

The article examines China's push to develop data factor markets and showcases how Suning’s integration of online and offline data, big‑data analytics, and AI is revitalizing traditional retail, illustrating the broader impact of digital transformation on the post‑pandemic economy.

AI · Big Data · Data Integration

0 likes · 8 min read

DataFunTalk

Apr 22, 2020 · Big Data

Didi's Real-Time Computing Practices with Apache Flink: Architecture, StreamSQL, and Operational Insights

Senior Didi technology expert Liang Li-yin shares how Didi leverages Apache Flink for large‑scale real‑time computing, covering service architecture, StreamSQL advantages, multi‑cluster management, task control, monitoring, meta‑store integration, challenges, and future plans such as high availability, real‑time ML, and unified batch‑stream processing.

Apache Flink · Big Data · Real‑Time Computing

0 likes · 14 min read

Suning Technology

Apr 22, 2020 · Big Data

How Suning Turns Data into a New Production Factor to Revolutionize Retail

The recent Chinese policy elevating data to a production factor is illustrated by Suning’s data‑driven retail model, which uses massive user‑tag databases to create precise customer profiles, lower acquisition costs, and boost marketing efficiency, showcasing the strategic importance of big‑data assets in modern commerce.

Big Data · Digital Transformation · customer profiling

0 likes · 5 min read

Big Data Technology & Architecture

Apr 20, 2020 · Big Data

Using Window Functions in Spark SQL: Aggregation, Ranking, and Partitioning

This article introduces Spark SQL window functions, explains the difference between aggregation and window functions, and demonstrates how to use various ranking functions such as ROW_NUMBER, RANK, DENSE_RANK, and NTILE with practical Scala code examples and partitioning options.

Big Data · Scala · Spark

0 likes · 9 min read

Big Data Technology & Architecture

Apr 20, 2020 · Big Data

How Spark SQL Chooses Join Strategies: Broadcast, Shuffle Hash, and Sort Merge

The article explains Spark SQL's Catalyst optimizer rules for selecting among Broadcast hash join, Shuffle hash join, and Sort‑merge join, covering build‑side determination, size thresholds, broadcast hints, local hash‑map construction, and fallback strategies for non‑equi joins.

Big Data · Broadcast Join · Shuffle Hash Join

0 likes · 10 min read

Big Data Technology & Architecture

Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · Kafka · RateController

0 likes · 12 min read

Python Programming Learning Circle

Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark

0 likes · 5 min read

HomeTech

Apr 16, 2020 · Big Data

Home (ZhiJia) Distributed Task Scheduling System Overview

The article presents a comprehensive overview of the Home (ZhiJia) distributed task scheduling system, detailing its background, advantages, technology stack, architecture, core concepts, module responsibilities, IDE integration, and future improvement plans for big‑data processing workflows.

Big Data · Distributed Scheduling · Master‑Slave

0 likes · 10 min read

dbaplus Community

Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS

0 likes · 19 min read

DataFunTalk

Apr 15, 2020 · Big Data

Apache Flink OLAP Engine: Architecture, Optimizations, and Use Cases

This article presents an in‑depth overview of Apache Flink's new OLAP engine, covering OLAP fundamentals, the three OLAP models, Flink's unified streaming‑batch‑OLAP architecture, performance optimizations, benchmark results, and future development directions.

Apache Flink · Big Data · OLAP

0 likes · 11 min read

Big Data Technology & Architecture

Apr 15, 2020 · Big Data

Understanding HDFS SecondaryNameNode and the Checkpoint Process

This article explains the role of HDFS SecondaryNameNode, the structure of fsimage and edits files, how checkpointing works—including configuration parameters and steps—and how the process changes when NameNode high availability is enabled.

Big Data · Checkpoint · Filesystem

0 likes · 6 min read

Big Data Technology Architecture

Apr 15, 2020 · Big Data

Real-Time Data Warehouse Practices: Case Studies from Meituan, NetEase, Zhihu, and OPPO

This article reviews the evolution of data warehouses from traditional offline models to modern real‑time architectures, presenting detailed case studies of Meituan, NetEase, Zhihu, and OPPO, and discusses layer designs, technology choices such as Flink, Kafka, and storage options, and key lessons for building scalable real‑time warehouses.

Big Data · Flink · Kafka

0 likes · 13 min read

Ops Development Stories

Apr 13, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring ELK Stack on CentOS 7

This comprehensive tutorial walks you through installing Java, Elasticsearch, Logstash, Kibana, and related tools on two CentOS 7 servers, configuring cluster settings, verifying health, and visualizing logs with Kibana, complete with command‑line examples and troubleshooting tips.

Big Data · CentOS · ELK

0 likes · 17 min read

Programmer DD

Apr 12, 2020 · Big Data

Master Elasticsearch: From Basics to SpringBoot Integration and Advanced Queries

This comprehensive guide introduces Elasticsearch fundamentals, its features and use cases, then walks through integrating it with SpringBoot, configuring Maven dependencies, performing index and document operations, and demonstrates a variety of query types and aggregations using both RESTful APIs and Java code examples.

Big Data · Elasticsearch · Full‑Text Search

0 likes · 46 min read

Amap Tech

Apr 10, 2020 · Backend Development

Platformization of POI Deep Information Integration at Amap: Design and Implementation

Amap transformed its fragmented POI deep‑information pipelines into a unified platform that automates data acquisition, parsing, dimension alignment, specification mapping, and lifecycle management across billions of records, enabling product managers to integrate, debug, and scale diverse content‑provider feeds with real‑time, end‑to‑end control.

Backend · Big Data · Conversion Engine

0 likes · 13 min read

DataFunTalk

Apr 9, 2020 · Big Data

Scaling and Optimizing 58.com’s Hadoop‑Based Offline Computing Platform: Architecture, Challenges, and Solutions

This article details how 58.com built a massive Hadoop‑based offline computing platform with over 4,000 servers and hundreds of petabytes of storage, addressing scaling, stability, GC, YARN scheduling, SparkSQL migration, storage operations, and a large‑scale cross‑datacenter migration.

Big Data · Data Migration · Hadoop

0 likes · 24 min read

Meituan Technology Team

Apr 9, 2020 · Big Data

Dual-Engine MOLAP + ROLAP Architecture with Apache Doris for Meituan Takeaway Data Warehouse

Meituan Takeaway’s data warehouse combines Apache Kylin’s MOLAP cubes for stable dimensions with Apache Doris’s MPP‑driven ROLAP engine to handle changing dimensions, detail queries, and near‑real‑time analytics, achieving millisecond‑level responses, reduced storage/compute costs, and simplifying operations across diverse analytical workloads.

Apache Doris · Big Data · MOLAP

0 likes · 18 min read

Big Data Technology & Architecture

Apr 9, 2020 · Big Data

Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data · Hadoop · MapReduce

0 likes · 4 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Common Apache Flink Exceptions and How to Resolve Them

This article enumerates typical Apache Flink deployment, job, and checkpoint errors—such as JDK version issues, resource shortages, task manager timeouts, and state migration problems—and provides practical troubleshooting steps and configuration tips to help engineers quickly diagnose and fix these failures.

Big Data · Checkpoint · Exception

0 likes · 8 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Spark Job Execution Principles and Parameter Tuning for Hive on Spark

This article explains how Spark jobs run on YARN, describes the impact of stages, shuffle and task parallelism, and provides detailed recommendations for tuning Spark executor, memory, core, and parallelism settings to dramatically improve Hive‑on‑Spark TPCx‑BB benchmark performance on large datasets.

Big Data · Parameter Tuning · Spark

0 likes · 12 min read

ITPUB

Apr 6, 2020 · Big Data

How to Build a Data Lake Quickly: Strategies, Tools, and Real‑World Cases

This article explains the origins and market growth of data lakes, compares them with traditional data warehouses, showcases major implementations like Amazon Galaxy and Club Factory, and provides practical guidance on choosing open‑source or commercial cloud solutions to construct a data lake efficiently while minimizing risk.

AWS · Big Data · Cloud Computing

0 likes · 10 min read

Big Data Technology & Architecture

Apr 2, 2020 · Big Data

Hive SQL Table Creation, Data Loading, and Query Examples for Student, Course, Teacher, and Score Datasets

This article demonstrates how to create Hive tables for student, course, teacher, and score data, generate CSV files, load them into Hive, and provides a comprehensive set of Hive SQL queries covering data retrieval, aggregation, ranking, and statistical analysis for educational datasets.

Big Data · Query Examples · data-warehouse

0 likes · 21 min read

Big Data Technology & Architecture

Apr 1, 2020 · Big Data

HBase Cluster Deployment Architecture, Configuration Optimization, and Application Layer Usage

This article details the evolution of HBase cluster deployment from mixed‑hardware/software setups to fully independent clusters, explains hardware and software considerations, presents memory and region planning, outlines key configuration parameters, and provides Spark integration examples for batch and real‑time queries and writes.

Big Data · Cluster Deployment · Configuration Optimization

0 likes · 24 min read

Big Data Technology & Architecture

Mar 31, 2020 · Big Data

Comprehensive Spark Optimization Guide: Development, Resource, Skew, Shuffle, and Additional Tips

This article presents a detailed summary of Meituan's Spark optimization techniques, covering development‑level RDD tuning, resource parameter configuration, data‑skew mitigation, shuffle improvements, and the advantages of using DataFrame/Dataset APIs for better performance.

Big Data · Performance Tuning · Shuffle

0 likes · 12 min read

Xianyu Technology

Mar 31, 2020 · Backend Development

Hermes Push System: Architecture and Design Overview

The Hermes Push System at Xianyu separates push decisions into three coordinated services: a Configuration Center for audience and material data, a Task Center for timing and orchestration, and a Matching Center for real‑time content ranking. Built on MySQL, ODPS, Flink, SchedulerX, MetaQ and Alibaba’s TPP/IGraph, it boosted click‑through rates, doubled user coverage, and reached record daily active users; planned additions include open‑page notifications and deeper AI personalization.

Alibaba · Backend · Big Data

0 likes · 12 min read

Big Data Technology & Architecture

Mar 30, 2020 · Databases

HBase Optimization: JVM Tuning, Region Split Policies, BlockCache, and Compaction Strategies

This guide explains how to optimize HBase performance by adjusting JVM memory settings, selecting appropriate garbage collectors, configuring MSLAB and in‑memory compaction, choosing region split policies, tuning BlockCache implementations, and applying suitable compaction policies for different workloads.

Big Data · BlockCache · HBase

0 likes · 18 min read
