Tagged articles
3672 articles
ITPUB
Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data, Data Governance, Data Warehouse
13 min read
Architect
Jul 14, 2016 · Big Data

Understanding Custom Stream IDs and Topology Building in Apache Storm

This article explains how to construct Apache Storm topologies with custom stream IDs, demonstrates the classic WordCountTopology example, and provides detailed Java code snippets illustrating spout and bolt configurations, stream declarations, and grouping strategies for real‑time stream processing.

Apache Storm, Big Data, Custom Stream ID
8 min read
Baidu Intelligent Testing
Jul 13, 2016 · Artificial Intelligence

Detecting Offline Merchant Service Issues Using Machine Learning and Big Data at Nuomi

The article describes how Nuomi analyzes refund and complaint data with machine‑learning and big‑data techniques, extracts features for single‑ and multi‑store scenarios, builds decision‑tree models with regional adjustments, and creates an online workflow to promptly intervene on merchants that fail to serve customers.

Big Data, customer experience, decision tree
5 min read
Efficient Ops
Jul 11, 2016 · Operations

How Tencent's Intelligent Monitoring Transforms Ops Automation

Leveraging Tencent's extensive experience in social platform operations, this talk explores intelligent monitoring practices—covering active, passive, and side‑channel techniques, full‑link observability, data processing pipelines, and alert convergence—to enhance reliability, availability, and user experience while reducing noise for ops teams.

Alert Management, Automation, Big Data
22 min read
ITPUB
Jul 10, 2016 · Big Data

Can Spark Really Process Hundreds of Terabytes Interactively?

This article examines Apache Spark's interactive mode performance, revealing that while small datasets respond within seconds, processing beyond about 1 TB dramatically increases latency, and it discusses practical limits, hardware considerations, and the need to preload large datasets from disk.

Apache Spark, Big Data, Response Time
5 min read
Architecture Digest
Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins from Google’s pioneering papers, explains its architecture and ecosystem, evaluates its strengths such as scalability and benchmarks, discusses current limitations like single‑point failures and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data, Future, HDFS
14 min read
ITPUB
Jun 29, 2016 · Big Data

Why OLTP Falls Short for Big Data: OLAP, Hadoop & MPP Explained

The article explains how traditional OLTP systems cannot satisfy modern big‑data analytics needs and compares OLAP, Hadoop, and MPP architectures, highlighting their data processing models, scalability, cloud‑based managed services, and practical recommendations for building effective data warehouses.

Big Data, Cloud Services, Data Warehouse
21 min read
Qunar Tech Salon
Jun 24, 2016 · Backend Development

Overview of Alibaba's Open Source Projects

This article provides a comprehensive overview of Alibaba's numerous open‑source projects, ranging from high‑performance service frameworks and databases to messaging middleware, frontend tools, testing platforms, and infrastructure utilities, highlighting their key features and typical use cases.

Alibaba, Backend, Big Data
22 min read
Efficient Ops
Jun 19, 2016 · Operations

How Real‑Time Log Analysis Is Revolutionizing IT Operations

This article summarizes a 2016 Global Operations conference talk that explains the concept of IT Operations Analytics (ITOA), its four data sources, the evolution of log management from databases to real‑time search engines, and real‑world case studies demonstrating how fast, large‑scale log analysis improves monitoring, security, and business insight.

Big Data, IT Operations, log analysis
25 min read
21CTO
Jun 18, 2016 · Databases

Unlock Ultra‑High Compression with HiStore’s Knowledge‑Grid Columnar Database

HiStore, Alibaba’s columnar database built on a patented Knowledge‑Grid, delivers ultra‑high compression (over 10:1, up to 40:1), low‑cost storage, rapid query performance, linear scalability, and seamless MySQL compatibility, making it ideal for massive OLAP workloads and real‑time analytics across diverse industries.

Big Data, Columnar Database, OLAP
8 min read
21CTO
Jun 17, 2016 · Fundamentals

2016 Programmer Salary Survey: Who Earns the Most and Emerging Tech Trends

The 2016 programmer salary report reveals that front‑end, back‑end and mobile developers dominate the workforce, big‑data engineers command the highest pay, senior engineers see sharp salary jumps, and emerging technologies like Swift, WeChat, and Python shape future career choices.

Backend, Big Data, Mobile Development
8 min read
21CTO
Jun 15, 2016 · Big Data

Choosing the Right Data Ingestion Tool: Flume, Fluentd, Logstash, and More

This article reviews major data collection platforms—including Apache Flume, Fluentd, Logstash, Chukwa, Scribe, and Splunk Forwarder—explaining their architectures, strengths, and limitations to help engineers select the most reliable and scalable solution for big‑data pipelines.

Apache Flume, Big Data, Fluentd
10 min read

How BitMap Accelerates Active-Day Distribution Calculations in Big Data

BitMap, a space‑saving bit‑array structure, can replace costly I/O‑heavy Spark jobs for computing user active‑day distributions by converting joins and distinct operations into fast bitwise logic, enabling efficient 30‑day rolling metrics with minimal memory and superior performance, as demonstrated by real‑world benchmarks.
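
As a rough illustration of the bitwise idea this summary describes, the sketch below keeps each user's rolling 30-day activity as bits in a Python integer and derives the active-day distribution from bit counts. The function names, the sample data, and the use of plain integers as bitmaps are assumptions made for illustration; this is not the article's implementation.

```python
from collections import Counter
from typing import Dict

WINDOW = 30  # assumed rolling window, in days
MASK = (1 << WINDOW) - 1  # keeps only the lowest 30 bits


def roll(bitmap: int, active_today: bool) -> int:
    """Shift the window forward one day and record today's activity in bit 0."""
    return ((bitmap << 1) | int(active_today)) & MASK


def active_days(bitmap: int) -> int:
    """Count the set bits, i.e. the number of active days in the window."""
    return bin(bitmap).count("1")


def active_day_distribution(bitmaps: Dict[str, int]) -> Counter:
    """Map 'number of active days' to 'number of users with that count'."""
    return Counter(active_days(b) for b in bitmaps.values())


# Hypothetical sample: three users and three days of activity logs.
daily_logs = [{"u1", "u2"}, {"u1"}, {"u1", "u3"}]
bitmaps = {user: 0 for user in ("u1", "u2", "u3")}
for active_users in daily_logs:
    for user in bitmaps:
        bitmaps[user] = roll(bitmaps[user], user in active_users)

distribution = active_day_distribution(bitmaps)
```

Each daily update is a shift and an OR per user, so the 30-day metric never needs to re-read or join the previous days' logs.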

Active Days, Big Data, Spark
8 min read
ITPUB
Jun 11, 2016 · Big Data

How 58 Daojia Leverages User Portraits to Boost Operations and Fight Fraud

This article details 58 Daojia's data‑driven approach to building user‑portrait tags, covering tag construction, evaluation, and practical applications such as personalized recommendations, anti‑fraud measures, coupon distribution, and dynamic pricing, while outlining the underlying big‑data architecture and technical challenges.

Big Data, anti-fraud, data mining
18 min read
Architecture Digest
Jun 9, 2016 · Databases

Understanding HBase Architecture and Core Principles

This article provides a comprehensive overview of HBase, covering its distributed architecture, component roles, data organization, read/write mechanisms, and best practices for schema and region design to ensure efficient big‑data storage and retrieval.

Big Data, HBase, RegionServer
17 min read
Hulu Beijing
May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data, HDFS, Hadoop
5 min read
dbaplus Community
May 26, 2016 · Big Data

Mastering Apache Parquet: Columnar Storage, Nested Data, and Performance Gains

This article explains Apache Parquet’s columnar storage format, its support for nested data models, the underlying striping/assembly algorithm, file structure, push‑down optimizations, and performance advantages within the Hadoop ecosystem, providing a comprehensive guide for big‑data practitioners.

Apache Parquet, Big Data, Hadoop
22 min read
Architect
May 25, 2016 · Big Data

How Flink Manages Memory to Overcome JVM Limitations

The article explains how Flink tackles JVM memory challenges by using proactive memory management, a custom serialization framework, cache‑friendly binary operations, and off‑heap memory techniques to reduce GC pressure, avoid OOM, and improve performance in big‑data workloads.

Big Data, Flink, JVM
17 min read
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big Data, Data Skew, Shuffle Optimization
35 min read
58UXD
May 18, 2016 · Product Management

Why Companies Doubt User Research—and How to Make It Truly Valuable

This article examines why many enterprises view user research as ineffective, outlines the four biggest challenges—defining clear goals, cultivating insight, building capable teams, and adopting the right mindset—and offers practical strategies for making research results actionable, integrating them into product development, and evolving the role of user researchers.

Big Data, UX, agile
14 min read
Meituan Technology Team
May 13, 2016 · Big Data

Spark Performance Optimization Guide: Data Skew and Shuffle Tuning

This advanced Spark performance guide explains how data skew arises during shuffles and presents eight practical solutions—including Hive preprocessing, key filtering, increased shuffle parallelism, two‑stage aggregation, map joins, sampling, random prefixes, and combined strategies—while also detailing key shuffle‑tuning parameters such as spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, and spark.shuffle.manager to improve memory usage and execution speed.
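
To make the "random prefixes" and "two-stage aggregation" ideas in this summary more concrete, here is a minimal sketch in plain Python. It does not use Spark: the `reduce_by_key` helper is a local stand-in for a shuffle-based aggregation, and `num_salts`, the seed, and the sample records are illustrative assumptions rather than values from the guide.

```python
import random
from collections import defaultdict


def reduce_by_key(pairs):
    """Sum values per key; a local stand-in for a shuffle-based aggregation."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def two_stage_sum(pairs, num_salts=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: add a random prefix so a hot key is spread over up to
    # num_salts distinct partial keys, then aggregate those partial keys.
    salted = [((rng.randrange(num_salts), key), value) for key, value in pairs]
    partial = reduce_by_key(salted)
    # Stage 2: drop the prefix and aggregate the much smaller partial results.
    unsalted = [(key, value) for (_salt, key), value in partial]
    return dict(reduce_by_key(unsalted))


# Hypothetical skewed input: the key "hot" dominates the data set.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10
result = two_stage_sum(records)
```

In a real cluster the first stage would spread the hot key's records across several tasks instead of one; the second stage then only has to combine a handful of partial sums per key.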

Big Data, Data Skew, Performance Optimization
33 min read
Qunar Tech Salon
May 13, 2016 · Big Data

Overview and Architecture of Hadoop Distributed File System (HDFS)

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), detailing its design goals, architecture components such as NameNode, DataNode and SecondaryNameNode, data block handling, replication strategies, communication protocols, and the read, write, and delete processes.

Big Data, Distributed File System, HDFS
18 min read
Efficient Ops
May 12, 2016 · Operations

How Big Data Powers Precise IT Operations for Modern Enterprises

This article explains what big data is, outlines its four V characteristics, and describes how precise IT operations—aligning services with business needs—leverage big data analytics to improve service quality, predict user behavior, and enhance competitiveness for both traditional and internet enterprises.

Big Data, Digital Transformation, IT Operations
15 min read
Architect
May 11, 2016 · Big Data

Comprehensive Guide to Hadoop MapReduce Job Execution, Scheduling, and Optimization

This article provides an in‑depth explanation of Hadoop MapReduce architecture, covering the roles of JobClient, JobTracker, TaskTracker and HDFS, the complete job lifecycle from submission to completion, scheduling strategies, shuffle and sort mechanisms, fault tolerance, and performance tuning techniques.

Big Data, Hadoop, JobTracker
20 min read
Architecture Digest
May 7, 2016 · Fundamentals

Overview of Alibaba Open‑Source Projects and Tools

This article provides a comprehensive overview of numerous Alibaba open‑source projects, ranging from service frameworks like Dubbo and database tools such as Druid and OceanBase to front‑end libraries, distributed systems, testing platforms, and cloud utilities, each briefly described with links for further reference.

Alibaba, Big Data, Java
27 min read
Architect
May 6, 2016 · Big Data

Integrating Kylin, Mondrian, and Saiku to Build an OLAP Analysis Tool

This article describes how the Youzan data team combined Apache Kylin, Mondrian, and Saiku into a three‑layer OLAP system, covering background, component overviews, technical architecture, schema integration challenges, count‑distinct handling, Kylin‑specific SQL quirks, and practical solutions.

Big Data, HBase, Hive
12 min read
Baidu Intelligent Testing
May 4, 2016 · Big Data

Understanding Big Data: The Importance of Data Breadth and User Profiling for Precise Marketing and Product Optimization

The article explains the core concepts of big data, emphasizing data breadth across product lines, illustrates how comprehensive user profiling can drive personalized marketing and product improvements, and provides practical examples of cross‑product data analysis in e‑commerce, finance, travel, and gaming contexts.

Big Data, cross‑product analysis, data breadth
5 min read
Meituan Technology Team
Apr 29, 2016 · Big Data

Introduction to Spark in Big Data

Apache Spark, a versatile big‑data platform supporting batch processing, SQL queries, real‑time streaming, and machine‑learning workloads, dramatically accelerates data‑intensive jobs, as demonstrated by Meituan‑Dianping, where its high‑performance engine reduces execution times and enhances scalability across diverse analytical and operational pipelines.

Batch Processing, Big Data, Spark
1 min read
Architecture Digest
Apr 25, 2016 · Big Data

Curated Learning Resources for Spark and Scala Beginners

This article compiles a comprehensive list of tutorials, books, online courses, and tools to help beginners get started with Apache Spark and the Scala programming language, including setup instructions, code snippets, and links to free and paid learning materials.

Big Data, Learning Resources, Scala
7 min read
ITPUB
Apr 24, 2016 · Big Data

12 Essential Hive Performance Tips for Faster Hadoop Queries

This guide presents twelve practical Hive tuning techniques—including avoiding MapReduce, limiting string concatenation, steering clear of subqueries, choosing the right file formats, managing vectorization, sizing containers, enabling statistics, and optimizing joins—to dramatically improve query speed on Hadoop.

Big Data, Hadoop, Hive
7 min read
Big Data and Microservices
Apr 21, 2016 · Information Security

How Can Banks Secure Big Data? Key Strategies for Protecting Customer Information

In the era of big data, banks face unprecedented information security challenges due to massive, valuable, and highly damaging data breaches, and must adopt encryption, flexible access control, rigorous auditing, DLP solutions, strict data management, and robust outsourcing controls to safeguard customer information.

Banking, Big Data, DLP
10 min read
Big Data and Microservices
Apr 19, 2016 · Industry Insights

Designing a Scalable Real‑Time Stock Prediction Architecture with Open‑Source Tools

This article outlines a reference architecture for a low‑latency, horizontally scalable real‑time stock prediction system built with open‑source components such as Spring Cloud Data Flow, Apache Geode, Spark MLlib, and Hadoop, and discusses data flow steps, simplified deployment, and algorithm choices for market forecasting.

Big Data, Real-Time, Stock Prediction
7 min read
Java High-Performance Architecture
Apr 18, 2016 · Big Data

Why Spark Is Outpacing Hadoop: Speed, Real‑Time Processing, and ML Advantages

The article explains how Spark has become the leading open‑source big‑data platform, highlighting its superior speed, in‑memory processing, real‑time streaming, and built‑in machine‑learning library compared with Hadoop’s slower, disk‑based MapReduce approach and reliance on external storage and ML tools.

Big Data, Hadoop, Real-time Processing
5 min read
Efficient Ops
Apr 17, 2016 · Operations

How CIOs Can Navigate Massive Technological and Industry Shifts

In this speech, former Chinese Ministry of Industry and Information Technology deputy minister Yang Xueshan outlines six strategic principles for CIOs—understanding major technological and industry trends, focusing on internal data, embracing fusion, connectivity, platforms, CPS, and intelligence, and taking practical, grounded actions to stay relevant.

Big Data, CIO, Digital Transformation
18 min read
21CTO
Apr 16, 2016 · Databases

Optimizing HBase Log Queries: Index Design and RowKey Strategies

This article examines the challenges of storing and querying log data in HBase, outlines the drawbacks of custom indexing, and presents practical rowKey design, filter usage, and integration with external search engines to improve query performance.

Big Data, HBase, NoSQL
15 min read
21CTO
Apr 14, 2016 · Big Data

How Meituan’s Data Architecture Powers Precise Mobile Marketing

This article details Meituan Dianping's data‑driven approach to precise marketing, describing the O2O marketing framework, a layered pyramid data system, profiling techniques, budget monitoring, and two real‑world case studies that together illustrate how big‑data technologies boost marketing efficiency on mobile platforms.

Big Data, Data Architecture, machine learning
12 min read
Efficient Ops
Apr 14, 2016 · Big Data

Why Big Data May Not Be the Gold Mine You Expect: Insights and Pitfalls

The article examines what big data really means, its core 4 V characteristics, current limitations in China, the overhyped value of data, the importance of business‑driven applications, and why starting from small, relevant data is essential for true predictive power.

Big Data, Business Intelligence, Data Value
13 min read
Architect
Apr 10, 2016 · Big Data

Introduction to Flume NG: Architecture, Components, Configuration, and Best Practices

This article provides a comprehensive overview of Flume NG, covering its architecture, core components (source, channel, sink), reliability mechanisms, common deployment scenarios, installation steps, configuration examples, compilation instructions, and practical best‑practice recommendations for building robust log‑collection pipelines.

Apache, Big Data, Configuration
16 min read
Architecture Digest
Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data, Data Platform, ETL
19 min read
Big Data and Microservices
Apr 7, 2016 · Big Data

Turning Big Data into Actionable Security Visualizations: Process & Real‑World Cases

This article explains how to transform massive security‑related big data into clear visual insights, covering storytelling, data processing, visual encoding, design workflow, and two real‑world case studies that illustrate vulnerability mapping and internal traffic analysis for improved threat awareness.

Big Data, Data visualization, design process
10 min read
dbaplus Community
Apr 6, 2016 · Fundamentals

Essential Open‑Source Technologies Every Engineer Should Know

This article provides a comprehensive, curated overview of the most influential open‑source software across the full technology stack—including operating systems, web servers, programming languages, frameworks, databases, big‑data tools, and development utilities—offering practical insights for engineers seeking to understand and adopt proven solutions.

Big Data, databases, open source
24 min read
21CTO
Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data, ETL, Hadoop
15 min read
dbaplus Community
Apr 3, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Beyond

Facing rapid growth, Asana overhauled its data infrastructure—from a single‑machine MySQL setup to a Redshift‑backed warehouse, Hadoop‑based log processing, Luigi orchestration, and self‑service BI tools—highlighting the challenges, solutions, and future plans for scalable, reliable analytics.

Big Data, Business Intelligence, ETL
16 min read
Architect
Apr 3, 2016 · Big Data

Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.

Apache Flume, Big Data, Configuration
12 min read
21CTO
Mar 31, 2016 · Big Data

Inside Airbnb’s Massive Big Data Platform: Architecture, Lessons & Scaling Secrets

Airbnb’s engineering team outlines the evolution of its big‑data platform, detailing the philosophy behind its architecture, the dual “gold” and “silver” Hive clusters, migration to Mesos, use of Presto, Airpal, Airflow, and the performance and cost gains achieved through these design choices.

Airbnb, Airflow, Big Data
11 min read
Big Data and Microservices
Mar 30, 2016 · Industry Insights

How Text Mining is Transforming the Securities Industry: Trends and Challenges

This article examines the rapid growth of structured and unstructured data in the securities sector, outlines text mining fundamentals, explores key algorithms and tools, and analyzes current industry services, investment communities, and professional solutions while highlighting existing challenges and future opportunities.

Big Data, Sentiment Analysis, industry insight
32 min read
Architect
Mar 29, 2016 · Big Data

Understanding Apache Storm Architecture, Stream Groupings, and the Acker Mechanism

This article provides a comprehensive overview of Apache Storm’s architecture, including the roles of Nimbus, Supervisor, and ZooKeeper, explains various stream groupings, details the Acker mechanism, and describes task execution, parallelism calculation, and internal data flow within the Storm cluster.

Apache Storm, Big Data, Real-time analytics
19 min read
Architecture Digest
Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data, Hadoop, Kafka
11 min read
Big Data and Microservices
Mar 23, 2016 · Industry Insights

Inside the Securities Tech Revolution: Cloud, Microservices, and Big Data

The article examines the paradox of the Chinese securities industry—high demand for cutting‑edge trading, quantitative and high‑frequency systems versus outdated IT—while detailing the team’s FinTech startup approach, their Node.js/Docker/MongoDB stack, a cloud‑native trading platform, microservice architecture, big‑data pipelines, performance tuning, and DevOps practices.

Big Data, DevOps, FinTech
21 min read
ITPUB
Mar 19, 2016 · Big Data

Inside HDFS: How NameNode and DataNode Manage Big Data Writes and Reads

This article explains the fundamentals of distributed file systems, focusing on Hadoop’s HDFS architecture, the separation of metadata and data via NameNode and DataNode, and detailed step‑by‑step write and read processes, including replication, fault recovery, and block splitting across nodes.

Big Data, DataNode, Distributed File System
8 min read
21CTO
Mar 16, 2016 · Big Data

Inside Uber’s Tech: How Data, AI, and Cloud Power Ride‑Sharing in China

Uber’s CTO Thuan Pham revealed at a Chinese tech salon how the company’s global architecture, data‑center strategy, cloud partnership with Baidu, anti‑fraud machine‑learning models, map localization and big‑data analytics together enable a unified yet locally adapted ride‑sharing platform across China and the world.

Big Data, Technology Architecture, Uber
17 min read
Architect
Mar 10, 2016 · Big Data

Analysis and Practice of a Real-Time Hadoop Data Security Solution

The article presents a detailed technical overview of Apache Eagle's real-time Hadoop data security architecture, covering distributed data collection, stream processing, metadata‑driven policy enforcement, machine‑learning‑based anomaly detection, and integration with Hadoop ecosystem components such as HBase, Kafka, and Storm.

Apache Eagle, Big Data, Hadoop
25 min read
Architect
Mar 8, 2016 · Big Data

In‑Depth Analysis of Apache Kafka: Architecture, Core Concepts, and Benchmark

This article provides a comprehensive technical overview of Apache Kafka, covering its architecture, core concepts, design goals, comparison with other message queues, replication, consumer groups, delivery guarantees, and performance benchmarking, making it a valuable resource for big‑data engineers.

Big Data, Kafka, Replication
30 min read
Architect
Mar 6, 2016 · Big Data

Clustering Geolocated User Events with DBSCAN and Spark

This article explains how to apply the DBSCAN clustering algorithm to geolocated user event data and leverage Apache Spark’s distributed processing with PairRDDs to efficiently identify frequent user regions, detect outliers, and build location‑based services such as personalized recommendations and security alerts.

Big Data, DBSCAN, Spark
8 min read
ITPUB
Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data, Cluster Management, Hadoop
7 min read
21CTO
Feb 23, 2016 · Big Data

Why Kafka Dominates Modern Data Pipelines: Architecture, Benefits, and Guarantees

Kafka, the open‑source distributed messaging system from LinkedIn, offers O(1) persistence, high throughput, partitioned topics, and flexible delivery guarantees, making it a cornerstone for modern big‑data pipelines and real‑time processing alongside Hadoop, Spark, and Storm.

Big Data, Consumer, Delivery Guarantees
21 min read
Architecture Digest
Feb 22, 2016 · Big Data

Building High‑Performance Big Data Analytics Systems: Techniques and Best Practices

An in‑depth guide outlines technology‑agnostic best‑practice techniques for building high‑performance big data analytics systems, covering data acquisition, storage, processing, visualization, and security, and explains how to address the five V’s of big data to meet demanding operational and performance requirements.

Analytics · Big Data · data engineering
0 likes · 20 min read
ITPUB
Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing, and considers its future outlook.

Big Data · Doug Cutting · Hadoop
0 likes · 15 min read
21CTO
Feb 14, 2016 · Big Data

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.

Big Data · MapReduce · PageRank
0 likes · 15 min read
21CTO
Feb 1, 2016 · Big Data

How Solr Supercharges Real‑Time Queries in Big Data Environments

This article examines a real‑world case from Alibaba’s Taobao Jushita platform, showing how traditional SQL queries struggle with multi‑dimensional, high‑volume data and how integrating Solr’s inverted‑index search engine—combined with Hive‑generated wide tables and custom QParser plugins—delivers millisecond‑level, scalable query performance for buyer analytics.

Big Data · Hive · Real-time Query
0 likes · 11 min read
21CTO
Jan 25, 2016 · Big Data

How Alibaba’s Pora Powers Real‑Time Personalization at Massive Scale

Pora (Personal Offline Realtime Analyze) is a high‑throughput, low‑latency platform that captures user behavior in real time, enabling Alibaba’s search engine to deliver personalized results, support online learning, and run 24/7 with massive data volumes.

Alibaba · Big Data · Pora
0 likes · 6 min read
21CTO
Jan 23, 2016 · Big Data

How Massive Is the Data Behind the World’s Biggest Porn Sites?

The article analyzes the staggering traffic, storage needs, and infrastructure of major adult video platforms, revealing that sites like Xvideos and YouPorn handle tens of petabytes of data monthly, requiring bandwidth and hardware comparable to leading streaming services.

Big Data · cloud storage · pornography
0 likes · 8 min read
Qunar Tech Salon
Jan 20, 2016 · Cloud Computing

Technical Architecture of Alipay and Ant Huabei for Large-Scale Promotional Events

The article explains how Alipay's multi-layered cloud architecture, logical data center design, distributed data strategies, and flexible transaction framework enable high availability, horizontal scalability, and rapid deployment for massive promotional traffic such as Double‑11, illustrated with the Ant Huabei case study.

Alipay · Big Data · Distributed Systems
0 likes · 21 min read
ITPUB
Jan 20, 2016 · Big Data

How Meizu Built an Agile Big Data Platform for Millions of Users

The Meizu Tech Open Day showcased the company's rapid evolution to a data‑driven mobile internet firm, detailing its DW1.0 and DW2.0 data‑warehouse architectures, recommendation pipelines, Spark adoption, and ELK‑based log analytics, while sharing practical lessons and future challenges.

Big Data · Data Architecture · Data Warehouse
0 likes · 11 min read
Qunar Tech Salon
Jan 11, 2016 · Big Data

Architecture of Taobao's Massive Data Products: From Data Sources to the Glider Middleware

The article details Taobao's massive data product architecture, describing a five‑layer system that processes billions of daily records using Hadoop, real‑time streams, distributed MySQL and HBase clusters, and a middleware layer called Glider that unifies queries, caching, and front‑end integration.

Big Data · Data Architecture · Distributed Systems
0 likes · 16 min read
Baidu Maps Tech Team
Jan 6, 2016 · Big Data

How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform using Apache Kylin, detailing the challenges of multi‑dimensional analysis on billions of rows, the architecture, custom extensions for task, resource, and monitoring management, and performance optimizations that achieve millisecond‑level SQL responses.

Apache Kylin · Big Data · Data Warehouse
0 likes · 21 min read
Efficient Ops
Jan 5, 2016 · Information Security

How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection

Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop that combines stream‑processing, scalable policy enforcement, and machine‑learning user profiling to protect massive data assets across eBay’s production clusters.

Apache Eagle · Big Data · Hadoop
0 likes · 19 min read
Architect
Jan 5, 2016 · Big Data

Apache Eagle: eBay’s Open‑Source Real‑Time Hadoop Data Security Platform

The article provides a comprehensive technical overview of Apache Eagle, an open‑source, distributed, real‑time security monitoring and alerting platform for Hadoop developed by eBay, covering its motivation, architecture, core components, machine‑learning based detection, typical use cases, and future development directions.

Apache Eagle · Big Data · Hadoop
0 likes · 15 min read
21CTO
Jan 3, 2016 · Artificial Intelligence

How to Build a Real-Time Stock Prediction System with Open-Source AI and Big Data Tools

The article presents an open-source reference architecture for real-time stock prediction: a scalable, low-latency pipeline that captures live market data, stores it in memory, and trains and applies machine learning models using Spring Cloud Data Flow, Apache Geode, Spark MLlib, and related big‑data components.

Big Data · Spark MLlib · Spring Cloud Data Flow
0 likes · 8 min read
Architect
Dec 31, 2015 · Big Data

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.

Big Data · Intelligent QA · New Word Discovery
0 likes · 15 min read
Architect
Dec 30, 2015 · Big Data

Real-Time Big Data Processing with Storm and Kafka on Alibaba Cloud

This article explains how to build a large‑scale, real‑time vehicle monitoring system using Apache Storm and Kafka on Alibaba Cloud, covering the challenges of big‑data ingestion, system architecture, deployment steps, performance testing, and practical lessons learned.

Alibaba Cloud · Big Data · Kafka
0 likes · 12 min read
ITPUB
Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Hive · SparkSQL
0 likes · 13 min read
21CTO
Dec 22, 2015 · Big Data

How to Build a Scalable Distributed Web Crawler for Massive Data Harvesting

This article explains how to design and implement a distributed web‑crawling framework in Java that can collect, structure, and store massive amounts of data while handling anti‑scraping measures, duplicate detection, and real‑time monitoring.

Big Data · Data Extraction · Java
0 likes · 11 min read
21CTO
Dec 21, 2015 · Information Security

Why Open Source Is Becoming the Top Choice for Enterprise Security and Innovation

Over the past decade, open‑source software has surged in the enterprise sector, driven by startups and venture capital, with surveys showing widespread adoption, increased contributions, and strong security advantages that are reshaping IT architecture, cloud, and big‑data strategies.

Big Data · Enterprise Software · cloud computing
0 likes · 4 min read
21CTO
Dec 7, 2015 · Information Security

How Tencent Combats Fraudsters with Big Data and AI‑Powered Risk Engines

This article explains how Tencent uses big‑data collection, user profiling, and AI‑driven risk learning engines to detect and block malicious accounts, proxy IPs, and fraudulent activities across e‑commerce and other platforms, detailing the architecture, algorithms, and practical defenses employed.

Big Data · anti-fraud · fraud detection
0 likes · 14 min read
Architects Research Society
Dec 3, 2015 · Artificial Intelligence

IBM Donates SystemML to Apache Incubator, Joining the Open‑Source Machine Learning Wave

IBM announced that its SystemML machine‑learning platform will become an Apache Incubator project, highlighting a broader industry trend where tech giants like Google and Facebook open‑source their AI tools to accelerate data‑driven innovation and expand enterprise‑focused machine‑learning ecosystems.

Apache SystemML · Big Data · IBM
0 likes · 5 min read
ITPUB
Dec 3, 2015 · Databases

Choosing the Right Time‑Series Database: Types, Queries, and Performance Trade‑offs

Time‑series data, defined by a timestamp field, appears everywhere, and the article explains how to choose an appropriate time‑series database by comparing two schema models, their query patterns, performance trade‑offs, and why modern solutions like Elasticsearch, columnar stores, and Druid excel at real‑time massive aggregation.

Big Data · Elasticsearch · SQL
0 likes · 9 min read