Tagged articles

Big Data

3720 articles · Page 37 of 38

Apr 6, 2016 · Fundamentals

Essential Open‑Source Technologies Every Engineer Should Know

This article provides a comprehensive, curated overview of the most influential open‑source software across the full technology stack—including operating systems, web servers, programming languages, frameworks, databases, big‑data tools, and development utilities—offering practical insights for engineers seeking to understand and adopt proven solutions.

Big Data · Databases · open source

0 likes · 24 min read

21CTO

Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data · Data Infrastructure · ETL

0 likes · 15 min read

dbaplus Community

Apr 3, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Beyond

Facing rapid growth, Asana overhauled its data infrastructure—from a single‑machine MySQL setup to a Redshift‑backed warehouse, Hadoop‑based log processing, Luigi orchestration, and self‑service BI tools—highlighting the challenges, solutions, and future plans for scalable, reliable analytics.

Big Data · Business Intelligence · Data Infrastructure

0 likes · 16 min read

Architect

Apr 3, 2016 · Big Data

Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.

Apache Flume · Big Data · Configuration

0 likes · 12 min read

21CTO

Mar 31, 2016 · Big Data

Inside Airbnb’s Massive Big Data Platform: Architecture, Lessons & Scaling Secrets

Airbnb’s engineering team outlines the evolution of its big‑data platform, detailing the philosophy behind its architecture, the dual “gold” and “silver” Hive clusters, migration to Mesos, use of Presto, Airpal, Airflow, and the performance and cost gains achieved through these design choices.

Airbnb · Airflow · Big Data

0 likes · 11 min read

Art of Distributed System Architecture Design

Mar 31, 2016 · Big Data

Airbnb’s Big Data Platform Architecture: Design, Evolution, and Lessons Learned

Airbnb’s engineering team outlines the evolution and design of its massive big‑data platform—detailing the dual “gold” and “silver” Hive clusters, use of Kafka, Presto, Airflow, Mesos, and Spark, along with performance gains, cost reductions, and open‑source contributions.

Airbnb · Airflow · Big Data

0 likes · 13 min read

Big Data and Microservices

Mar 30, 2016 · Industry Insights

How Text Mining is Transforming the Securities Industry: Trends and Challenges

This article examines the rapid growth of structured and unstructured data in the securities sector, outlines text mining fundamentals, explores key algorithms and tools, and analyzes current industry services, investment communities, and professional solutions while highlighting existing challenges and future opportunities.

Big Data · Industry insight · Sentiment Analysis

0 likes · 32 min read

Architect

Mar 29, 2016 · Big Data

Understanding Apache Storm Architecture, Stream Groupings, and the Acker Mechanism

This article provides a comprehensive overview of Apache Storm’s architecture, including the roles of Nimbus, Supervisor, and ZooKeeper, explains various stream groupings, details the Acker mechanism, and describes task execution, parallelism calculation, and internal data flow within the Storm cluster.

Apache Storm · Big Data · real-time analytics

0 likes · 19 min read

Architecture Digest

Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · NoSQL

0 likes · 11 min read

Big Data and Microservices

Mar 23, 2016 · Industry Insights

Inside the Securities Tech Revolution: Cloud, Microservices, and Big Data

The article examines the paradox of the Chinese securities industry—high demand for cutting‑edge trading, quantitative and high‑frequency systems versus outdated IT—while detailing the team’s FinTech startup approach, their Node.js/Docker/MongoDB stack, a cloud‑native trading platform, microservice architecture, big‑data pipelines, performance tuning, and DevOps practices.

Big Data · Cloud Computing · FinTech

0 likes · 21 min read

ITPUB

Mar 19, 2016 · Big Data

Inside HDFS: How NameNode and DataNode Manage Big Data Writes and Reads

This article explains the fundamentals of distributed file systems, focusing on Hadoop’s HDFS architecture, the separation of metadata and data via NameNode and DataNode, and detailed step‑by‑step write and read processes, including replication, fault recovery, and block splitting across nodes.

Big Data · DataNode · Distributed File System

0 likes · 8 min read

21CTO

Mar 16, 2016 · Big Data

Inside Uber’s Tech: How Data, AI, and Cloud Power Ride‑Sharing in China

Uber’s CTO Thuan Pham revealed at a Chinese tech salon how the company’s global architecture, data‑center strategy, cloud partnership with Baidu, anti‑fraud machine‑learning models, map localization and big‑data analytics together enable a unified yet locally adapted ride‑sharing platform across China and the world.

Big Data · Cloud Computing · Localization

0 likes · 17 min read

Architect

Mar 10, 2016 · Big Data

Analysis and Practice of a Real-Time Hadoop Data Security Solution

The article presents a detailed technical overview of Apache Eagle's real-time Hadoop data security architecture, covering distributed data collection, stream processing, metadata‑driven policy enforcement, machine‑learning‑based anomaly detection, and integration with Hadoop ecosystem components such as HBase, Kafka, and Storm.

Apache Eagle · Big Data · Data Security

0 likes · 25 min read

Architect

Mar 8, 2016 · Big Data

Kafka Benchmark: Producer and Consumer Throughput, Replication, Message Size, and Latency Analysis

This article presents a comprehensive Kafka benchmark using six machines to evaluate producer and consumer throughput, replication effects, message size impact, and end‑to‑end latency, providing detailed results, analysis, and reproducible test commands.

Big Data · Latency · Throughput

0 likes · 12 min read

Architect

Mar 8, 2016 · Big Data

In‑Depth Analysis of Apache Kafka: Architecture, Core Concepts, and Benchmark

This article provides a comprehensive technical overview of Apache Kafka, covering its architecture, core concepts, design goals, comparison with other message queues, replication, consumer groups, delivery guarantees, and performance benchmarking, making it a valuable resource for big‑data engineers.

Big Data · Streaming · kafka

0 likes · 30 min read

Architect

Mar 6, 2016 · Big Data

Clustering Geolocated User Events with DBSCAN and Spark

This article explains how to apply the DBSCAN clustering algorithm to geolocated user event data and leverage Apache Spark’s distributed processing with PairRDDs to efficiently identify frequent user regions, detect outliers, and build location‑based services such as personalized recommendations and security alerts.

Big Data · Clustering · DBSCAN

0 likes · 8 min read

ITPUB

Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Hadoop · Pepperdata

0 likes · 7 min read

Architecture Digest

Feb 23, 2016 · Databases

Highlights from SDCC 2015 Database Practice Forum: Distributed Database Technologies and Real-World Implementations

The article reviews eight expert presentations from the 2015 SDCC Database Practice Forum, covering distributed database architectures, performance tuning, high‑availability solutions, and practical case studies from leading Chinese internet companies.

Big Data · High Availability · NoSQL

0 likes · 9 min read

21CTO

Feb 23, 2016 · Big Data

Why Kafka Dominates Modern Data Pipelines: Architecture, Benefits, and Guarantees

Kafka, the open‑source distributed messaging system from LinkedIn, offers O(1) persistence, high throughput, partitioned topics, and flexible delivery guarantees, making it a cornerstone for modern big‑data pipelines and real‑time processing alongside Hadoop, Spark, and Storm.

Big Data · Delivery Guarantees · Distributed Messaging

0 likes · 21 min read

Architecture Digest

Feb 22, 2016 · Big Data

Building High‑Performance Big Data Analytics Systems: Techniques and Best Practices

An in‑depth guide outlines technology‑agnostic best‑practice techniques for building high‑performance big data analytics systems, covering data acquisition, storage, processing, visualization, and security, and explains how to address the five V’s of big data to meet demanding operational and performance requirements.

Analytics · Big Data · Data Engineering

0 likes · 20 min read

ITPUB

Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing, and closes with its future outlook.

Big Data · Distributed Computing · Doug Cutting

0 likes · 15 min read

21CTO

Feb 14, 2016 · Big Data

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.

Big Data · MapReduce · PageRank

0 likes · 15 min read

Qunar Tech Salon

Feb 14, 2016 · Big Data

Accelerating Real‑Time Data Queries with Solr in Alibaba's Jushita Platform

This article explains how Alibaba's Jushita platform leverages Apache Solr with a wide‑table data model and a custom QParser plugin to achieve real‑time, multi‑dimensional buyer filtering that traditional relational databases cannot handle efficiently in big‑data scenarios.

Big Data · Real-time Query · Search Engine

0 likes · 10 min read

Alibaba Cloud Infrastructure

Feb 14, 2016 · Big Data

Small Data vs. Big Data: How Minor Signals Guide Robust Data Management

The article explains why small data are essential for avoiding common big‑data mining traps, illustrates pitfalls through real‑world examples, and offers practical methods—incremental improvement, analogical reasoning, and simple modeling—to harness weak signals for more reliable decision‑making.

Bayes theorem · Big Data · causality

0 likes · 11 min read

21CTO

Feb 1, 2016 · Big Data

How Solr Supercharges Real‑Time Queries in Big Data Environments

This article examines a real‑world case from Alibaba’s Taobao Jushita platform, showing how traditional SQL queries struggle with multi‑dimensional, high‑volume data and how integrating Solr’s inverted‑index search engine—combined with Hive‑generated wide tables and custom QParser plugins—delivers millisecond‑level, scalable query performance for buyer analytics.

Big Data · Hive · Real-time Query

0 likes · 11 min read

21CTO

Jan 25, 2016 · Big Data

How Alibaba’s Pora Powers Real‑Time Personalization at Massive Scale

Pora (Personal Offline Realtime Analyze) is a high‑throughput, low‑latency platform that captures user behavior in real time, enabling Alibaba’s search engine to deliver personalized results, support online learning, and run 24/7 with massive data volumes.

Alibaba · Big Data · Pora

0 likes · 6 min read

Java High-Performance Architecture

Jan 24, 2016 · Big Data

MapReduce Explained: From Library Book Counting to Word Count in Big Data

This article introduces the MapReduce parallel processing model, illustrates its core map and reduce operations with a library‑shelf analogy and a classic word‑count example, and walks through each processing stage using clear diagrams to show how massive data is aggregated efficiently.

Big Data · Hadoop · MapReduce

0 likes · 5 min read

21CTO

Jan 23, 2016 · Big Data

How Massive Is the Data Behind the World’s Biggest Porn Sites?

The article analyzes the staggering traffic, storage needs, and infrastructure of major adult video platforms, revealing that sites like Xvideos and YouPorn handle tens of petabytes of data monthly, requiring bandwidth and hardware comparable to leading streaming services.

Big Data · cloud storage · pornography

0 likes · 8 min read

Qunar Tech Salon

Jan 20, 2016 · Cloud Computing

Technical Architecture of Alipay and Ant Huabei for Large-Scale Promotional Events

The article explains how Alipay's multi-layered cloud architecture, logical data center design, distributed data strategies, and flexible transaction framework enable high availability, horizontal scalability, and rapid deployment for massive promotional traffic such as Double‑11, illustrated with the Ant Huabei case study.

Alipay · Big Data · cloud architecture

0 likes · 21 min read

ITPUB

Jan 20, 2016 · Big Data

How Meizu Built an Agile Big Data Platform for Millions of Users

The Meizu Tech Open Day showcased the company's rapid evolution to a data‑driven mobile internet firm, detailing its DW1.0 and DW2.0 data‑warehouse architectures, recommendation pipelines, Spark adoption, and ELK‑based log analytics, while sharing practical lessons and future challenges.

Big Data · Data Architecture · Data Warehouse

0 likes · 11 min read

Java High-Performance Architecture

Jan 11, 2016 · Big Data

How HDFS Powers Scalable, Reliable Storage in Big Data Environments

This article explains how HDFS abstracts multiple servers into a single file system, splits files into replicated blocks, manages metadata via NameNode and DataNode, and provides linear capacity scaling and high reliability for big data workloads.

Big Data · Data Replication · Distributed File System

0 likes · 5 min read

Qunar Tech Salon

Jan 11, 2016 · Big Data

Architecture of Taobao's Massive Data Products: From Data Sources to the Glider Middleware

The article details Taobao's massive data product architecture, describing a five‑layer system that processes billions of daily records using Hadoop, real‑time streams, distributed MySQL and HBase clusters, and a middleware layer called Glider that unifies queries, caching, and front‑end integration.

Big Data · Data Architecture · Hadoop

0 likes · 16 min read

Baidu Maps Tech Team

Jan 6, 2016 · Big Data

How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform using Apache Kylin, detailing the challenges of multi‑dimensional analysis on billions of rows, the architecture, custom extensions for task, resource, and monitoring management, and performance optimizations that achieve millisecond‑level SQL responses.

Apache Kylin · Big Data · Data Warehouse

0 likes · 21 min read

21CTO

Jan 6, 2016 · Big Data

How Taobao Scales Massive Data Products: Architecture Behind 1.5PB Daily Processing

This article explains how Taobao processes over 1.5 PB of daily data through a five‑layer architecture, combining batch Hadoop jobs, a streaming platform, distributed MySQL and HBase storage, and a unified caching middle layer to deliver fast, scalable data services.

Big Data · Caching

0 likes · 15 min read

Efficient Ops

Jan 5, 2016 · Information Security

How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection

Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop that combines stream‑processing, scalable policy enforcement, and machine‑learning user profiling to protect massive data assets across eBay’s production clusters.

Apache Eagle · Big Data · Hadoop

0 likes · 19 min read

Architect

Jan 5, 2016 · Big Data

Apache Eagle: eBay’s Open‑Source Real‑Time Hadoop Data Security Platform

The article provides a comprehensive technical overview of Apache Eagle, an open‑source, distributed, real‑time security monitoring and alerting platform for Hadoop developed by eBay, covering its motivation, architecture, core components, machine‑learning based detection, typical use cases, and future development directions.

Apache Eagle · Big Data · Data Security

0 likes · 15 min read

21CTO

Jan 3, 2016 · Artificial Intelligence

How to Build a Real-Time Stock Prediction System with Open-Source AI and Big Data Tools

This article presents an open-source reference architecture for real-time stock prediction: a scalable, low-latency pipeline that captures live market data, stores it in memory, and trains and applies machine-learning models using Spring Cloud Data Flow, Apache Geode, Spark MLlib, and related big‑data components.

Big Data · Spark MLlib · Spring Cloud Data Flow

0 likes · 8 min read

Architect

Dec 31, 2015 · Big Data

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.

Big Data · Intelligent QA · New Word Discovery

0 likes · 15 min read

Architect

Dec 30, 2015 · Big Data

Real-Time Big Data Processing with Storm and Kafka on Alibaba Cloud

This article explains how to build a large‑scale, real‑time vehicle monitoring system using Apache Storm and Kafka on Alibaba Cloud, covering the challenges of big‑data ingestion, system architecture, deployment steps, performance testing, and practical lessons learned.

Alibaba Cloud · Big Data · Storm

0 likes · 12 min read

Architects Research Society

Dec 30, 2015 · Artificial Intelligence

IBM Watson Personality Insights: How AI Analyzes Social Media Language to Infer Traits

The article explains how IBM's Watson uses AI and big‑data techniques to examine the words people write on platforms like Twitter and Facebook, extracting personality traits such as openness and neuroticism, and discusses the potential business uses and privacy concerns of this technology.

AI · Big Data · Personality Analysis

0 likes · 9 min read

ITPUB

Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Distributed Computing · Hive

0 likes · 13 min read

21CTO

Dec 22, 2015 · Big Data

How to Build a Scalable Distributed Web Crawler for Massive Data Harvesting

This article explains how to design and implement a distributed web‑crawling framework in Java that can collect, structure, and store massive amounts of data while handling anti‑scraping measures, duplicate detection, and real‑time monitoring.

Big Data · Java · data extraction

0 likes · 11 min read

21CTO

Dec 21, 2015 · Information Security

Why Open Source Is Becoming the Top Choice for Enterprise Security and Innovation

Over the past decade, open‑source software has surged in the enterprise sector, driven by startups and venture capital, with surveys showing widespread adoption, increased contributions, and strong security advantages that are reshaping IT architecture, cloud, and big‑data strategies.

Big Data · Cloud Computing · Venture Capital

0 likes · 4 min read

Architect

Dec 18, 2015 · Big Data

Understanding Apache Kafka’s High‑Throughput Architecture and Performance Optimizations

This article explains Apache Kafka’s core concepts, high‑throughput design choices such as sequential I/O, PageCache, Sendfile, and partitioning, and provides practical performance tips and configuration recommendations for brokers, producers, and consumers in large‑scale data pipelines.

Big Data · Distributed Messaging · architecture

0 likes · 16 min read

21CTO

Dec 7, 2015 · Information Security

How Tencent Combats Fraudsters with Big Data and AI‑Powered Risk Engines

This article explains how Tencent uses big‑data collection, user profiling, and AI‑driven risk learning engines to detect and block malicious accounts, proxy IPs, and fraudulent activities across e‑commerce and other platforms, detailing the architecture, algorithms, and practical defenses employed.

Big Data · anti-fraud · fraud detection

0 likes · 14 min read

21CTO

Dec 7, 2015 · Operations

Inside JD.com’s ‘Qinglong’ Logistics Engine: Architecture, AI, and O2O Innovations

This article dissects JD.com’s Qinglong logistics system, detailing its O2O strategy, big‑data‑driven pre‑sorting, AI algorithms, GIS integration, and the evolution from version 1.0 to 3.0, highlighting how these technologies enable ultra‑fast, agile supply‑chain operations.

AI · Big Data · JD.com

0 likes · 12 min read

Architects Research Society

Dec 3, 2015 · Artificial Intelligence

IBM Donates SystemML to Apache Incubator, Joining the Open‑Source Machine Learning Wave

IBM announced that its SystemML machine‑learning platform will become an Apache Incubator project, highlighting a broader industry trend where tech giants like Google and Facebook open‑source their AI tools to accelerate data‑driven innovation and expand enterprise‑focused machine‑learning ecosystems.

Apache SystemML · Big Data · IBM

0 likes · 5 min read

ITPUB

Dec 3, 2015 · Databases

Choosing the Right Time‑Series Database: Types, Queries, and Performance Trade‑offs

Time‑series data, defined by a timestamp field, appears everywhere. This article explains how to choose an appropriate time‑series database by comparing two schema models, their query patterns, and their performance trade‑offs, and why modern solutions like Elasticsearch, columnar stores, and Druid excel at real‑time massive aggregation.

Aggregation · Big Data · Elasticsearch

0 likes · 9 min read

Architect

Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

21CTO

Dec 1, 2015 · Big Data

How to Build a Real‑Time Price Update System for Billion‑Item E‑Commerce

This article explains the design of a distributed, real‑time price‑update service that handles massive product data, combines query‑driven crawling, observer‑pattern notifications, and multiple data sources to keep e‑commerce price and inventory information fresh within minutes.

Big Data · Real-time Data · distributed architecture

0 likes · 14 min read

Art of Distributed System Architecture Design

Nov 30, 2015 · Big Data

LinkedIn’s Kafka at Scale: Architecture, Optimizations, and Operational Practices

The article details how LinkedIn has scaled Kafka from handling billions to trillions of messages daily, describing quota enforcement, a ZooKeeper‑free consumer, reliability enhancements, security plans, monitoring frameworks, fault‑injection testing, cluster balancing, and integration with other internal data systems.

Big Data · LinkedIn · Monitoring

0 likes · 12 min read

Efficient Ops

Nov 29, 2015 · Big Data

Memory Computing vs Big Data: Trends, Platforms, and Architecture Choices

This article summarizes a WeChat group Q&A on the current momentum of in‑memory computing, compares TimesTen and SAP HANA, and offers practical advice on building enterprise big‑data platforms, covering cloud vs self‑build, talent, investment, and real‑world case studies.

Big Data · architecture · in-memory databases

0 likes · 11 min read

21CTO

Nov 27, 2015 · Fundamentals

What Tech Stack Powers the Most Successful Startups? Insights from AngelList Data

A recent study analyzes startup technology choices, revealing the most popular programming languages, frontend frameworks, databases, mobile platforms, infrastructure services, DevOps tools, search technologies, API integrations, and advanced big‑data solutions across different performance tiers.

Big Data · frontend · programming languages

0 likes · 5 min read

Efficient Ops

Nov 26, 2015 · Big Data

Expert Insights on User Profiling and Stream Processing in Big Data

This article presents expert Q&A on effective user behavior analysis techniques for building detailed user profiles and compares mainstream stream‑processing solutions, outlining key factors such as latency, throughput, parallelism, and fault tolerance for selecting the right real‑time data platform.

Big Data · stream processing · user profiling

0 likes · 11 min read

21CTO

Nov 26, 2015 · Big Data

How Taobao Scales Massive Data Products: Architecture Insights from Data Cube

This article explores Taobao's massive data product architecture, detailing its five-layer design, the use of Hadoop and real‑time systems, hybrid relational and NoSQL storage, a middleware layer for data integration, and systematic caching strategies that enable petabyte‑scale analytics and fast query responses.

Big Data · Caching · storage

0 likes · 16 min read

21CTO

Nov 23, 2015 · Big Data

How Dianping Scales Real‑Time Analytics with Apache Storm

This article explains how Dianping built a millisecond‑level real‑time computation platform using Apache Storm, covering use cases, system architecture, core Storm concepts, performance tuning, best practices, and a detailed Q&A on their production deployment.

Apache Storm · Big Data · Performance Tuning

0 likes · 23 min read

Art of Distributed System Architecture Design

Nov 23, 2015 · Big Data

How Spark Enables Real‑Time Microservice Performance Tracing in the Cloud

This article explains how IBM Research leverages Spark to capture and analyze network traffic of microservice‑based applications in an OpenStack cloud, providing real‑time transaction tracing and batch latency statistics to reveal service dependencies and performance bottlenecks.

Big Data · Cloud · Microservices

0 likes · 8 min read

21CTO

Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop

0 likes · 17 min read

Architect

Nov 19, 2015 · Cloud Computing

Alibaba Cloud Enterprise Architecture Behind Double 11: A Deep Dive into Scalable Cloud Computing

The article details how Alibaba Cloud's multi‑layered enterprise architecture, built on service‑oriented frameworks, distributed databases, and message queues, enabled record‑breaking Double 11 transactions while offering linear performance scaling, high reliability, and cost‑effective operations for large‑scale internet applications.

Alibaba Cloud · Big Data · Enterprise Architecture

0 likes · 8 min read

21CTO

Nov 13, 2015 · Artificial Intelligence

7 Essential Python Tools Every Data Scientist Should Master

Aspiring data specialists should cultivate curiosity and hands‑on experience with production‑grade tools, and this guide highlights seven indispensable Python libraries—IPython, GraphLab Create, pandas, PuLP, matplotlib, scikit‑learn, and Spark—each explained with key features to boost your data‑science career.

Big Data · Python · data analysis

0 likes · 9 min read

Architect

Nov 9, 2015 · Big Data

Modeling User Relationships and Information Propagation on Weibo

The article presents a comprehensive analysis of Weibo's social graph, introducing metrics such as propagation power, intimacy, fan and follow similarity, two‑degree relationships, and relationship circles to model and quantify user interactions and information diffusion within the platform.

Big Data · User Relationship · Weibo

0 likes · 13 min read

21CTO

Nov 4, 2015 · Big Data

How We Built a Real‑Time Log Analytics Platform with Storm and Cardinality Counting

To monitor hundreds of web apps on UAE’s PaaS platform in near‑real time, we combined Storm with lightweight log transport, a memcached‑based fqueue, and adaptive cardinality counting to efficiently compute PV, UV, response times, and custom metrics while handling cross‑cluster log aggregation.

Big Data · Cardinality counting · Log Processing

0 likes · 9 min read

21CTO

Oct 26, 2015 · Big Data

Why the Internet May Fade: The Rise of the Internet of Things

The article explores Eric Schmidt's bold claim that the traditional Internet will disappear, outlines how the Internet of Things is poised to dominate with massive market potential, highlights major tech companies' IoT strategies, compares IoT with the Internet, and details the key technologies driving this new ecosystem.

Big Data · IoT · internet of things

0 likes · 11 min read

Architect

Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop

0 likes · 12 min read

Qunar Tech Salon

Oct 16, 2015 · Databases

Choosing the Right NoSQL Database: MongoDB, Cassandra, and HBase Compared

The article examines why enterprises should consider NoSQL over Hadoop for big data storage, compares the three leading NoSQL databases—MongoDB, Cassandra, and HBase—based on market popularity, technical strengths, scalability, and use‑case suitability, and concludes with guidance on selecting the most appropriate solution.

Big Data · Cassandra · MongoDB

0 likes · 11 min read

Art of Distributed System Architecture Design

Oct 10, 2015 · Artificial Intelligence

Integrating Deep Learning with Apache Hadoop: Caffe-on-Spark on GPU‑Enhanced Clusters

This article describes how Yahoo integrated deep learning into its massive Hadoop ecosystem by adding GPU nodes, using YARN and Spark to run Caffe at scale, and presents performance results on AlexNet and GoogLeNet alongside open‑source contributions.

Big Data · Caffe · GPU

0 likes · 9 min read

21CTO

Sep 28, 2015 · Cloud Computing

How Airbnb Scales on AWS: Cloud Architecture, Big Data, and Machine Learning Insights

Airbnb leverages AWS, Hadoop, Presto, Airflow, and custom machine‑learning tools to power its global marketplace, optimizing search, pricing, and data pipelines while achieving significant cost savings and operational efficiency.

AWS · Airflow · Big Data

0 likes · 7 min read

Art of Distributed System Architecture Design

Sep 25, 2015 · Big Data

Understanding Storm: A Distributed Real-Time Computation System

The article explains the need for low‑latency, high‑performance, distributed real‑time processing, outlines the challenges such systems must address, and introduces Storm as a Hadoop‑like framework for stream processing, detailing its architecture, fault‑tolerance mechanisms, transactional topology, and large‑scale deployment at Taobao.

Big Data · Real-time Processing · Storm

0 likes · 14 min read

21CTO

Sep 24, 2015 · Big Data

Comparing Apache Storm, Spark, and Samza: Which Real‑Time Stream Processor Fits Your Needs?

Apache Storm, Spark Streaming, and Samza are three open‑source, low‑latency, scalable distributed systems for real‑time data processing; this article outlines their architectures, key concepts, differences in data handling, state management, delivery guarantees, and typical use‑cases to help you choose the right framework.

Apache Samza · Apache Storm · Big Data

0 likes · 7 min read

Art of Distributed System Architecture Design

Sep 24, 2015 · Big Data

Comparative Overview of Apache Storm, Spark Streaming, and Samza for Real-Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Samza, explains their architectures, common features, key differences such as delivery guarantees and state management, and provides guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Storm · Big Data · Comparison

0 likes · 8 min read

Art of Distributed System Architecture Design

Sep 23, 2015 · Big Data

Overview of Open-Source Real-Time Stream Processing Systems

This article provides a concise overview of several open‑source real‑time stream processing platforms—including S4, Storm, StreamBase, HStreaming, Esper/NEsper, Kafka, Scribe, and Flume—highlighting their main features, programming languages, and project links for further reference.

Big Data · Real-time · Storm

0 likes · 5 min read

Efficient Ops

Sep 21, 2015 · Operations

How OWL Redefines Enterprise Monitoring with Dynamic Alerts and Scalable Architecture

This article introduces OWL, a distributed, enterprise‑grade monitoring solution that combines infrastructure and business metrics, offers floating alert rules, customizable dashboards, visual asset management, a resilient Golang‑based agent, and a parallel‑scalable HBase storage backend.

Alerting · Big Data · Cloud Native

0 likes · 12 min read

21CTO

Sep 19, 2015 · Artificial Intelligence

Why Distributed Machine Learning Needs More Data Than Speed

The article explains how distributed machine learning evolved from parallel computing to handle massive, long‑tail data sets, discusses the importance of scalability, fault recovery, and data‑parallel algorithms, and reviews frameworks such as MPI, MapReduce, and Pregel for building large‑scale AI systems.

Big Data · LDA · MPI

0 likes · 24 min read

Architect

Sep 18, 2015 · Big Data

Web Data Mining and Analysis of the “Da Gai Er” Section of the Caoliu Forum Using PHP

This article presents a PHP‑based web‑scraping experiment that collects and visualizes several months of data from the “Da Gai Er” board of the Caoliu forum, revealing user activity patterns, image hosting distribution, registration trends, and overall forum health through charts and statistical summaries.

Big Data · PHP · Web Scraping

0 likes · 7 min read

21CTO

Sep 14, 2015 · Backend Development

Why Simple‑Looking Sites Like Taobao Need Hundreds of Top Engineers

Although sites like Taobao appear simple to users, they rely on massive distributed search, caching, storage, load‑balancing, CDN, logging, and data‑analysis systems that demand sophisticated backend engineering, massive infrastructure, and specialized algorithms, explaining why countless top engineers are required to keep them running.

Big Data · Caching · Scalable Architecture

0 likes · 12 min read

Efficient Ops

Aug 27, 2015 · Big Data

Choosing the Right Open‑Source Big Data Stack for Advertising: Expert Insights

This article records a WeChat Q&A where industry experts discuss selecting open‑source big data solutions, advertising‑specific data scenarios, and share a practical lambda‑style platform architecture featuring Hadoop, Spark, Storm, Elasticsearch, Redis and MySQL.

Advertising · Big Data · Data Platform

0 likes · 8 min read

Qunar Tech Salon

Aug 23, 2015 · Big Data

Large‑Scale Twitter Data Collection and Analysis: From Crawling to Sentiment and Market Correlation

The article describes a two‑year, 400‑billion‑tweet crawling project, its statistical and sentiment analyses linking sleep patterns, weekdays, holidays, and market indices, and the low‑cost technical infrastructure built to store and query the massive dataset.

Big Data · Market Correlation · Sentiment Analysis

0 likes · 8 min read

Art of Distributed System Architecture Design

Aug 20, 2015 · Industry Insights

Which Ten Keywords Will Define Enterprise Software Architecture Over the Next Decade?

The article distills ten pivotal keywords (Industry 4.0, Internet+, BFV, microservices, distributed systems, big data, multi‑screen fusion, Docker, OpenStack, and large‑platform micro‑apps) and explains how each shapes the evolution of enterprise software architecture and what challenges and opportunities it brings.

Big Data · Cloud Computing · Enterprise Architecture

0 likes · 11 min read

Qunar Tech Salon

Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib

0 likes · 5 min read

21CTO

Aug 14, 2015 · Frontend Development

From AJAX to Node: A Journey Through Modern Web Development

Tracing the evolution of web technologies—from early AJAX challenges and jQuery’s rise, through Chrome’s dominance, GitHub’s impact, OAuth, JSON, and modern frameworks like Node.js and Bootstrap—the article reflects on how these tools reshaped frontend development and the broader software landscape.

AJAX · Big Data · Node.js

0 likes · 14 min read

Hulu Beijing

Aug 14, 2015 · Big Data

How Voidbox Bridges Docker and YARN for Scalable Big Data Workloads

Voidbox integrates Docker containers with YARN to simplify distributed application development, improve deployment, boost cluster efficiency, and provide fault‑tolerant, DAG‑based execution modes, enabling seamless resource management for Hadoop‑based big data jobs.

Big Data · Cluster Computing · DAG

0 likes · 17 min read

Efficient Ops

Jul 28, 2015 · Operations

How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Games' BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Automation · Big Data · fault-recovery

0 likes · 19 min read

Art of Distributed System Architecture Design

Jul 19, 2015 · Big Data

Selection and Comparison of Big Data Benchmark Standards with a Focus on TPC‑DS

This article reviews the evolution of big‑data management technologies, discusses the criteria for choosing appropriate big‑data benchmarks, compares existing benchmarks such as MapReduce tests, YCSB, BigBench and BigFrame, and provides an in‑depth analysis of the TPC‑DS benchmark and its certification status.

Big Data · Data Management · SQL

0 likes · 15 min read

Architect

Jul 18, 2015 · Databases

Qihoo 360’s Use of MongoDB: Architecture, Practices, and Lessons Learned

The article details how Qihoo 360 has used MongoDB since 2011, scaling to over 100 applications, 1,500 instances, and 20 billion daily queries, and shares the company's architectural choices, backup strategies, best‑practice recommendations, and advice for teams considering MongoDB in large‑scale, cloud‑native environments.

Backup Strategies · Big Data · Database Architecture

0 likes · 12 min read

Qunar Tech Salon

Jul 12, 2015 · Big Data

Airbnb OpenAir Conference: Open‑Source Tools Airpal, Aerosolve, and Airflow

At Airbnb’s inaugural OpenAir conference, the company unveiled three open‑source big‑data tools—Airpal, a Presto‑based visual SQL query engine; Aerosolve, an interpretable machine‑learning engine for pricing recommendations; and Airflow, an internal platform for orchestrating and monitoring data pipelines.

Airbnb · Big Data · OpenAir

0 likes · 4 min read

Art of Distributed System Architecture Design

Jul 10, 2015 · Big Data

Improving Hive Storage Efficiency: From RCFile to ORCFile at Facebook

Facebook’s data warehouse, storing over 300 PB and growing by 600 TB daily, transitioned from the RCFile format to an optimized ORCFile implementation, achieving 5‑8× better compression and up to three‑fold faster write performance while maintaining high read efficiency.

Big Data · Facebook · Hive

0 likes · 14 min read

Qunar Tech Salon

Jul 8, 2015 · Big Data

Understanding Logs: The Foundation of Distributed Systems, Data Integration, and Stream Processing

This article explains how logs—simple, append‑only, time‑ordered records—serve as the core abstraction behind databases, distributed systems, data integration pipelines, and modern stream‑processing platforms such as Kafka and Hadoop, illustrating their design, scalability, and practical challenges.

Big Data · Data Integration · Hadoop

0 likes · 45 min read

Model Perspective

Jul 6, 2015 · Big Data

What Will Future Schools Look Like? Insights from Global Education Leaders

Amid heated debate over China’s Hengshui model, educators worldwide are envisioning future schools that leverage big-data analytics, immersive technology, and flexible, student-centered learning to cultivate critical thinking, creativity, and empathy, moving beyond traditional exam-driven curricula toward personalized, interdisciplinary education.

21st century skills · Big Data · future education

0 likes · 8 min read

Art of Distributed System Architecture Design

Jun 19, 2015 · Big Data

Storm vs Spark: Which Real‑Time Analytics Platform Wins for Your Business?

The article compares Apache Storm and Apache Spark, examining their origins, architecture, language support, integration capabilities, and performance characteristics, and offers guidance on selecting the right platform for real‑time business intelligence based on specific workload and infrastructure needs.

Apache Spark · Apache Storm · Big Data

0 likes · 11 min read

Art of Distributed System Architecture Design

Jun 16, 2015 · Big Data

Social Network Analysis on Weibo: Label Propagation, User Similarity, Community Detection, Influence Ranking, and Spam User Identification

This article introduces a series of algorithms for analyzing the Weibo social network, including label propagation, LDA‑based user similarity, time‑aware and interaction‑aware similarity measures, community detection, influence ranking via PageRank variants, and methods for identifying spam users, illustrating how these techniques can be applied to large‑scale social media data.

Big Data · Social Network Analysis · influence ranking

0 likes · 19 min read

Art of Distributed System Architecture Design

Jun 15, 2015 · Big Data

Designing a Scalable Real‑Time Mobile Analytics Platform with Kafka, Storm, and Amazon EMR

The article describes how a mobile analytics service processes billions of events daily using a Lambda‑style architecture that combines Kafka, Storm, Amazon EMR, and S3 to achieve scalable, fault‑tolerant batch and real‑time computation, while ensuring reliable event ingestion and graceful degradation.

AWS · Big Data · Storm

0 likes · 8 min read

Architect

May 20, 2015 · Backend Development

Interview with Douban Chief Architect Hong QN: System Architecture, BeansDB, DAE, DPark and Team Practices

The interview with Douban's chief architect Hong QN details the platform's online and offline architecture, including load balancing, the DAE PaaS, the BeansDB key‑value store, the DPark big‑data processing engine, and the team organization and operational practices that support these systems.

Big Data · Databases · architecture

0 likes · 10 min read

Art of Distributed System Architecture Design

May 17, 2015 · Databases

Mastering HBase: Table Structure, API Usage, and Performance Tuning

This article explains HBase's column‑oriented architecture, key concepts such as Rowkey, ColumnFamily, and Region, provides Java API examples for table operations, and offers practical optimization techniques—including pre‑splitting, Rowkey design, caching, and compaction settings—to improve read/write performance.

Big Data · HBase · Java API

0 likes · 20 min read

High Availability Architecture

May 15, 2015 · Big Data

Real-Time Computing at Dianping: Architecture, Use Cases, and Best Practices

During a detailed live session, senior Dianping engineer Wang Xinchun explains the company's real‑time computing platform built on Apache Storm, covering use cases such as dashboards, search and recommendation, system architecture, data ingestion tools like Blackhole and Puma, performance tuning, monitoring, and practical best‑practice recommendations.

Apache Storm · Big Data · Real-Time Computing

0 likes · 21 min read

Ctrip Technology

May 14, 2015 · Artificial Intelligence

Data‑Driven User Experience: Machine Learning Applications in Hotel Booking and Marketing at Ctrip

In his 2015 China Hotel Marketing Summit keynote, Ctrip CTO Ye Yamin explained how machine‑learning models built on purchase behavior and order data improve hotel room availability predictions, shorten confirmation times, personalize recommendations, and evaluate advertising effectiveness, illustrating a data‑driven approach to user experience and operations.

Big Data · Marketing · data analytics

0 likes · 14 min read

MaGe Linux Operations

Apr 28, 2015 · Big Data

How LinkedIn Scales Kafka to Billions of Messages Every Day

This article explains how LinkedIn uses Apache Kafka as a high‑throughput, fault‑tolerant messaging backbone, detailing its architecture, message categories, layered replication, audit mechanisms, and the engineering practices that keep billions of daily messages reliable and fast.

Big Data · LinkedIn · distributed systems

0 likes · 11 min read

Art of Distributed System Architecture Design

Apr 24, 2015 · Big Data

Pinterest Real-Time Data Pipeline Using Kafka, Spark, and MemSQL

Pinterest built a real‑time data pipeline that streams user engagement events through Apache Kafka into Spark Streaming, enriches them with location and category information, and persists the results in MemSQL to enable fast, SQL‑based analytics for its recommendation engine.

Big Data · MemSQL · Pinterest

0 likes · 3 min read

Art of Distributed System Architecture Design

Apr 15, 2015 · Big Data

Understanding Stream Processing, Event Sourcing, and Complex Event Processing

The article explains the fundamentals of stream processing, event sourcing, and complex event processing, comparing raw event storage with aggregated results, illustrating architectures with Kafka, Samza, and other frameworks, and highlighting benefits such as scalability, flexibility, and decoupling for modern data‑driven systems.

Apache Kafka · Apache Samza · Big Data

0 likes · 11 min read

MaGe Linux Operations

Apr 7, 2015 · Big Data

How Hadoop’s Tiered Storage Optimizes Data Based on Temperature

This article explains Hadoop’s tiered storage concept, describing how data is classified by temperature—hot, warm, cold, frozen—and automatically moved across disk and archive layers to optimize cost and performance, with examples from Hadoop versions and eBay’s large‑scale deployment.

Big Data · Data Temperature · HDFS

0 likes · 9 min read

Qunar Tech Salon

Mar 16, 2015 · Big Data

Comparison of Apache Storm, Spark Streaming, and Samza for Real‑Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Apache Samza, outlines their architectures, highlights commonalities and differences such as delivery guarantees and state management, and offers guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Samza · Apache Storm · Big Data

0 likes · 8 min read
