Tagged articles

Big Data

3720 articles · Page 37 of 38
dbaplus Community
Apr 6, 2016 · Fundamentals

Essential Open‑Source Technologies Every Engineer Should Know

This article provides a comprehensive, curated overview of the most influential open‑source software across the full technology stack—including operating systems, web servers, programming languages, frameworks, databases, big‑data tools, and development utilities—offering practical insights for engineers seeking to understand and adopt proven solutions.

Big Data · Databases · open source
0 likes · 24 min read
21CTO
Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data · Data Infrastructure · ETL
0 likes · 15 min read
dbaplus Community
Apr 3, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Beyond

Facing rapid growth, Asana overhauled its data infrastructure—from a single‑machine MySQL setup to a Redshift‑backed warehouse, Hadoop‑based log processing, Luigi orchestration, and self‑service BI tools—highlighting the challenges, solutions, and future plans for scalable, reliable analytics.

Big Data · Business Intelligence · Data Infrastructure
0 likes · 16 min read
Architect
Apr 3, 2016 · Big Data

Apache Flume NG Architecture, Core Concepts, and Practical Configuration Guide

This article introduces Apache Flume NG, a distributed and reliable log collection system, explains its core architecture components such as Event, Flow, Agent, Source, Channel, and Sink, and provides detailed configuration examples for various pipelines, including load‑balancing, failover, and integration with HDFS.

Apache Flume · Big Data · Configuration
0 likes · 12 min read
21CTO
Mar 31, 2016 · Big Data

Inside Airbnb’s Massive Big Data Platform: Architecture, Lessons & Scaling Secrets

Airbnb’s engineering team outlines the evolution of its big‑data platform, detailing the philosophy behind its architecture, the dual “gold” and “silver” Hive clusters, migration to Mesos, use of Presto, Airpal, Airflow, and the performance and cost gains achieved through these design choices.

Airbnb · Airflow · Big Data
0 likes · 11 min read
Big Data and Microservices
Mar 30, 2016 · Industry Insights

How Text Mining is Transforming the Securities Industry: Trends and Challenges

This article examines the rapid growth of structured and unstructured data in the securities sector, outlines text mining fundamentals, explores key algorithms and tools, and analyzes current industry services, investment communities, and professional solutions while highlighting existing challenges and future opportunities.

Big Data · Industry insight · Sentiment Analysis
0 likes · 32 min read
Architect
Mar 29, 2016 · Big Data

Understanding Apache Storm Architecture, Stream Groupings, and the Acker Mechanism

This article provides a comprehensive overview of Apache Storm’s architecture, including the roles of Nimbus, Supervisor, and ZooKeeper, explains various stream groupings, details the Acker mechanism, and describes task execution, parallelism calculation, and internal data flow within the Storm cluster.

Apache Storm · Big Data · real-time analytics
0 likes · 19 min read
Architecture Digest
Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · NoSQL
0 likes · 11 min read
Big Data and Microservices
Mar 23, 2016 · Industry Insights

Inside the Securities Tech Revolution: Cloud, Microservices, and Big Data

The article examines the paradox of the Chinese securities industry—high demand for cutting‑edge trading, quantitative and high‑frequency systems versus outdated IT—while detailing the team’s FinTech startup approach, their Node.js/Docker/MongoDB stack, a cloud‑native trading platform, microservice architecture, big‑data pipelines, performance tuning, and DevOps practices.

Big Data · Cloud Computing · FinTech
0 likes · 21 min read
ITPUB
Mar 19, 2016 · Big Data

Inside HDFS: How NameNode and DataNode Manage Big Data Writes and Reads

This article explains the fundamentals of distributed file systems, focusing on Hadoop’s HDFS architecture, the separation of metadata and data via NameNode and DataNode, and detailed step‑by‑step write and read processes, including replication, fault recovery, and block splitting across nodes.

Big Data · DataNode · Distributed File System
0 likes · 8 min read
21CTO
Mar 16, 2016 · Big Data

Inside Uber’s Tech: How Data, AI, and Cloud Power Ride‑Sharing in China

Uber’s CTO Thuan Pham revealed at a Chinese tech salon how the company’s global architecture, data‑center strategy, cloud partnership with Baidu, anti‑fraud machine‑learning models, map localization and big‑data analytics together enable a unified yet locally adapted ride‑sharing platform across China and the world.

Big Data · Cloud Computing · Localization
0 likes · 17 min read
Architect
Mar 10, 2016 · Big Data

Analysis and Practice of a Real-Time Hadoop Data Security Solution

The article presents a detailed technical overview of Apache Eagle's real-time Hadoop data security architecture, covering distributed data collection, stream processing, metadata‑driven policy enforcement, machine‑learning‑based anomaly detection, and integration with Hadoop ecosystem components such as HBase, Kafka, and Storm.

Apache Eagle · Big Data · Data Security
0 likes · 25 min read
Architect
Mar 8, 2016 · Big Data

In‑Depth Analysis of Apache Kafka: Architecture, Core Concepts, and Benchmark

This article provides a comprehensive technical overview of Apache Kafka, covering its architecture, core concepts, design goals, comparison with other message queues, replication, consumer groups, delivery guarantees, and performance benchmarking, making it a valuable resource for big‑data engineers.

Big Data · Streaming · kafka
0 likes · 30 min read
Architect
Mar 6, 2016 · Big Data

Clustering Geolocated User Events with DBSCAN and Spark

This article explains how to apply the DBSCAN clustering algorithm to geolocated user event data and leverage Apache Spark’s distributed processing with PairRDDs to efficiently identify frequent user regions, detect outliers, and build location‑based services such as personalized recommendations and security alerts.

Big Data · Clustering · DBSCAN
0 likes · 8 min read
ITPUB
Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Hadoop · Pepperdata
0 likes · 7 min read
21CTO
Feb 23, 2016 · Big Data

Why Kafka Dominates Modern Data Pipelines: Architecture, Benefits, and Guarantees

Kafka, the open‑source distributed messaging system from LinkedIn, offers O(1) persistence, high throughput, partitioned topics, and flexible delivery guarantees, making it a cornerstone for modern big‑data pipelines and real‑time processing alongside Hadoop, Spark, and Storm.

Big Data · Delivery Guarantees · Distributed Messaging
0 likes · 21 min read
Architecture Digest
Feb 22, 2016 · Big Data

Building High‑Performance Big Data Analytics Systems: Techniques and Best Practices

An in‑depth guide outlines technology‑agnostic best‑practice techniques for building high‑performance big data analytics systems, covering data acquisition, storage, processing, visualization, and security, and explains how to address the five V’s of big data to meet demanding operational and performance requirements.

Analytics · Big Data · Data Engineering
0 likes · 20 min read
ITPUB
Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing and its future outlook.

Big Data · Distributed Computing · Doug Cutting
0 likes · 15 min read
21CTO
Feb 14, 2016 · Big Data

How PageRank Works: From Random Surfer Theory to MapReduce Implementation

This article explains the fundamental principles of Google's PageRank algorithm, modeling web pages as a directed graph and a random surfer, discusses matrix formulation, convergence issues like dangling nodes and traps, and demonstrates a practical MapReduce implementation with Python code for large‑scale rank computation.

Big Data · MapReduce · PageRank
0 likes · 15 min read
21CTO
Feb 1, 2016 · Big Data

How Solr Supercharges Real‑Time Queries in Big Data Environments

This article examines a real‑world case from Alibaba’s Taobao Jushita platform, showing how traditional SQL queries struggle with multi‑dimensional, high‑volume data and how integrating Solr’s inverted‑index search engine—combined with Hive‑generated wide tables and custom QParser plugins—delivers millisecond‑level, scalable query performance for buyer analytics.

Big Data · Hive · Real-time Query
0 likes · 11 min read
21CTO
Jan 25, 2016 · Big Data

How Alibaba’s Pora Powers Real‑Time Personalization at Massive Scale

Pora (Personal Offline Realtime Analyze) is a high‑throughput, low‑latency platform that captures user behavior in real time, enabling Alibaba’s search engine to deliver personalized results, support online learning, and run 24/7 with massive data volumes.

Alibaba · Big Data · Pora
0 likes · 6 min read
21CTO
Jan 23, 2016 · Big Data

How Massive Is the Data Behind the World’s Biggest Porn Sites?

The article analyzes the staggering traffic, storage needs, and infrastructure of major adult video platforms, revealing that sites like Xvideos and YouPorn handle tens of petabytes of data monthly, requiring bandwidth and hardware comparable to leading streaming services.

Big Data · cloud storage · pornography
0 likes · 8 min read
Qunar Tech Salon
Jan 20, 2016 · Cloud Computing

Technical Architecture of Alipay and Ant Huabei for Large-Scale Promotional Events

The article explains how Alipay's multi-layered cloud architecture, logical data center design, distributed data strategies, and flexible transaction framework enable high availability, horizontal scalability, and rapid deployment for massive promotional traffic such as Double‑11, illustrated with the Ant Huabei case study.

Alipay · Big Data · cloud architecture
0 likes · 21 min read
ITPUB
Jan 20, 2016 · Big Data

How Meizu Built an Agile Big Data Platform for Millions of Users

The Meizu Tech Open Day showcased the company's rapid evolution to a data‑driven mobile internet firm, detailing its DW1.0 and DW2.0 data‑warehouse architectures, recommendation pipelines, Spark adoption, and ELK‑based log analytics, while sharing practical lessons and future challenges.

Big Data · Data Architecture · Data Warehouse
0 likes · 11 min read
Baidu Maps Tech Team
Jan 6, 2016 · Big Data

How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform using Apache Kylin, detailing the challenges of multi‑dimensional analysis on billions of rows, the architecture, custom extensions for task, resource, and monitoring management, and performance optimizations that achieve millisecond‑level SQL responses.

Apache Kylin · Big Data · Data Warehouse
0 likes · 21 min read
Efficient Ops
Jan 5, 2016 · Information Security

How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection

Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop that combines stream‑processing, scalable policy enforcement, and machine‑learning user profiling to protect massive data assets across eBay’s production clusters.

Apache Eagle · Big Data · Hadoop
0 likes · 19 min read
Architect
Jan 5, 2016 · Big Data

Apache Eagle: eBay’s Open‑Source Real‑Time Hadoop Data Security Platform

The article provides a comprehensive technical overview of Apache Eagle, an open‑source, distributed, real‑time security monitoring and alerting platform for Hadoop developed by eBay, covering its motivation, architecture, core components, machine‑learning based detection, typical use cases, and future development directions.

Apache Eagle · Big Data · Data Security
0 likes · 15 min read
21CTO
Jan 3, 2016 · Artificial Intelligence

How to Build a Real-Time Stock Prediction System with Open-Source AI and Big Data Tools

An open-source reference architecture for real-time stock prediction is presented, detailing a scalable, low-latency pipeline that captures live market data, stores it in memory, trains and applies machine learning models using Spring Cloud Data Flow, Apache Geode, Spark MLlib, and related big‑data components.

Big Data · Spark MLlib · Spring Cloud Data Flow
0 likes · 8 min read
Architect
Dec 31, 2015 · Big Data

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.

Big Data · Intelligent QA · New Word Discovery
0 likes · 15 min read
Architect
Dec 30, 2015 · Big Data

Real-Time Big Data Processing with Storm and Kafka on Alibaba Cloud

This article explains how to build a large‑scale, real‑time vehicle monitoring system using Apache Storm and Kafka on Alibaba Cloud, covering the challenges of big‑data ingestion, system architecture, deployment steps, performance testing, and practical lessons learned.

Alibaba Cloud · Big Data · Storm
0 likes · 12 min read
ITPUB
Dec 29, 2015 · Big Data

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

Big Data · Distributed Computing · Hive
0 likes · 13 min read
21CTO
Dec 22, 2015 · Big Data

How to Build a Scalable Distributed Web Crawler for Massive Data Harvesting

This article explains how to design and implement a distributed web‑crawling framework in Java that can collect, structure, and store massive amounts of data while handling anti‑scraping measures, duplicate detection, and real‑time monitoring.

Big Data · Java · data extraction
0 likes · 11 min read
21CTO
Dec 21, 2015 · Information Security

Why Open Source Is Becoming the Top Choice for Enterprise Security and Innovation

Over the past decade, open‑source software has surged in the enterprise sector, driven by startups and venture capital, with surveys showing widespread adoption, increased contributions, and strong security advantages that are reshaping IT architecture, cloud, and big‑data strategies.

Big Data · Cloud Computing · Venture Capital
0 likes · 4 min read
21CTO
Dec 7, 2015 · Information Security

How Tencent Combats Fraudsters with Big Data and AI‑Powered Risk Engines

This article explains how Tencent uses big‑data collection, user profiling, and AI‑driven risk learning engines to detect and block malicious accounts, proxy IPs, and fraudulent activities across e‑commerce and other platforms, detailing the architecture, algorithms, and practical defenses employed.

Big Data · anti-fraud · fraud detection
0 likes · 14 min read
Architects Research Society
Dec 3, 2015 · Artificial Intelligence

IBM Donates SystemML to Apache Incubator, Joining the Open‑Source Machine Learning Wave

IBM announced that its SystemML machine‑learning platform will become an Apache Incubator project, highlighting a broader industry trend where tech giants like Google and Facebook open‑source their AI tools to accelerate data‑driven innovation and expand enterprise‑focused machine‑learning ecosystems.

Apache SystemML · Big Data · IBM
0 likes · 5 min read
ITPUB
Dec 3, 2015 · Databases

Choosing the Right Time‑Series Database: Types, Queries, and Performance Trade‑offs

Time‑series data, defined by a timestamp field, appears everywhere, and the article explains how to choose an appropriate time‑series database by comparing two schema models, their query patterns, performance trade‑offs, and why modern solutions like Elasticsearch, columnar stores, and Druid excel at real‑time massive aggregation.

Aggregation · Big Data · Elasticsearch
0 likes · 9 min read
Architect
Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Data Warehouse
0 likes · 10 min read
21CTO
Dec 1, 2015 · Big Data

How to Build a Real‑Time Price Update System for Billion‑Item E‑Commerce

This article explains the design of a distributed, real‑time price‑update service that handles massive product data, combines query‑driven crawling, observer‑pattern notifications, and multiple data sources to keep e‑commerce price and inventory information fresh within minutes.

Big Data · Real-time Data · distributed architecture
0 likes · 14 min read

LinkedIn’s Kafka at Scale: Architecture, Optimizations, and Operational Practices

The article details how LinkedIn has scaled Kafka from handling billions to trillions of messages daily, describing quota enforcement, a ZooKeeper‑free consumer, reliability enhancements, security plans, monitoring frameworks, fault‑injection testing, cluster balancing, and integration with other internal data systems.

Big Data · LinkedIn · Monitoring
0 likes · 12 min read
Efficient Ops
Nov 29, 2015 · Big Data

Memory Computing vs Big Data: Trends, Platforms, and Architecture Choices

This article summarizes a WeChat group Q&A on the current momentum of in‑memory computing, compares TimesTen and SAP HANA, and offers practical advice on building enterprise big‑data platforms, covering cloud vs self‑build, talent, investment, and real‑world case studies.

Big Data · architecture · in-memory databases
0 likes · 11 min read
21CTO
Nov 27, 2015 · Fundamentals

What Tech Stack Powers the Most Successful Startups? Insights from AngelList Data

A recent study analyzes startup technology choices, revealing the most popular programming languages, frontend frameworks, databases, mobile platforms, infrastructure services, DevOps tools, search technologies, API integrations, and advanced big‑data solutions across different performance tiers.

Big Data · frontend · programming languages
0 likes · 5 min read
Efficient Ops
Nov 26, 2015 · Big Data

Expert Insights on User Profiling and Stream Processing in Big Data

This article presents expert Q&A on effective user behavior analysis techniques for building detailed user profiles and compares mainstream stream‑processing solutions, outlining key factors such as latency, throughput, parallelism, and fault tolerance for selecting the right real‑time data platform.

Big Data · stream processing · user profiling
0 likes · 11 min read
21CTO
Nov 26, 2015 · Big Data

How Taobao Scales Massive Data Products: Architecture Insights from Data Cube

This article explores Taobao's massive data product architecture, detailing its five-layer design, the use of Hadoop and real‑time systems, hybrid relational and NoSQL storage, a middleware layer for data integration, and systematic caching strategies that enable petabyte‑scale analytics and fast query responses.

Big Data · Caching · storage
0 likes · 16 min read
21CTO
Nov 23, 2015 · Big Data

How Dianping Scales Real‑Time Analytics with Apache Storm

This article explains how Dianping built a millisecond‑level real‑time computation platform using Apache Storm, covering use cases, system architecture, core Storm concepts, performance tuning, best practices, and a detailed Q&A on their production deployment.

Apache Storm · Big Data · Performance Tuning
0 likes · 23 min read
21CTO
Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop
0 likes · 17 min read
Architect
Nov 19, 2015 · Cloud Computing

Alibaba Cloud Enterprise Architecture Behind Double 11: A Deep Dive into Scalable Cloud Computing

The article details how Alibaba Cloud's multi‑layered enterprise architecture, built on service‑oriented frameworks, distributed databases, and message queues, enabled record‑breaking Double 11 transactions while offering linear performance scaling, high reliability, and cost‑effective operations for large‑scale internet applications.

Alibaba Cloud · Big Data · Enterprise Architecture
0 likes · 8 min read
21CTO
Nov 13, 2015 · Artificial Intelligence

7 Essential Python Tools Every Data Scientist Should Master

Aspiring data specialists should cultivate curiosity and hands‑on experience with production‑grade tools, and this guide highlights seven indispensable Python libraries—IPython, GraphLab Create, pandas, PuLP, matplotlib, scikit‑learn, and Spark—each explained with key features to boost your data‑science career.

Big Data · Python · data analysis
0 likes · 9 min read
Architect
Nov 9, 2015 · Big Data

Modeling User Relationships and Information Propagation on Weibo

The article presents a comprehensive analysis of Weibo's social graph, introducing metrics such as propagation power, intimacy, fan and follow similarity, two‑degree relationships, and relationship circles to model and quantify user interactions and information diffusion within the platform.

Big Data · User Relationship · Weibo
0 likes · 13 min read
21CTO
Nov 4, 2015 · Big Data

How We Built a Real‑Time Log Analytics Platform with Storm and Cardinality Counting

To monitor hundreds of web apps on UAE’s PaaS platform in near‑real time, we combined Storm with lightweight log transport, a memcached‑based fqueue, and adaptive cardinality counting to efficiently compute PV, UV, response times, and custom metrics while handling cross‑cluster log aggregation.

Big Data · Cardinality counting · Log Processing
0 likes · 9 min read
21CTO
Oct 26, 2015 · Big Data

Why the Internet May Fade: The Rise of the Internet of Things

The article explores Eric Schmidt's bold claim that the traditional Internet will disappear, outlines how the Internet of Things is poised to dominate with massive market potential, highlights major tech companies' IoT strategies, compares IoT with the Internet, and details the key technologies driving this new ecosystem.

Big Data · IoT · internet of things
0 likes · 11 min read
Architect
Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop
0 likes · 12 min read
Qunar Tech Salon
Oct 16, 2015 · Databases

Choosing the Right NoSQL Database: MongoDB, Cassandra, and HBase Compared

The article examines why enterprises should consider NoSQL over Hadoop for big data storage, compares the three leading NoSQL databases—MongoDB, Cassandra, and HBase—based on market popularity, technical strengths, scalability, and use‑case suitability, and concludes with guidance on selecting the most appropriate solution.

Big Data · Cassandra · MongoDB
0 likes · 11 min read

Understanding Storm: A Distributed Real-Time Computation System

The article explains the need for low‑latency, high‑performance, distributed real‑time processing, outlines the challenges such systems must address, and introduces Storm as a Hadoop‑like framework for stream processing, detailing its architecture, fault‑tolerance mechanisms, transactional topology, and large‑scale deployment at Taobao.

Big Data · Real-time Processing · Storm
0 likes · 14 min read
21CTO
Sep 24, 2015 · Big Data

Comparing Apache Storm, Spark, and Samza: Which Real‑Time Stream Processor Fits Your Needs?

Apache Storm, Spark Streaming, and Samza are three open‑source, low‑latency, scalable distributed systems for real‑time data processing; this article outlines their architectures, key concepts, differences in data handling, state management, delivery guarantees, and typical use‑cases to help you choose the right framework.

Apache Samza · Apache Storm · Big Data
0 likes · 7 min read

Comparative Overview of Apache Storm, Spark Streaming, and Samza for Real-Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Samza, explains their architectures, common features, key differences such as delivery guarantees and state management, and provides guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Storm · Big Data · Comparison
0 likes · 8 min read
21CTO
Sep 19, 2015 · Artificial Intelligence

Why Distributed Machine Learning Needs More Data Than Speed

The article explains how distributed machine learning evolved from parallel computing to handle massive, long‑tail data sets, discusses the importance of scalability, fault recovery, and data‑parallel algorithms, and reviews frameworks such as MPI, MapReduce, and Pregel for building large‑scale AI systems.

Big Data · LDA · MPI
0 likes · 24 min read
21CTO
Sep 14, 2015 · Backend Development

Why Simple‑Looking Sites Like Taobao Need Hundreds of Top Engineers

Although sites like Taobao appear simple to users, they rely on massive distributed search, caching, storage, load‑balancing, CDN, logging, and data‑analysis systems that demand sophisticated backend engineering, massive infrastructure, and specialized algorithms, explaining why countless top engineers are required to keep them running.

Big Data · Caching · Scalable Architecture
0 likes · 12 min read
Art of Distributed System Architecture Design
Aug 20, 2015 · Industry Insights

Which Ten Keywords Will Define Enterprise Software Architecture Over the Next Decade?

The article distills ten pivotal keywords (Industry 4.0, Internet+, BFV, microservices, distributed systems, big data, multi‑screen fusion, Docker, OpenStack, and large‑platform micro‑apps) and explains how each shapes the evolution of enterprise software architecture and what challenges and opportunities it brings.

Big Data · Cloud Computing · Enterprise Architecture
0 likes · 11 min read
Qunar Tech Salon
Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine‑learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib
0 likes · 5 min read
21CTO
Aug 14, 2015 · Frontend Development

From AJAX to Node: A Journey Through Modern Web Development

Tracing the evolution of web technologies—from early AJAX challenges and jQuery’s rise, through Chrome’s dominance, GitHub’s impact, OAuth, JSON, and modern tools like Node.js and Bootstrap—the article reflects on how these technologies reshaped frontend development and the broader software landscape.

AJAX · Big Data · Node.js
0 likes · 14 min read
Hulu Beijing
Aug 14, 2015 · Big Data

How Voidbox Bridges Docker and YARN for Scalable Big Data Workloads

Voidbox integrates Docker containers with YARN to simplify distributed application development, improve deployment, boost cluster efficiency, and provide fault‑tolerant, DAG‑based execution modes, enabling seamless resource management for Hadoop‑based big data jobs.

Big Data · Cluster Computing · DAG
0 likes · 17 min read
Efficient Ops
Jul 28, 2015 · Operations

How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch

This article explains how Tencent Games' BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, leveraging big‑data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.

Automation · Big Data · fault-recovery
0 likes · 19 min read

Selection and Comparison of Big Data Benchmark Standards with a Focus on TPC‑DS

This article reviews the evolution of big‑data management technologies, discusses the criteria for choosing appropriate big‑data benchmarks, compares existing benchmarks such as MapReduce tests, YCSB, BigBench and BigFrame, and provides an in‑depth analysis of the TPC‑DS benchmark and its certification status.

Big Data · Data Management · SQL
0 likes · 15 min read
Architect
Jul 18, 2015 · Databases

Qihoo 360’s Use of MongoDB: Architecture, Practices, and Lessons Learned

The article details how Qihoo 360 has used MongoDB since 2011, scaling to over 100 applications, 1,500 instances and 20 billion daily queries, and shares their architectural choices, backup strategies, best‑practice recommendations, and advice for teams considering MongoDB in large‑scale, cloud‑native environments.

Backup Strategies · Big Data · Database Architecture
0 likes · 12 min read
Qunar Tech Salon
Jul 12, 2015 · Big Data

Airbnb OpenAir Conference: Open‑Source Tools Airpal, Aerosolve, and Airflow

At Airbnb’s inaugural OpenAir conference, the company unveiled three open‑source big‑data tools—Airpal, a Presto‑based visual SQL query engine; Aerosolve, an interpretable machine‑learning engine for pricing recommendations; and Airflow, an internal platform for orchestrating and monitoring data pipelines.

Airbnb · Big Data · OpenAir
0 likes · 4 min read
Model Perspective
Jul 6, 2015 · Big Data

What Will Future Schools Look Like? Insights from Global Education Leaders

Amid heated debate over China’s Hengshui model, educators worldwide are envisioning future schools that leverage big-data analytics, immersive technology, and flexible, student-centered learning to cultivate critical thinking, creativity, and empathy, moving beyond traditional exam-driven curricula toward personalized, interdisciplinary education.

21st century skills · Big Data · future education
0 likes · 8 min read

Storm vs Spark: Which Real‑Time Analytics Platform Wins for Your Business?

The article compares Apache Storm and Apache Spark, examining their origins, architecture, language support, integration capabilities, and performance characteristics, and offers guidance on selecting the right platform for real‑time business intelligence based on specific workload and infrastructure needs.

Apache Spark · Apache Storm · Big Data
0 likes · 11 min read

Social Network Analysis on Weibo: Label Propagation, User Similarity, Community Detection, Influence Ranking, and Spam User Identification

This article introduces a series of algorithms for analyzing the Weibo social network, including label propagation, LDA‑based user similarity, time‑aware and interaction‑aware similarity measures, community detection, influence ranking via PageRank variants, and methods for identifying spam users, illustrating how these techniques can be applied to large‑scale social media data.

Big Data · Social Network Analysis · influence ranking
0 likes · 19 min read

Designing a Scalable Real‑Time Mobile Analytics Platform with Kafka, Storm, and Amazon EMR

The article describes how a mobile analytics service processes billions of events daily using a Lambda‑style architecture that combines Kafka, Storm, Amazon EMR, and S3 to achieve scalable, fault‑tolerant batch and real‑time computation, while ensuring reliable event ingestion and graceful degradation.

AWS · Big Data · Storm
0 likes · 8 min read

Mastering HBase: Table Structure, API Usage, and Performance Tuning

This article explains HBase's column‑oriented architecture, key concepts such as Rowkey, ColumnFamily, and Region, provides Java API examples for table operations, and offers practical optimization techniques—including pre‑splitting, Rowkey design, caching, and compaction settings—to improve read/write performance.

Big Data · HBase · Java API
0 likes · 20 min read
High Availability Architecture
May 15, 2015 · Big Data

Real-Time Computing at Dianping: Architecture, Use Cases, and Best Practices

During a detailed live session, senior Dianping engineer Wang Xinchun explains the company's real‑time computing platform built on Apache Storm, covering use cases such as dashboards, search and recommendation, system architecture, data ingestion tools like Blackhole and Puma, performance tuning, monitoring, and practical best‑practice recommendations.

Apache Storm · Big Data · Real-Time Computing
0 likes · 21 min read
Ctrip Technology
May 14, 2015 · Artificial Intelligence

Data‑Driven User Experience: Machine Learning Applications in Hotel Booking and Marketing at Ctrip

In his 2015 China Hotel Marketing Summit keynote, Ctrip CTO Ye Yamin explained how machine‑learning models built on purchase behavior and order data improve hotel room availability predictions, shorten confirmation times, personalize recommendations, and evaluate advertising effectiveness, illustrating a data‑driven approach to user experience and operations.

Big Data · Marketing · data analytics
0 likes · 14 min read
MaGe Linux Operations
Apr 28, 2015 · Big Data

How LinkedIn Scales Kafka to Billions of Messages Every Day

This article explains how LinkedIn uses Apache Kafka as a high‑throughput, fault‑tolerant messaging backbone, detailing its architecture, message categories, layered replication, audit mechanisms, and the engineering practices that keep billions of daily messages reliable and fast.

Big Data · LinkedIn · distributed systems
0 likes · 11 min read

Understanding Stream Processing, Event Sourcing, and Complex Event Processing

The article explains the fundamentals of stream processing, event sourcing, and complex event processing, comparing raw event storage with aggregated results, illustrating architectures with Kafka, Samza, and other frameworks, and highlighting benefits such as scalability, flexibility, and decoupling for modern data‑driven systems.

Apache Kafka · Apache Samza · Big Data
0 likes · 11 min read
MaGe Linux Operations
Apr 7, 2015 · Big Data

How Hadoop’s Tiered Storage Optimizes Data Based on Temperature

This article explains Hadoop’s tiered storage concept, describing how data is classified by temperature—hot, warm, cold, frozen—and automatically moved across disk and archive layers to optimize cost and performance, with examples from Hadoop versions and eBay’s large‑scale deployment.

Big Data · Data Temperature · HDFS
0 likes · 9 min read