Tagged articles

Spark

623 articles · Page 7 of 7

Apr 12, 2016 · Artificial Intelligence

Designing System and Personalized Recommendation Engines with Mahout and Spark

This article explains the architecture of both system-wide and personalized recommendation modules, compares three recommendation strategies, details the use of Apache Mahout for collaborative filtering with Java code examples, and discusses cold‑start solutions within a Spark‑Hadoop stack.

Mahout · Spark · cold-start

0 likes · 15 min read

Architecture Digest

Apr 9, 2016 · Big Data

Practical Experience of Using Spark at Meituan: Platformization, ETL Templates, Feature Platform, Data Mining, and Real‑World Applications

This article describes how Meituan migrated from Hive‑SQL and MapReduce to Spark on YARN, built an interactive Zeppelin‑based development platform, created reusable ETL templates, constructed a Spark‑driven feature and data‑mining platform, and applied Spark to interactive user‑behavior analysis and large‑scale SEM services, highlighting performance gains and operational benefits.

Big Data · Data Platform · Distributed Computing

0 likes · 19 min read

Architecture Digest

Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · NoSQL

0 likes · 11 min read

Architect

Mar 6, 2016 · Big Data

Clustering Geolocated User Events with DBSCAN and Spark

This article explains how to apply the DBSCAN clustering algorithm to geolocated user event data and leverage Apache Spark’s distributed processing with PairRDDs to efficiently identify frequent user regions, detect outliers, and build location‑based services such as personalized recommendations and security alerts.

Big Data · Clustering · DBSCAN

0 likes · 8 min read

Architect

Feb 29, 2016 · Big Data

Design Principles of Real-Time Distributed Streaming Systems: A Comparison of Spark and Storm

This article examines the design considerations of real-time distributed streaming systems, outlines their background and characteristics, compares the architectures of Spark Streaming and Storm, discusses primitives, message passing, high availability, storage models, and integration with production environments, providing practical insights for architects.

High Availability · Real-time Processing · Spark

0 likes · 20 min read

ITPUB

Jan 20, 2016 · Big Data

How Meizu Built an Agile Big Data Platform for Millions of Users

The Meizu Tech Open Day showcased the company's rapid evolution into a data‑driven mobile internet firm, detailing its DW1.0 and DW2.0 data‑warehouse architectures, recommendation pipelines, Spark adoption, and ELK‑based log analytics, while sharing practical lessons and future challenges.

Big Data · Data Architecture · Data Warehouse

0 likes · 11 min read

High Availability Architecture

Jan 6, 2016 · Big Data

Spark Latest Features, Tungsten Project, and Hulu’s Production Practices

This article reviews Spark's evolution from version 1.2 to 1.6, explains the DataFrame and Tungsten projects, shares Hulu’s real‑world Spark deployments, and discusses performance‑related challenges such as stack overflow, streaming receiver latency, and class‑loader deadlocks.

DataFrames · Dataset API · Hulu

0 likes · 17 min read

Architect

Dec 31, 2015 · Big Data

Using Spark for Machine Learning, New Word Discovery, and Intelligent Q&A

The article explains how to leverage Apache Spark for machine‑learning tasks, large‑scale new‑word discovery, and simple intelligent question‑answering by using Spark‑Shell, Scala code, and word2vec‑based similarity, while sharing practical tips and performance considerations.

Big Data · Intelligent QA · New Word Discovery

0 likes · 15 min read

Efficient Ops

Dec 29, 2015 · Big Data

Unlocking Spark: Machine Learning, New Word Discovery, and Smart Q&A Techniques

This article explains how to leverage Spark for machine learning, discover new terms from massive text corpora, and build intelligent question‑answer systems, sharing practical tips, performance considerations, and real‑world examples for data analysts and algorithm engineers.

Intelligent QA · New Word Discovery · Scala

0 likes · 15 min read

Architect

Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

dbaplus Community

Nov 27, 2015 · Big Data

Why Spark Is the Next Big Thing in Big Data: Core Concepts Explained

This article provides a comprehensive overview of Apache Spark, covering its origins, core concepts such as RDDs, transformations, actions, dependencies, execution modes, and key components like Spark SQL, Streaming, MLlib, and GraphX, while also offering practical code examples and visual illustrations.

DataFrames · GraphX · MLlib

0 likes · 18 min read

Art of Distributed System Architecture Design

Nov 23, 2015 · Big Data

How Spark Enables Real‑Time Microservice Performance Tracing in the Cloud

This article explains how IBM Research leverages Spark to capture and analyze network traffic of microservice‑based applications in an OpenStack cloud, providing real‑time transaction tracing and batch latency statistics to reveal service dependencies and performance bottlenecks.

Big Data · Cloud · Microservices

0 likes · 8 min read

21CTO

Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop

0 likes · 17 min read

Art of Distributed System Architecture Design

Oct 29, 2015 · Big Data

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, pitfalls encountered along the way, and plans for further optimization.

Data Platform · Hadoop · Spark

0 likes · 16 min read

Architect

Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop

0 likes · 12 min read

Efficient Ops

Oct 14, 2015 · Big Data

Spark vs Hadoop, Flink, HBase/Cassandra, Kafka & Tachyon: Expert Q&A

During a lively “Sit and Discuss” session, experts compared Spark and Hadoop, evaluated Flink against Spark, contrasted HBase with Cassandra, explained why Kafka (and sometimes Flink) is preferred for distributed messaging, and shared insights on Tachyon’s role in modern big‑data ecosystems.

Cassandra · Flink · HBase

0 likes · 10 min read

Art of Distributed System Architecture Design

Oct 10, 2015 · Artificial Intelligence

Integrating Deep Learning with Apache Hadoop: Caffe-on-Spark on GPU‑Enhanced Clusters

This article describes how Yahoo integrated deep learning into its massive Hadoop ecosystem by adding GPU nodes, using YARN and Spark to run Caffe at scale, and presents performance results on AlexNet and GoogLeNet alongside open‑source contributions.

Big Data · Caffe · GPU

0 likes · 9 min read

Qunar Tech Salon

Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib

0 likes · 5 min read

Suning Technology

Jul 29, 2015 · Big Data

Highlights from the 2015 Suning Big Data Meetup: Platforms, Spark, and Octopus

The 2015 Suning Big Data Meetup in Nanjing gathered industry experts and researchers to showcase Suning's data platform architecture, Intel's Spark advancements, ZTE's DAP system, and a unified Octopus programming model, emphasizing open knowledge sharing and practical big‑data solutions.

Octopus Model · Spark · Suning

0 likes · 6 min read

Art of Distributed System Architecture Design

Jun 1, 2015 · Big Data

Overview of Big Data Technologies and Architectures

This article provides a comprehensive overview of major big‑data platforms such as Hadoop, Spark, Flink, Kafka, and related ecosystem components, explaining their core concepts, storage models, processing frameworks, and architectural patterns for handling massive, distributed datasets.

Hadoop · NoSQL · Spark

0 likes · 18 min read

MaGe Linux Operations

Feb 3, 2015 · Big Data

Why Spark Beats Hadoop: Exploring RDDs, In‑Memory Computing, and Fault Tolerance

This article explains how Apache Spark improves on Hadoop MapReduce by keeping intermediate data in memory, introduces the core RDD abstraction, compares Spark’s transformations and actions with Hadoop, and shows how Spark can run in Standalone mode or on YARN and be programmed in languages such as Scala, Java, and Python.

Big Data · Java · RDD

0 likes · 20 min read

Baidu Tech Salon

Jan 13, 2015 · Big Data

Inside Spark 1.2: New APIs, In‑Memory Columnar Storage, and Baidu’s High‑Performance Shuffle

This article reviews Spark 1.2’s major enhancements—including the External Data Source API, column pruning, predicate pushdown, and in‑memory columnar storage—while also detailing Baidu’s large‑scale Spark deployments, its custom high‑performance Shuffle service, and the integration of Spark with the Tachyon memory file system.

Baidu · Big Data · External Data Source API

0 likes · 16 min read

Qunar Tech Salon

Dec 4, 2014 · Big Data

Understanding Apache Spark: Architecture, Comparison with Hadoop, Features, and Use Cases

The article explains Apache Spark’s memory‑based distributed computing model, its advantages over Hadoop’s MapReduce, key features, fault tolerance, deployment modes, ecosystem components, and the scenarios where Spark is most effective for large‑scale data analytics.

Distributed Computing · Hadoop · Spark

0 likes · 7 min read
