Spark Scala Example: Find the Most Frequent Visitor ID in a 500‑Million‑Record Dataset

This article demonstrates how to generate 500 million visitor IDs with Spark, use map‑reduce operations to count occurrences, and identify the ID with the highest visit count, while discussing performance considerations such as memory spilling and cluster resources.

Big Data Technology & Architecture

Jan 25, 2020

Scenario description: For a large website with billions of visits, the article outlines the typical data recorded per user (ID, timestamp, dwell time, actions, IP, etc.) and estimates that storing 500 million user IDs would occupy roughly 5 GB of disk space.
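The 5 GB figure implies roughly 10 bytes per stored ID. The arithmetic can be sketched as below; the 10-byte size is an assumption inferred from the estimate, not something the article states, and `StorageEstimate` is a hypothetical name.

```scala
object StorageEstimate {
  // Assumption: about 10 bytes per ID, inferred from "500 million IDs ~ 5 GB".
  val bytesPerId: Long = 10L

  // Total size in decimal gigabytes (1 GB = 1e9 bytes).
  def totalGb(records: Long, bytesPerRecord: Long): Double =
    records * bytesPerRecord / 1e9

  def main(args: Array[String]): Unit = {
    println(totalGb(500000000L, bytesPerId))  // 5.0
    println(totalGb(5000000000L, bytesPerId)) // 50.0
  }
}
```

The second line is the ten-times-larger case (5 billion records), which scales linearly to about 50 GB.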

Problem description: Given a generated dataset of 500 million IDs, the goal is to find the ID that appears most frequently, using this as a practical Spark interview exercise.

Problem analysis: The dataset can be cached as an RDD, each ID mapped to a count of 1, then aggregated with reduceByKey to obtain (ID, count) pairs, and finally the maximum count is extracted.
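The three steps can be tried on a handful of IDs with plain Scala collections before running them on a cluster. This is only a local sketch: `groupBy` followed by a sum stands in for Spark's `reduceByKey`, and `CountSketch` and the sample IDs are made up for illustration.

```scala
object CountSketch {
  // Returns the (id, count) pair with the highest count.
  // Throws on an empty input, because there is no maximum to return.
  def mostFrequent(ids: Seq[Long]): (Long, Int) =
    ids
      .map(id => (id, 1))                                     // map each ID to (ID, 1)
      .groupBy(_._1)                                          // group pairs by ID
      .map { case (id, pairs) => (id, pairs.map(_._2).sum) }  // sum the 1s per ID
      .maxBy(_._2)                                            // keep the largest count

  def main(args: Array[String]): Unit = {
    val sample = Seq(7L, 3L, 7L, 9L, 7L, 3L)
    println(mostFrequent(sample)) // (7,3)
  }
}
```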

Implementation – Scala code:

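import org.apache.spark.{SparkConf, SparkContext}

// RandomId is not shown in the article. It is assumed here to be a
// Serializable helper whose next() returns a random Long visitor ID.
object ActiveVisitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("spark://master:7077").setAppName("ActiveVisitor")
    val sc = new SparkContext(conf)
    // 9,999 seeds x 50,001 IDs each: roughly 500 million IDs in total
    val seeds = 1 until 10000
    val id = new RandomId()
    val (maxId, max) = sc.parallelize(seeds)
      .flatMap { num =>
        // build the 50,001 IDs for this seed
        val ids = List.fill(50001)(id.next())
        // progress as a percentage of the seeds (printed in executor logs)
        println(num / 100.0 + "%")
        ids
      }
      .map((_, 1))          // (ID, 1)
      .reduceByKey(_ + _)   // (ID, count)
      // reduce returns the winning pair to the driver; vars mutated inside
      // foreach would only change executor-local copies on a cluster
      .reduce((a, b) => if (a._2 >= b._2) a else b)
    println(s"Most frequent ID: $maxId ($max occurrences)")
    sc.stop()
  }
}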
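The listing relies on a `RandomId` class that the article does not show. A minimal stand-in might look like the sketch below. Everything in it is an assumption: the constructor parameter `idSpace`, its default of 10 billion, and the uniform distribution are illustrative choices, not the author's implementation, so the counts it produces need not match the results reported in the article.

```scala
import java.util.concurrent.ThreadLocalRandom

// Hypothetical stand-in for the article's RandomId (not shown in the source).
// Serializable so that Spark can ship an instance inside a task closure.
class RandomId(idSpace: Long = 10000000000L) extends Serializable {
  require(idSpace > 0, "idSpace must be positive")

  // Draws a uniformly distributed ID in [0, idSpace).
  // ThreadLocalRandom keeps no state in this object and is safe to call
  // from several executor threads at once.
  def next(): Long = ThreadLocalRandom.current().nextLong(idSpace)
}
```

A smaller `idSpace` produces more repeated IDs, which is convenient when trying the job on a small sample.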

The same logic is repeated in slightly different forms later in the article, showing alternative ways to generate the data and process it.

Running and results: When submitted to a Spark cluster (four nodes, each with 5 GB memory), the job runs for about 47 minutes, producing logs that show parallel execution, frequent memory spilling (≈1 GB per spill, 49 times), and the final most‑frequent ID appearing only a few times (e.g., 8 occurrences).
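Some rough arithmetic puts the reported figures in perspective. The sketch below uses only the numbers quoted above (500 million records, 47 minutes, 49 spills of about 1 GB, four nodes with 5 GB each); `RunStats` is a hypothetical name and the results are approximations.

```scala
object RunStats {
  // Average records processed per second over the whole run.
  def recordsPerSecond(records: Long, minutes: Int): Double =
    records / (minutes * 60.0)

  val spilledGb: Double = 49 * 1.0       // 49 spills of about 1 GB each
  val clusterMemoryGb: Double = 4 * 5.0  // four nodes with 5 GB each

  def main(args: Array[String]): Unit = {
    println(f"throughput: ${recordsPerSecond(500000000L, 47)}%.0f records/s")
    println(s"spilled: $spilledGb GB, cluster memory: $clusterMemoryGb GB")
  }
}
```

That works out to roughly 177,000 records per second, with about 49 GB spilled against 20 GB of total cluster memory, which is consistent with the shuffle not fitting in memory.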

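Observations: Because most IDs appear only once or twice, sorting the entire result set would be wasteful; a single pass that tracks the running maximum is more resource-efficient. Note that variables mutated inside foreach are updated only in executor-local copies when the job runs on a cluster, so an action such as reduce is the reliable way to return the maximum to the driver. The article also notes that scaling to 5 billion records would require ~50 GB storage and significantly more time.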
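The difference between the two approaches can be seen on a small list of (ID, count) pairs. This is a local sketch with plain Scala collections rather than RDDs, and `MaxVsSort` and the sample values are illustrative. Both functions return the same pair, but the sort orders every element while the single pass keeps only one running winner.

```scala
object MaxVsSort {
  // Sorts all pairs by descending count, then takes the first: O(n log n).
  def maxBySort(counts: Seq[(Long, Int)]): (Long, Int) =
    counts.sortBy(-_._2).head

  // Single pass that keeps only the current best pair: O(n).
  def maxBySinglePass(counts: Seq[(Long, Int)]): (Long, Int) =
    counts.reduce((a, b) => if (a._2 >= b._2) a else b)

  def main(args: Array[String]): Unit = {
    val counts = Seq((11L, 1), (42L, 8), (7L, 2), (13L, 1))
    println(maxBySort(counts))       // (42,8)
    println(maxBySinglePass(counts)) // (42,8)
  }
}
```

On an RDD the same pairwise comparison can be passed to `reduce`, which combines results within each partition first and sends only one candidate per partition to the driver.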

Conclusion: The example illustrates practical big‑data processing with Spark, covering data generation, aggregation, performance bottlenecks, and optimization strategies.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
