
Processing 10GB Age Data on a 4GB PC: Single‑Thread vs Multi‑Thread Solutions

This article walks through generating a 10GB file of age data, reading it line‑by‑line on a machine with only 4GB RAM, and compares a single‑thread counting approach with a multithreaded producer‑consumer design, showing performance gains, memory usage, and practical tips.

IT Architects Alliance

May 25, 2022


Scenario Description

A 10GB file contains integers ranging from 18 to 70, each representing one person's age. The task is to find the age that appears most frequently, using a computer with 4GB of memory and a dual‑core CPU.

Data Generation

A Java program creates the 10GB file by writing random integers (18‑70) to disk as comma‑separated values, 1,000,000 records per line (≈4 MB per line, about 2,500 lines total). The code uses Random to produce the values and BufferedWriter to append them to the file.
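A scaled-down sketch of such a generator is shown below. The class name AgeFileGenerator, the generate method, and its parameters are hypothetical and chosen for illustration; the original code is not reproduced here. Unlike the description above, this sketch overwrites the target file rather than appending to it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Hypothetical helper: writes comma-separated random ages, one batch per line.
final class AgeFileGenerator {
    static final int MIN_AGE = 18;
    static final int MAX_AGE = 70;

    /** Writes `lines` lines, each holding `recordsPerLine` comma-separated ages. */
    static void generate(Path target, int lines, int recordsPerLine, long seed) throws IOException {
        Random random = new Random(seed);
        // Files.newBufferedWriter creates the file or truncates an existing one.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (int line = 0; line < lines; line++) {
                StringBuilder sb = new StringBuilder(recordsPerLine * 3);
                for (int i = 0; i < recordsPerLine; i++) {
                    if (i > 0) {
                        sb.append(',');
                    }
                    // nextInt(bound) excludes the bound, so add 1 to include MAX_AGE.
                    sb.append(MIN_AGE + random.nextInt(MAX_AGE - MIN_AGE + 1));
                }
                writer.write(sb.toString());
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path target = Files.createTempFile("ages", ".txt");
        generate(target, 5, 1_000, 42L); // tiny sample, not the full 10GB
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

Raising lines to about 2,500 and recordsPerLine to 1,000,000 would approximate the file size described in the article. Building one line at a time keeps memory use bounded to a few megabytes.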

Reading Data

A single‑threaded method reads the file with BufferedReader.readLine(), printing progress every 100 lines and measuring elapsed time. Reading the entire 10GB file takes roughly 20 seconds; with about 2,500 lines of 1,000,000 records each, that is on the order of 125 lines (125 million records) per second.
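The read loop can be sketched as follows. LineReaderDemo and readAll are hypothetical names, and the progress interval is a parameter here so the sketch can be tried on a small file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: streams a file line by line without loading it into memory.
final class LineReaderDemo {
    /** Reads every line, logs progress every `progressEvery` lines, returns the line count. */
    static long readAll(Path source, int progressEvery) throws IOException {
        long start = System.currentTimeMillis();
        long lineCount = 0;
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            // Only the current line is held in memory; earlier lines become garbage.
            while (reader.readLine() != null) {
                lineCount++;
                if (lineCount % progressEvery == 0) {
                    long elapsed = System.currentTimeMillis() - start;
                    System.out.println("read " + lineCount + " lines in " + elapsed + " ms");
                }
            }
        }
        return lineCount;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("ages", ".txt");
        Files.write(source, Arrays.asList("18,19,20", "21,22,23", "70,70,18"));
        System.out.println("total lines: " + readAll(source, 2));
    }
}
```

Because each line is about 4 MB, a line-at-a-time loop like this stays far below the 4GB limit regardless of total file size.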

Single‑Thread Processing

The first solution processes the file line‑by‑line, splits each line by commas, and updates a ConcurrentHashMap<String, AtomicInteger> (countMap) where the key is the age and the value is its occurrence count. After the scan, the map is iterated to locate the age with the highest count.
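A minimal sketch of the counting step, assuming the map type named above. The class AgeCounter and its method names are hypothetical; only countMap and its type come from the article.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical wrapper around the countMap described in the article.
final class AgeCounter {
    final Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();

    /** Splits one comma-separated line and increments the counter of each age. */
    void countLine(String line) {
        for (String age : line.split(",")) {
            if (age.isEmpty()) {
                continue; // tolerate stray separators
            }
            countMap.computeIfAbsent(age, k -> new AtomicInteger()).incrementAndGet();
        }
    }

    /** Scans the map once and returns the age with the highest count, or null if empty. */
    String mostFrequentAge() {
        String bestAge = null;
        int bestCount = -1;
        for (Map.Entry<String, AtomicInteger> entry : countMap.entrySet()) {
            int current = entry.getValue().get();
            if (current > bestCount) {
                bestCount = current;
                bestAge = entry.getKey();
            }
        }
        return bestAge;
    }

    public static void main(String[] args) {
        AgeCounter counter = new AgeCounter();
        counter.countLine("18,19,18,70");
        counter.countLine("18,70");
        System.out.println("most frequent age: " + counter.mostFrequentAge());
    }
}
```

The map never holds more than 53 entries (ages 18 to 70), so the counting state itself is negligible; memory pressure comes from the line strings and the arrays produced by split.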

Multithreaded Divide‑and‑Conquer

To improve CPU utilization, a producer‑consumer model is introduced:

1. Initialize a list of LinkedBlockingQueue instances (one per consumer thread, capacity 256).

2. The producer reads each line, determines a queue index using count.get() % threadNums, and puts the line into the corresponding queue.

3. Each consumer thread takes lines from its dedicated queue, splits them, and updates the shared countMap safely.
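The steps above can be sketched as follows. The class QueueFanOut and its count method are hypothetical names. For shutdown, this sketch uses a sentinel string placed on each queue, which is an assumption made for brevity and not the mechanism the article describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one bounded queue per consumer, round-robin dispatch.
final class QueueFanOut {
    // Sentinel telling a consumer to stop (assumption of this sketch).
    // Real data lines contain only digits and commas, so they never equal it.
    private static final String END = "__END__";

    static Map<String, AtomicInteger> count(Iterable<String> lines, int threadNums)
            throws InterruptedException {
        Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();
        List<BlockingQueue<String>> queues = new ArrayList<>();
        List<Thread> consumers = new ArrayList<>();

        for (int i = 0; i < threadNums; i++) {
            BlockingQueue<String> queue = new LinkedBlockingQueue<>(256);
            queues.add(queue);
            Thread consumer = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take(); // blocks while the queue is empty
                        if (END.equals(line)) {
                            break;
                        }
                        for (String age : line.split(",")) {
                            countMap.computeIfAbsent(age, k -> new AtomicInteger())
                                    .incrementAndGet();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "consumer-" + i);
            consumers.add(consumer);
            consumer.start();
        }

        // Producer: put() blocks when a queue is full, which bounds memory use.
        AtomicLong count = new AtomicLong();
        for (String line : lines) {
            int index = (int) (count.get() % threadNums);
            queues.get(index).put(line);
            count.incrementAndGet();
        }
        for (BlockingQueue<String> queue : queues) {
            queue.put(END);
        }
        for (Thread consumer : consumers) {
            consumer.join();
        }
        return countMap;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> lines = Arrays.asList("18,19,18", "70,18", "19,70,70,70");
        System.out.println(count(lines, 2));
    }
}
```

The queue capacity matters on a 4GB machine: with lines of about 4 MB each, a full queue of 256 lines holds roughly 1 GB, so the number of queues and their capacity together cap how much unprocessed data can pile up.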

An auxiliary method splitStr divides a line into three roughly equal parts, adjusting split positions to avoid breaking on commas, and each part is processed by a separate thread.
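One way such a helper might look is sketched below. Only the name splitStr comes from the article; the signature, the parts parameter, and the choice to move each cut forward to the next comma are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a comma-aware splitter.
final class LineSplitter {
    /** Splits a comma-separated line into at most `parts` chunks, cutting only at commas. */
    static List<String> splitStr(String line, int parts) {
        List<String> chunks = new ArrayList<>();
        int length = line.length();
        int chunkSize = Math.max(1, length / parts);
        int start = 0;
        while (start < length) {
            int end = start + chunkSize;
            if (chunks.size() == parts - 1 || end >= length) {
                end = length; // the last chunk takes whatever remains
            } else {
                // Move the cut forward to the next comma so no number is split in two.
                int comma = line.indexOf(',', end);
                end = (comma == -1) ? length : comma;
            }
            chunks.add(line.substring(start, end));
            start = end + 1; // skip the comma at the cut position
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(splitStr("18,19,20,21,22,23", 3));
    }
}
```

Splitting on a raw character offset would occasionally cut a two-digit age into two one-digit tokens and corrupt the counts, which is why the cut position is adjusted to a comma boundary.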

Additional structures such as AtomicLong count, volatile boolean startConsumer, and volatile boolean consumerRunning control the workflow and termination.

Performance Results

Single‑thread processing consumes 2–2.5 GB of memory, and CPU usage stays low (20‑25%). The multithreaded version raises CPU utilization to over 90% and reduces total processing time from ~180 seconds to ~103 seconds (a speedup of about 1.75×, or roughly 43% less wall‑clock time), while producing identical results.

Encountered Issues & Tips

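During execution, occasional GC pauses can cause the program to stall; the workaround used here is to insert short sleeps and explicit System.gc() calls after processing a batch. Note that System.gc() is only a request to the JVM, so its effect depends on the collector and its settings. The demo creates threads manually; in production, a thread pool should be used instead.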
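As a rough illustration of the thread-pool suggestion, the sketch below submits each chunk of a line to a fixed-size ExecutorService. PooledCounter and its count method are hypothetical names; this is one possible shape, not the article's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: counting chunks on a shared, fixed-size thread pool.
final class PooledCounter {
    static Map<String, AtomicInteger> count(List<String> chunks, int threads) throws Exception {
        Map<String, AtomicInteger> countMap = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String chunk : chunks) {
                pending.add(pool.submit(() -> {
                    for (String age : chunk.split(",")) {
                        countMap.computeIfAbsent(age, k -> new AtomicInteger())
                                .incrementAndGet();
                    }
                }));
            }
            // Wait for every task; get() also rethrows any failure from a worker.
            for (Future<?> task : pending) {
                task.get();
            }
        } finally {
            pool.shutdown(); // lets the worker threads exit once the queue drains
        }
        return countMap;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(Arrays.asList("18,19", "18,70", "18"), 2));
    }
}
```

Executors.newFixedThreadPool uses an unbounded task queue, so for a 10GB input the producer would still need some form of back-pressure (for example, waiting on pending tasks after each line) to avoid queuing more data than fits in memory.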

