
Speed Up File Path Verification with Bash: Split & Parallel Execution

This guide shows how to efficiently verify whether millions of file paths exist on remote servers by splitting a large list into smaller chunks and processing each chunk concurrently with a Bash script, dramatically reducing runtime compared to a single‑threaded approach.

Liangxu Linux

Background

In a production environment, a MapReduce job generated a file list of over 3 million lines. Each line contains a file path whose existence must be checked on servers outside the Hadoop cluster. A simple Bash script was initially considered but proved too slow.

Method 1 – Simple Linear Scan (Inefficient)

The first script reads the original file line by line, extracts the fifth field (the directory) with awk, tests the path with -e, and appends the line to exist.txt or noexist.txt accordingly.

#!/bin/bash
count=0
# Read the list line by line; -r keeps backslashes intact, and redirecting
# into the loop (instead of piping from cat) avoids a needless subshell.
while IFS= read -r data; do
  count=$(( count + 1 ))
  echo "$count"
  dir=$(echo "$data" | awk -F '\t' '{print $5}')
  if [ -e "$dir" ]; then          # quote $dir: paths may contain spaces
    echo "$data" >> exist.txt
  else
    echo "$data" >> noexist.txt
  fi
done < oriTest.txt

Processing 5,000 lines took nearly 4–5 minutes on an 8‑core machine, which was unacceptable.

Method 2 – Split File and Parallel Processing

The improved approach divides the large file into smaller pieces and processes each piece in the background.

Step 1: Split the Large File

split -l 10000 oriTest.txt

This creates files named xaa, xab, … each containing 10,000 lines (the last chunk may be shorter).
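A quick way to confirm the chunk layout, using a small generated file (oriTest_demo.txt is a made-up name for illustration):

```shell
# Generate a 25,000-line demo file, then split it into 10,000-line chunks.
seq 1 25000 > oriTest_demo.txt
split -l 10000 oriTest_demo.txt
# xaa and xab hold 10,000 lines each; xac holds the remaining 5,000.
wc -l xaa xab xac
```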

Step 2: Store Chunk Names in an Array

declare -a files
files=(x*)   # glob directly; parsing ls output breaks on unusual filenames

Step 3: Define a Function to Process One Chunk

readdata(){
  # $1 is a chunk file name (e.g. xaa); results go to per-chunk output files.
  while IFS= read -r data; do
    dir=$(echo "$data" | awk -F '\t' '{print $5}')
    if [ -e "$dir" ]; then
      echo "$data" >> "exist_$1.txt"
    else
      echo "$data" >> "noexist_$1.txt"
    fi
  done < "$1"
}
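A minimal smoke test of this worker, redefining readdata as above and feeding it a two-line chunk (the names xzz and demo_dir are made up for the example):

```shell
#!/bin/bash
readdata(){
  while IFS= read -r data; do
    dir=$(echo "$data" | awk -F '\t' '{print $5}')
    if [ -e "$dir" ]; then
      echo "$data" >> "exist_$1.txt"
    else
      echo "$data" >> "noexist_$1.txt"
    fi
  done < "$1"
}

mkdir -p demo_dir                               # a path that exists
printf 'a\tb\tc\td\tdemo_dir\n'    >  xzz       # should land in exist_xzz.txt
printf 'a\tb\tc\td\tmissing_dir\n' >> xzz       # should land in noexist_xzz.txt
readdata xzz
```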

Step 4: Launch Parallel Workers

for i in "${files[@]}"; do
  echo "$i"
  readdata "$i" &
done
wait   # block until every background worker has finished

Each readdata call runs in the background, allowing multiple chunks to be processed simultaneously and reducing total execution time dramatically.
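Because each worker writes per-chunk output files, a final merge step collects everything back into two lists. A sketch (the two echo lines simulate finished workers so the merge has something to act on):

```shell
# Simulate the per-chunk results two workers would leave behind.
echo "line1" > exist_xaa.txt
echo "line2" > exist_xab.txt
echo "line3" > noexist_xaa.txt

wait                                  # no-op here; waits for background jobs
cat exist_x*.txt   > exist.txt        # merge per-chunk results
cat noexist_x*.txt > noexist.txt
```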

Conclusion

By splitting the input file and executing the verification function in parallel, the script scales with the number of CPU cores and completes the existence check far faster than the single‑threaded version, making it suitable for large‑scale file‑path validation tasks.
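One caveat: launching one background job per chunk can oversubscribe the machine when there are hundreds of chunks. A bounded-parallelism sketch using GNU xargs -P follows; the names paths.txt, present.txt, and check_one are hypothetical, and note that concurrent short appends to a shared file are effectively atomic on Linux but not guaranteed by POSIX:

```shell
#!/bin/bash
# Check one path per invocation; append to exist.txt or noexist.txt.
check_one() {
  if [ -e "$1" ]; then
    echo "$1" >> exist.txt
  else
    echo "$1" >> noexist.txt
  fi
}
export -f check_one                                 # make it visible to bash -c

touch present.txt                                   # a path that exists
printf '%s\n' present.txt absent.txt > paths.txt    # sample path list
: > exist.txt; : > noexist.txt                      # start with empty results

# -P caps the number of concurrent workers at the CPU count.
xargs -a paths.txt -n 1 -P "$(nproc)" bash -c 'check_one "$1"' _
```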


Tags: bash, parallel processing, split command, file verification, Linux scripting
Written by

Liangxu Linux

Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
