Speed Up File Path Verification with Bash: Split & Parallel Execution
This guide shows how to efficiently verify whether millions of file paths exist on remote servers by splitting a large list into smaller chunks and processing each chunk concurrently with a Bash script, dramatically reducing runtime compared to a single‑threaded approach.
Background
In a production environment, a MapReduce job generated a file list of over 3 million lines, each containing a file path that had to be checked for existence on servers outside the Hadoop cluster. A simple Bash script was tried first but proved too slow.
Method 1 – Simple Linear Scan (Inefficient)
The first script reads the original file line by line, extracts the fifth field (the directory) with awk, tests the path with -e, and appends the line to exist.txt or noexist.txt accordingly.
#!/bin/bash
count=0
# Read the list line by line; -r keeps backslashes literal.
while IFS= read -r data; do
    count=$((count + 1))
    echo "$count"
    # The fifth tab-separated field holds the path to test.
    dir=$(echo "$data" | awk -F "\t" '{print $5}')
    if [ -e "$dir" ]; then
        echo "$data" >> exist.txt
    else
        echo "$data" >> noexist.txt
    fi
done < oriTest.txt
Processing 5,000 lines took 4–5 minutes on an 8‑core machine, which was unacceptable.
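Part of the per-line cost in a loop like this comes from starting a subshell and an awk process for every record. As a rough sketch, and assuming every line has at least five tab-separated fields, the fifth field can instead be extracted with shell parameter expansion, so no external program runs per line. The function name `check_paths` and its three arguments are illustrative and do not come from the original script.

```shell
#!/bin/bash
# Sketch only: check_paths and its argument order are hypothetical names.
# Assumes each input line has at least five tab-separated fields.
check_paths() {
    local input="$1" exist_out="$2" noexist_out="$3"
    local data rest dir i tab
    tab=$(printf '\t')
    while IFS= read -r data; do
        rest=$data
        # Drop the first four fields by removing text up to each tab.
        for i in 1 2 3 4; do
            rest=${rest#*"$tab"}
        done
        # Keep what precedes the next tab: the fifth field.
        dir=${rest%%"$tab"*}
        if [ -e "$dir" ]; then
            printf '%s\n' "$data" >> "$exist_out"
        else
            printf '%s\n' "$data" >> "$noexist_out"
        fi
    done < "$input"
}
```

A call such as `check_paths oriTest.txt exist.txt noexist.txt` would mirror the script above. This removes only the process-spawning overhead; if the existence test itself is slow (for example on a network mount), that latency remains.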
Method 2 – Split File and Parallel Processing
The improved approach divides the large file into smaller pieces and processes each piece in the background.
Step 1: Split the Large File
split -l 10000 oriTest.txt
This creates files named xaa, xab, … each containing 10,000 lines (the last one may be shorter).
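To see the chunking behaviour on a small scale, the sketch below splits a ten-line sample into three-line pieces. The temporary directory, the `sample.txt` file, the `chunk_` prefix, and the `-a 3` suffix length are choices made for this illustration, not part of the original script. A distinct prefix keeps a later glob from matching unrelated files that happen to start with `x`.

```shell
#!/bin/bash
# Sketch only: the work directory, sample file, and chunk_ prefix are illustrative.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a ten-line sample in place of the real list.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > sample.txt

# -l sets lines per chunk, -a sets the suffix length, chunk_ is the output prefix.
split -l 3 -a 3 sample.txt chunk_

# List the pieces with their line counts.
wc -l chunk_*
```

This yields four files, `chunk_aaa` through `chunk_aad`, with the last one holding the single leftover line.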
Step 2: Store Chunk Names in an Array
declare -a files
# Expand the glob directly instead of parsing ls output.
files=(x*)
Step 3: Define a Function to Process One Chunk
readdata(){
    # $1 is the chunk file name; results go to per-chunk output files.
    while IFS= read -r data; do
        dir=$(echo "$data" | awk -F "\t" '{print $5}')
        if [ -e "$dir" ]; then
            echo "$data" >> "exist_$1.txt"
        else
            echo "$data" >> "noexist_$1.txt"
        fi
    done < "$1"
}
Step 4: Launch Parallel Workers
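for i in "${files[@]}"; do
    echo "$i"
    readdata "$i" &
done
# Wait for every background worker before the script exits.
wait
Each readdata call runs in the background, so multiple chunks are processed simultaneously and total execution time drops sharply. Each worker writes to its own exist_ and noexist_ files, so concurrent jobs never append to the same file.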
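Launching every chunk at once means that a list of over 3 million lines, split into 10,000-line pieces, starts more than 300 background jobs together. One way to cap that, offered as a sketch and not as the author's method, is to start workers in batches and `wait` for each batch before launching the next. The names `run_in_batches` and `max_jobs` are hypothetical, and `readdata` is restated here only so the block is self-contained.

```shell
#!/bin/bash
# Sketch only: run_in_batches and max_jobs are illustrative names.

# Worker: same logic as the per-chunk function, with quoting added.
readdata() {
    local chunk="$1" data dir
    while IFS= read -r data; do
        dir=$(printf '%s\n' "$data" | awk -F '\t' '{print $5}')
        if [ -e "$dir" ]; then
            printf '%s\n' "$data" >> "exist_$chunk.txt"
        else
            printf '%s\n' "$data" >> "noexist_$chunk.txt"
        fi
    done < "$chunk"
}

# Run readdata on every argument after the first, at most max_jobs at a time.
run_in_batches() {
    local max_jobs="$1" running=0 chunk
    shift
    for chunk in "$@"; do
        readdata "$chunk" &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait            # block until the whole batch has finished
            running=0
        fi
    done
    wait                    # catch the final, possibly partial, batch
}

# Example call (assumes chunk files named by split's defaults):
#   run_in_batches 8 x??
```

Batching is simple but leaves slots idle while the slowest worker in a batch finishes. Bash 4.3 and later also provide `wait -n`, which returns as soon as any one job exits and could be used to refill slots individually.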
Conclusion
By splitting the input file and running the verification function on the chunks in parallel, the script can use all available CPU cores instead of one and completes the existence check far faster than the single‑threaded version, making it suitable for large‑scale file‑path validation tasks.
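Because each worker writes its own `exist_<chunk>.txt` and `noexist_<chunk>.txt`, one more step is needed to get a single list of each kind. A minimal sketch follows, assuming the chunk names start with `x` as produced by `split` in Step 1 and that all workers have already finished. `merge_results` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: merge_results is an illustrative helper, not part of the original script.
merge_results() {
    local kind part
    for kind in exist noexist; do
        : > "$kind.txt"                  # start from an empty combined file
        # Globs expand in sorted order, so chunks are appended in split order.
        for part in "$kind"_x*.txt; do
            # An unmatched glob stays literal; skip it.
            if [ -e "$part" ]; then
                cat "$part" >> "$kind.txt"
            fi
        done
    done
}
```

Run from the directory holding the per-chunk outputs, `merge_results` leaves the combined lists in `exist.txt` and `noexist.txt`.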
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)