How to Detect Missing Data When Syncing PostgreSQL to Elasticsearch with Logstash
This guide explains why Logstash may import fewer rows from a large PostgreSQL table into Elasticsearch, and provides two practical solutions—ID comparison via shell scripts and accelerated comparison using Redis—to quickly identify and resolve data inconsistencies.
1. Real-world Issues
Q1: Logstash syncing PostgreSQL to Elasticsearch results in data inconsistency.
When Logstash imports a table from PostgreSQL (PG) into Elasticsearch (ES), the record count in ES is far lower than in PG (the table has 76 million rows), yet no errors appear in the Logstash logs. How can you quickly identify which rows were not inserted?
Q2: In an asynchronous dual‑write scenario (database plus ES), how can consistency between the database and ES be guaranteed?
2. Recommended Solution – ID Comparison Method
The example below addresses Q1; Q2 follows the same principle.
2.1 Approach
To find the data that was not inserted into ES, work through the following steps:
Ensure the Logstash JDBC input plugin is correctly configured and that its statement selects all required rows.
Check the Logstash output plugin configuration for correct ES connection parameters, and ensure no filter drops events.
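Add a file output plugin alongside the elasticsearch output so that every record read from PostgreSQL is also written to a local file. (The stdout plugin only prints to the console and has no path option, so it cannot be used for this.)
Example configuration:
output {
  elasticsearch {
    ...Elasticsearch configuration...
  }
  file {
    path => "/path/to/logstash_output.log"
    codec => json_lines
  }
}
Compare the Logstash output file with the original PostgreSQL data using a script (Python, Shell, etc.). If the file and PostgreSQL counts match but the ES count is still lower, check ES cluster health and logs.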
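If the problem persists, try reducing the bulk request size. Older versions of the elasticsearch output plugin exposed flush_size and idle_flush_time for this; in current versions those options are obsolete, and the batch sent to ES is governed by pipeline.batch.size and pipeline.batch.delay in logstash.yml.
For large data volumes, tune Logstash and ES performance settings (batch size, JVM heap, thread pools).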
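The count check in the steps above can be sketched in Python. This is a minimal sketch, not the author's method: the function name summarize_ids and the id_field parameter are hypothetical, and it assumes each line of the Logstash output file is one JSON document with an id field. It also reports repeated IDs, because if the elasticsearch output derives document_id from that field (an assumption about the pipeline, not stated in the source), documents with the same ID overwrite one another and the ES count drops without any error being logged.

```python
import json
from collections import Counter


def summarize_ids(path, id_field="id"):
    """Return (total_records, distinct_ids, duplicated_ids) for a JSON-lines file.

    Hypothetical helper: `path` and `id_field` are illustrative names.
    All IDs are held in memory, so very large files need a machine with
    enough RAM or a streaming approach instead.
    """
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            counts[str(json.loads(line)[id_field])] += 1
    total = sum(counts.values())
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    return total, len(counts), duplicated
```

If the total matches the PostgreSQL row count but the number of distinct IDs is lower, the duplicated IDs are the first place to look.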
2.2 Comparison Script Implementation
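The following shell script compares the IDs in the Logstash output (JSON lines) with the IDs exported from PostgreSQL. It assumes /path/to/postgres_data.csv contains one ID per line with no header row.
#!/bin/bash
# Extract IDs from the JSON lines file; -r prints raw values without quotes
jq -r '.id' /path/to/logstash_output.log > logstash_ids.txt
# Sort IDs lexically in the C locale; comm expects this order, and sort -n output breaks it
LC_ALL=C sort logstash_ids.txt > logstash_ids_sorted.txt
LC_ALL=C sort /path/to/postgres_data.csv > postgres_ids_sorted.txt
# Find IDs present in PostgreSQL but absent from the Logstash output
LC_ALL=C comm -23 postgres_ids_sorted.txt logstash_ids_sorted.txt > missing_ids.txt
echo "Missing IDs:"
cat missing_ids.txt
Save the script as compare.sh, make it executable, and run it:
chmod +x compare.sh
./compare.sh
The script requires jq to be installed.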
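The same ID comparison can also be sketched in Python for readers who prefer not to depend on jq. This is an illustrative sketch under stated assumptions: the function names load_logstash_ids and find_missing_ids are hypothetical, the Logstash file holds one JSON document per line with an id field, and the PostgreSQL export holds one ID per line with no header. The Logstash IDs are kept in an in-memory set, so memory use grows with the number of records.

```python
import json


def load_logstash_ids(path, id_field="id"):
    """Collect the IDs found in a JSON-lines file into a set of strings."""
    ids = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                ids.add(str(json.loads(line)[id_field]))
    return ids


def find_missing_ids(pg_ids_path, logstash_ids):
    """Yield each ID in the one-ID-per-line file that is not in logstash_ids."""
    with open(pg_ids_path, encoding="utf-8") as f:
        for line in f:
            pg_id = line.strip()
            if pg_id and pg_id not in logstash_ids:
                yield pg_id
```

Because find_missing_ids is a generator, the PostgreSQL file is streamed rather than loaded at once; only the Logstash side has to fit in memory.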
3. Alternative Solution – Redis Accelerated Comparison
Store IDs from PostgreSQL and Logstash output in Redis sets and compute the difference.
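import csv
import json

import redis

# decode_responses=True makes Redis return str values instead of bytes
r = redis.StrictRedis(host='localhost', port=6379, db=0, decode_responses=True)

# Load PostgreSQL IDs (first CSV column; the first row is a header)
with open('/path/to/postgres_data.csv', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)
    for row in csv_reader:
        r.sadd('postgres_ids', row[0])

# Load Logstash IDs; each non-empty line is one JSON document
with open('/path/to/logstash_output.log') as logstash_file:
    for line in logstash_file:
        if line.strip():
            record_id = json.loads(line)['id']
            r.sadd('logstash_ids', str(record_id))

# IDs present in PostgreSQL but absent from the Logstash output
missing_ids = r.sdiff('postgres_ids', 'logstash_ids')
print("Missing IDs:")
for mid in missing_ids:
    print(mid)
Install the Redis library with pip install redis. The set difference is computed in memory, which can be faster for large datasets, but the method requires a running Redis server with enough memory to hold both ID sets.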
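With tens of millions of rows, sending one SADD per ID costs one network round trip per record. A possible refinement, sketched here under assumptions, is to send IDs in batches, since SADD accepts several members in one call. The helper names chunked and load_ids and the batch_size default are hypothetical; client is expected to be a redis-py client such as the r object in the script above.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_ids(client, key, ids, batch_size=10_000):
    """Add `ids` to the Redis set `key`, one SADD call per batch.

    Returns the number of members that were newly added, as reported
    by the client's sadd() return values.
    """
    added = 0
    for chunk in chunked(ids, batch_size):
        added += client.sadd(key, *chunk)
    return added
```

For example, load_ids(r, 'postgres_ids', (row[0] for row in csv_reader)) would replace the per-row loop for the PostgreSQL file. The right batch size depends on the environment and is worth measuring rather than assuming.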
4. Summary
Solution 1: Shell script (jq, sort, comm)
Pros: Simple; needs only jq and standard Unix tools.
Cons: Slower, high I/O for large data volumes.
Solution 2: Redis accelerated comparison
Pros: Faster, scalable for massive data.
Cons: More complex, requires Redis installation.
Choose the approach that matches your data size and performance requirements.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
