
How to Detect Missing Data When Syncing PostgreSQL to Elasticsearch with Logstash

This guide looks at a case where Logstash imports fewer rows from a large PostgreSQL table into Elasticsearch than expected, and presents two practical ways to find the missing records: comparing IDs with a shell script, and a faster comparison using Redis sets.

Programmer DD

Jun 30, 2023


1. Real-world Issues

Q1: Syncing PostgreSQL to Elasticsearch with Logstash produces inconsistent data.

When Logstash imports a table from PostgreSQL into Elasticsearch (ES), the document count in ES ends up far lower than the row count in PostgreSQL (the table has 76 million rows). The Logstash logs show no errors. How can you quickly find out which rows were not inserted?

Q2: When the database and ES are written asynchronously (dual write), how can consistency between the two be guaranteed?

2. Recommended Solution – ID Comparison Method

The examples below use Q1; the same principle applies to Q2.

2.1 Approach

To find data not inserted into ES, you can:

Verify that the Logstash JDBC input is configured correctly (driver, connection string) and that the SQL statement selects every row you expect.

Check the connection parameters of the Elasticsearch output plugin, and make sure no filter drops events.

Add a file output plugin to write the records read from PostgreSQL to a file. (The stdout plugin only prints to the console and has no path option.)

Example configuration:

output {
  elasticsearch {
    ...Elasticsearch configuration...
  }
  file {
    path => "/path/to/logstash_output.log"
    codec => json_lines
  }
}

Compare the Logstash output file with the original PostgreSQL data using a script (Python, shell, etc.). If the file and PostgreSQL agree but ES still holds fewer documents, the loss happens at indexing time, so check the ES cluster health and logs.

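If the problem persists, try reducing the size of the bulk requests. Older versions of the Elasticsearch output plugin exposed flush_size and idle_flush_time for this; in recent versions those options are obsolete, and batching is controlled by the pipeline settings pipeline.batch.size and pipeline.batch.delay.

For large data volumes, tune the Logstash and ES performance settings (pipeline batch size, JVM heap, thread pools).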
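
Before comparing IDs one by one, a quick sanity check on the Logstash output file can narrow things down. The following is a minimal Python sketch, assuming the file holds one JSON event per line and that the primary key lives in a field named id; count_records is a hypothetical helper name. It counts total events and distinct IDs. Because it keeps every distinct ID in memory, it is better suited to a sample or a partition of the file than to all 76 million rows at once.

```python
import json
from typing import Iterable, Tuple


def count_records(lines: Iterable[str], id_field: str = "id") -> Tuple[int, int]:
    """Return (total_records, distinct_ids) for an iterable of JSON lines.

    `id_field` is an assumption: adjust it to the primary key of your table.
    """
    total = 0
    distinct = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        total += 1
        # Normalize to str so that 1 and "1" count as the same ID.
        distinct.add(str(event[id_field]))
    return total, len(distinct)


# Small in-memory example; in practice pass an open file object instead.
sample = ['{"id": 1, "name": "a"}', '{"id": 2, "name": "b"}', '{"id": 1, "name": "a2"}']
print(count_records(sample))  # (3, 2)
```

If the total is higher than the number of distinct IDs and the Elasticsearch output uses that field as document_id, later events overwrite earlier ones, which is one possible reason for a lower document count without any error in the logs.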

2.2 Comparison Script Implementation

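The following shell script compares the Logstash output (JSON lines) with the PostgreSQL export by ID. It assumes the first column of /path/to/postgres_data.csv is the unquoted ID and that the first row is a header. Note that comm requires its inputs to be sorted in plain byte order, not numerically, so the script uses sort without -n under a fixed locale.

#!/bin/bash
set -euo pipefail
# comm needs both inputs sorted in the same byte order, so pin the locale
export LC_ALL=C
# Extract IDs from the JSON lines file; -r prints raw values without quotes
jq -r '.id' /path/to/logstash_output.log | sort -u > logstash_ids_sorted.txt
# Take the first CSV column (the ID), skipping the header row
tail -n +2 /path/to/postgres_data.csv | cut -d, -f1 | sort -u > postgres_ids_sorted.txt
# IDs present in PostgreSQL but absent from the Logstash output
comm -23 postgres_ids_sorted.txt logstash_ids_sorted.txt > missing_ids.txt
echo "Missing IDs:"
cat missing_ids.txt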

Save the script as compare.sh, make it executable, and run it:

chmod +x compare.sh
./compare.sh

The script requires jq to be installed.
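
For numeric IDs, the same comparison can also be sketched in Python as a streaming merge over two sorted inputs, which keeps memory usage constant regardless of table size. This is an illustrative alternative rather than part of the script above; missing_ids is a hypothetical helper name, and it assumes both inputs are already sorted in ascending numeric order (for example with sort -n).

```python
from typing import Iterable, Iterator


def missing_ids(expected_sorted: Iterable[int], actual_sorted: Iterable[int]) -> Iterator[int]:
    """Yield IDs present in expected_sorted but absent from actual_sorted.

    Both inputs must be sorted ascending. A missing ID that appears several
    times in expected_sorted is yielded once per occurrence.
    """
    actual_iter = iter(actual_sorted)
    current = next(actual_iter, None)
    for expected in expected_sorted:
        # Advance the actual side until it catches up with the expected ID.
        while current is not None and current < expected:
            current = next(actual_iter, None)
        if current is None or current != expected:
            yield expected


# Small in-memory example with PostgreSQL IDs first, Logstash IDs second.
print(list(missing_ids([1, 2, 3, 5, 8], [1, 3, 4, 8])))  # [2, 5]
```

To run it against files of one ID per line, pass generators such as (int(line) for line in fh) for each side, so neither file is loaded into memory in full.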

3. Alternative Solution – Redis Accelerated Comparison

Load the IDs from the PostgreSQL export and from the Logstash output into two Redis sets, then let Redis compute the difference with SDIFF.

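import csv
import json

import redis

# decode_responses=True makes Redis return str instead of bytes
r = redis.Redis(host='localhost', port=6379, db=0, decode_responses=True)

# Start from empty sets so IDs from an earlier run do not skew the result
r.delete('postgres_ids', 'logstash_ids')

# Load PostgreSQL IDs (first CSV column; the first row is a header)
with open('/path/to/postgres_data.csv', newline='') as csvfile:
    csv_reader = csv.reader(csvfile)
    next(csv_reader)
    for row in csv_reader:
        r.sadd('postgres_ids', row[0])

# Load Logstash IDs (one JSON event per line)
with open('/path/to/logstash_output.log') as logstash_file:
    for line in logstash_file:
        if not line.strip():
            continue
        # Parse the JSON rather than splitting on '"id":', which breaks on
        # quoted IDs, nested fields, or an id that is the last key in the event
        event = json.loads(line)
        r.sadd('logstash_ids', str(event['id']))

# IDs present in PostgreSQL but absent from the Logstash output
missing_ids = r.sdiff('postgres_ids', 'logstash_ids')
print("Missing IDs:")
for mid in missing_ids:
    print(mid)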

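Install the Python client with pip install redis. The set difference is computed inside Redis, in memory, which makes the comparison itself fast for large datasets, but the approach requires a running Redis server with enough memory to hold both ID sets.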
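
Calling SADD once per ID costs one network round trip per row, which adds up over tens of millions of IDs. The following is a minimal sketch of batching the inserts, assuming a redis-py client, whose sadd method accepts several members in one call. The names chunked and load_ids are hypothetical helpers, and the default chunk size of 10,000 is an arbitrary starting point, not a tuned value.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` IDs."""
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_ids(client, key: str, ids: Iterable[str], chunk_size: int = 10_000) -> int:
    """Add IDs to the Redis set `key` in batches.

    `client` is expected to behave like a redis-py client. Returns the number
    of SADD calls issued.
    """
    calls = 0
    for chunk in chunked(ids, chunk_size):
        client.sadd(key, *chunk)  # one round trip for the whole chunk
        calls += 1
    return calls
```

With the client from the script above, a call might look like load_ids(r, 'postgres_ids', (row[0] for row in csv_reader)).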

4. Summary

Solution 1: Shell script (jq, sort, comm)

Pros: Simple; needs only jq and standard Unix tools.

Cons: Slower, with high disk I/O when sorting large data volumes.

Solution 2: Redis accelerated comparison

Pros: Faster, scalable for massive data.

Cons: More complex, requires Redis installation.

Choose the approach that matches your data size and performance requirements.
