Databases 12 min read

MySQL Data Recovery Using Binlog Analysis and Reverse Binlog Generation

This article details a real‑world MySQL production data loss incident, explains how to identify the relevant binlog range, use binlog2sql and MyFlash to generate reverse SQL or binary logs, and outlines the step‑by‑step recovery process and post‑mortem reflections for operations teams.

Aikesheng Open Source Community
Aikesheng Open Source Community
Aikesheng Open Source Community
MySQL Data Recovery Using Binlog Analysis and Reverse Binlog Generation

1. Background and General Idea

On February 25, 2020, a massive failure occurred in WeChat Moments after a large‑scale repost of Weimob, leaving core production data unrecovered for 36 hours. The author uses a similar case—where a developer accidentally deleted production data—to summarize the handling process and provide a warning for operations engineers.

At 23:00 on February 13, a request was received to help recover the data.

System environment:

Operating System: RHEL 7.5

Database: MySQL 5.7 Community Edition, one master and two replicas

At 23:05 the incident response began, and a rough solution plan was defined:

Identify who performed which operation and when.

Assess the impact of the operation on the system and other systems, and verify whether it was a normal business action.

Determine the time window of the affected logs in the database.

Re‑play the incident in a simulation environment.

Develop a technical recovery plan and validate it in the simulation environment.

Verify that the application works correctly after recovery in the simulation environment.

Backup production data and apply the recovery plan to the production environment.

Perform a green‑light test in production; once confirmed, complete the recovery.

Because restoring production data is a major adjustment, leadership approval and a complete rollback plan are required.

2. Data Recovery Process and Technical Analysis

The author spent five minutes clarifying the overall approach, then focused on two main challenges:

1. Determining the start and end points of the binlog to be restored.

2. Based on the binlog range, defining the recovery method and deciding whether other interfering data in that period needs to be excluded.

Solving the First Problem

1. The developer reported that around 20:20 they called a REST API to delete a workflow template, which removed all instances under that template, including five in‑progress processes.

2. The author logged into the workflow platform’s database, located the binlog files around 20:20, and backed up binlog file #11.

3. The binlog was copied to a development server and parsed with

mysqlbinlog -v --base64-output=decode-rows \
--skip-gtids=true --start-datetime='2020-02-13 20:10:00' \
--stop-datetime='2020-02-13 21:30:00' \
-d {$DBNAME} mysql-bin.000011 >> aa.log dbname

.

4. After reviewing the parsed SQL, no massive DELETE statements were found at 20:20, indicating the developer’s description might be inaccurate. The principle is to never trust statements without evidence.

5. Further inspection showed a surge of DELETE and UPDATE statements starting at 20:30, suggesting the problematic time window.

6. The author summarized the affected tables, operation types, and business IDs, then confirmed with workflow platform colleagues that deleting a template indeed touched those tables, giving hope for recovery.

7. An open‑source tool, binlog2sql , can translate binlog events into SQL and also generate reverse SQL, making the problem appear easier to solve.

8. The tool was installed in the simulation environment; it is a Python program requiring a Python runtime and its dependencies.

9. In the simulation environment, the problematic instances were restored, and the workflow platform’s JDBC URL was pointed to the recovered instance.

Solving the Second Problem

1. The simulation environment already reproduced the production failure and had binlog2sql installed.

2. Using binlog2sql, the erroneous SQL statements were extracted and verified with workflow platform engineers to ensure no other applications accessed the database during that window.

3. The workflow team confirmed that all extracted SQL were the result of the mistaken operation.

4. To validate, a workflow template was created in the simulation environment, several test instances were added, the template was deleted via the API, and the generated SQL was compared with the previously extracted statements.

5. The reverse SQL generated by binlog2sql was applied in the simulation environment, but some INSERT statements failed because a table contained a LONGBLOB column, causing garbled data.

6. Since the table is critical for system operation, the author considered generating a reverse binary binlog instead of reverse SQL.

7. A project named MyFlash was found on GitHub, which can generate reverse binary binlogs.

8. Using MyFlash, a reverse binary binlog was created and applied to the database; the workflow platform verified that the data was perfectly restored in the simulation environment.

3. Reflections

Question 1: Why not use backup‑based restoration?

Although daily full backups exist, restoring from a backup would revert the entire workflow platform, affecting many other applications and causing data loss for them.

Question 2: Why not perform table‑level recovery?

The workflow platform’s data model has strong inter‑table relationships; restoring individual tables could break data constraints.

Reflection 1: Why did data loss occur in production?

The developer bypassed the simulation environment and deployed directly to production without strict adherence to the release process.

Reflection 2: Technical capability of the development team

Developers were unfamiliar with the activity workflow and its template modification procedures, indicating a need for skill improvement.

4. Follow‑up Actions

Based on the analysis, the following strategies are recommended to improve the release process:

Automate the release workflow to minimize manual intervention.

Standardize release procedures; all scripts and steps for new applications must be validated before deployment.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

incident managementmysqlData RecoveryDatabase operationsReverse Binlog
Aikesheng Open Source Community
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.