How to Efficiently Delete Hundreds of Millions of Rows from a Large MySQL Table Without Indexes

When about 2 billion rows matching a specific xxid value must be purged from a 370 GB MySQL table of 4.7 billion rows, and no index can be added, a divide-and-conquer strategy that walks primary-key ranges and batch-deletes with a Python script offers a safe, space-saving solution.

Programmer DD

Jul 5, 2019

Business requirement: Table a holds roughly 4.7 billion rows (≈370 GB). About 2 billion rows have xxid='xxx' and must be removed, but disk space is tight and creating a new index is not feasible.

Solution 1 – When an index exists

If the xxid column is indexed, a simple loop can delete rows in small batches:

delete from a where xxid='xxx' limit 500;

The loop repeats until a statement affects no rows.
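The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the author's code: the helper name delete_in_batches and its pause parameter are hypothetical, and conn is assumed to be a pymysql connection, whose cursor.execute returns the number of affected rows.

```python
import time


def delete_in_batches(conn, batch_size=500, pause=0.0):
    """Repeat DELETE ... LIMIT until a batch affects no rows; return the total deleted.

    Assumption: conn behaves like a pymysql connection, i.e. cursor.execute()
    returns the number of affected rows.
    """
    sql = "delete from a where xxid = %s limit %s"
    total = 0
    with conn.cursor() as cursor:
        while True:
            affected = cursor.execute(sql, ("xxx", batch_size))
            conn.commit()  # keep each batch in its own short transaction
            if affected == 0:
                return total
            total += affected
            time.sleep(pause)  # optional throttle between batches
```

A small batch size keeps each transaction short; the pause can be raised if the deletes put too much load on the server.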

Solution 2 – When xxid has no index

Because adding an index would consume disk space that is not available, we delete by primary-key ranges instead. The table is processed in segments (e.g., 1,000–2,000 rows per segment). Within each segment we delete the rows matching the condition. Each DELETE then scans only a short primary-key range instead of the whole table, which keeps every statement fast and every transaction small.

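select min(a.id) min_id, max(a.id) max_id from (select id from a where id > last_id order by id limit 2000) a;

delete from a where xxid='xxx' and id >= min_id and id <= max_id;

Here last_id, min_id and max_id are placeholders: last_id is the highest id already processed, and min_id / max_id are the bounds returned by the first statement. After each batch, last_id is set to max_id and the process repeats until the first statement finds no more ids.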
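The range-walking idea can be tried end to end without a MySQL server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (an assumption made for illustration; the article targets MySQL), with a small table a and a hypothetical helper delete_by_pk_ranges. It also assumes ids are positive, so that 0 is a valid starting point.

```python
import sqlite3


def delete_by_pk_ranges(conn, target, segment_size=2000):
    """Walk the primary key in fixed-size segments and delete matching rows in each."""
    last_id = 0  # assumes ids are positive
    deleted = 0
    while True:
        # Bounds of the next segment: the next segment_size ids after last_id.
        lo, hi = conn.execute(
            "select min(id), max(id) from "
            "(select id from a where id > ? order by id limit ?)",
            (last_id, segment_size),
        ).fetchone()
        if hi is None:  # no ids left beyond last_id
            return deleted
        cur = conn.execute(
            "delete from a where xxid = ? and id >= ? and id <= ?",
            (target, lo, hi),
        )
        conn.commit()
        deleted += cur.rowcount
        last_id = hi  # the next segment starts strictly after this id


# Demo data: 10,000 rows, every third one carrying the value to purge.
conn = sqlite3.connect(":memory:")
conn.execute("create table a (id integer primary key, xxid text)")
conn.executemany(
    "insert into a (id, xxid) values (?, ?)",
    [(i, "xxx" if i % 3 == 0 else "keep") for i in range(1, 10001)],
)
conn.commit()

deleted = delete_by_pk_ranges(conn, "xxx", segment_size=2000)
remaining = conn.execute("select count(*) from a").fetchone()[0]
print(deleted, remaining)
```

Because the segment bounds come from the primary key itself, gaps in the id sequence do not matter: every segment holds at most segment_size existing rows.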

Python script to automate the process

The script tracks the last processed id in a file, fetches the next min_id / max_id range, deletes the matching rows, updates the file, and optionally sleeps to control the deletion rate.

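import os
import time
import traceback

import pymysql

# Connection settings. These values are placeholders (the original does not
# give them); adjust them for your environment.
IP = '127.0.0.1'
PORT = 3306
USER = 'root'

def get_current_max_id():
    """Return the current maximum id in table a."""
    get_max_id = "select max(a.id) max_id from a"
    mydb = None
    try:
        mydb = pymysql.connect(host=IP, port=int(PORT), user=USER, read_timeout=5, write_timeout=5, charset='utf8', database='test', autocommit=True)
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        cursor.execute(get_max_id)
        data = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        exit(1)
    finally:
        # Close only if the connection was actually opened.
        if mydb is not None:
            mydb.close()
    print("we get max id of table : %s" % data[0]['max_id'])
    return data[0]['max_id']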

def get_min_max_id(min_id):
    """Return the min and max id of the next segment after <min_id>, or (None, None) if no ids remain."""
    get_ids = """select min(a.id) min_id, max(a.id) max_id from (select id from a where id>%s order by id limit 2000) a"""
    mydb = None
    try:
        mydb = pymysql.connect(host=IP, port=int(PORT), user=USER, read_timeout=5, write_timeout=5, charset='utf8', database='test', autocommit=True)
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        # Pass the id as a query parameter instead of formatting it into the SQL.
        cursor.execute(get_ids, (min_id,))
        data = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        exit(1)
    finally:
        if mydb is not None:
            mydb.close()
    return data[0]['min_id'], data[0]['max_id']

def del_tokens(min_id, max_id):
    """Delete rows in the given id range where xxid matches the target value; return the number deleted."""
    del_token = """delete from a where xxid='xxx' and id>=%s and id<=%s"""
    mydb = None
    try:
        mydb = pymysql.connect(host=IP, port=int(PORT), user=USER, read_timeout=5, write_timeout=5, charset='utf8', database='test', autocommit=True)
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        # execute() returns the number of affected rows.
        rows = cursor.execute(del_token, (min_id, max_id))
    except Exception:
        print(traceback.format_exc())
        exit(1)
    finally:
        if mydb is not None:
            mydb.close()
    return rows

def get_last_del_id(file_name):
    """Read the last processed id from the progress file; exit if it is missing or invalid."""
    if not os.path.exists(file_name):
        print("{file} does not exist, exit.".format(file=file_name))
        exit(-1)
    with open(file_name, 'r') as fh:
        del_id = fh.readline().strip()
    if not del_id.isdigit():
        print("it is '{delid}', not a number, exit".format(delid=del_id))
        exit(-1)
    return int(del_id)

def main():
    file_name = '/tmp/del_aid.id'
    rows_deleted = 0
    maxid = get_current_max_id()
    init_id = get_last_del_id(file_name)
    while True:
        min_id, max_id = get_min_max_id(init_id)
        # Stop when no ids remain, or when the segment extends past the max id
        # seen at startup (that trailing segment is left untouched).
        if max_id is None or max_id > maxid:
            with open(file_name, 'w') as f:
                f.write(str(init_id))
            print("delete end at : {end_id}".format(end_id=init_id))
            exit(0)
        rows = del_tokens(min_id, max_id)
        init_id = max_id
        rows_deleted += rows
        print("delete at %d, and we have deleted %d rows" % (max_id, rows_deleted))
        # Record the last fully processed id so a restart resumes after it.
        with open(file_name, 'w') as f:
            f.write(str(init_id))
        time.sleep(0.3)  # control deletion speed

if __name__ == '__main__':
    main()

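The script records the last processed id in /tmp/del_aid.id so it can resume after interruptions. The file must exist before the first run: initialise it with 0, or with a value just below the smallest relevant primary key, because each segment is selected from ids strictly greater than the recorded value.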
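One possible refinement, not part of the original script: opening the progress file with mode 'w' truncates it before the new id is written, so a crash at the wrong moment could leave it empty. A minimal sketch of an atomic alternative writes to a temporary file and renames it over the old one. The helper names save_checkpoint and load_checkpoint are hypothetical.

```python
import os
import tempfile


def save_checkpoint(path, last_id):
    """Atomically record last_id in path (write a temp file, then rename it)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(str(last_id))
            fh.flush()
            os.fsync(fh.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic when both paths are on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=0):
    """Return the recorded id, or default if the file does not exist yet."""
    try:
        with open(path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return default
    if not text.isdigit():
        raise ValueError("checkpoint %r is not a number: %r" % (path, text))
    return int(text)
```

With this variant a reader of the file sees either the previous id or the new one, never a partially written value.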

Further discussion

Readers are invited to suggest faster deletion ideas, ignoring replica lag, and to share their own strategies.


