How to Efficiently Delete Hundreds of Millions of Rows from a Large MySQL Table Without Indexes
When about 2 billion of the 4.7 billion rows in a 370 GB MySQL table must be purged by a specific xxid value, and no index can be added, a divide-and-conquer strategy that walks primary-key ranges and batches the deletes from a Python script offers a safe, space-saving solution.
Business requirement: Table a holds roughly 4.7 billion rows (≈370 GB). About 2 billion rows have xxid='xxx' and must be removed, but disk space is tight and creating a new index is not feasible.
Solution 1 – When an index exists
If the xxid column is indexed, a simple loop can delete rows in small batches:
delete from a where xxid='xxx' limit 500;
The loop repeats until a batch affects no rows.
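The loop for the indexed case can be sketched as follows. This is a minimal illustration, not the author's code: the function name delete_in_batches and its parameters are hypothetical, and it assumes a pymysql-style cursor whose execute() returns the number of affected rows on an autocommit connection, so that each batch commits on its own.

```python
import time


def delete_in_batches(cursor, xxid, batch_size=500, pause=0.0):
    """Repeat a LIMITed delete until a batch affects no rows.

    `cursor` is assumed to behave like a pymysql cursor: execute()
    returns the number of rows affected by the statement.
    Returns the total number of rows deleted.
    """
    sql = "delete from a where xxid=%s limit %s"
    total = 0
    while True:
        affected = cursor.execute(sql, (xxid, batch_size))
        if not affected:
            # An empty batch means no matching rows are left.
            break
        total += affected
        if pause:
            # Optional pause between batches to limit load.
            time.sleep(pause)
    return total
```

Keeping each batch small keeps each transaction short, which is why the statement is repeated rather than issued once for all matching rows.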
Solution 2 – When xxid has no index
Because adding an index would enlarge the table, we delete by primary-key ranges instead. The table is processed in segments (e.g., 1 000–2 000 rows per segment). Within each segment we delete the rows matching the condition, so each statement reads only that segment through the primary key rather than scanning the whole table.
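select min(a.id) min_id, max(a.id) max_id from (select id from a where id > {init_id} order by id limit 2000) a;
delete from a where xxid='xxx' and id >= min_id and id <= max_id;
After each batch, init_id is set to max_id, so the next segment starts just after the previous one, and the process repeats until the end of the table.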
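To make the segment walk concrete, here is a small self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory table as a stand-in for MySQL, purely so the example runs anywhere; the function names next_segment, purge and demo, and the sample data, are assumptions for illustration only.

```python
import sqlite3


def next_segment(conn, last_id, size):
    """Return (min_id, max_id) of the next `size` ids after last_id.

    Returns (None, None) when no ids remain past last_id.
    """
    row = conn.execute(
        "select min(id), max(id) from "
        "(select id from a where id > ? order by id limit ?)",
        (last_id, size),
    ).fetchone()
    return row[0], row[1]


def purge(conn, xxid, size=1000):
    """Walk table a in primary-key order, deleting matches per segment."""
    last_id, deleted = 0, 0  # assumes ids are positive
    while True:
        min_id, max_id = next_segment(conn, last_id, size)
        if max_id is None:
            return deleted
        cur = conn.execute(
            "delete from a where xxid = ? and id >= ? and id <= ?",
            (xxid, min_id, max_id),
        )
        conn.commit()  # one short transaction per segment
        deleted += cur.rowcount
        last_id = max_id  # the next segment starts strictly after this id


def demo():
    """Build a 10 000-row table where every third row matches, then purge."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table a (id integer primary key, xxid text)")
    conn.executemany(
        "insert into a (id, xxid) values (?, ?)",
        [(i, "xxx" if i % 3 == 0 else "keep") for i in range(1, 10001)],
    )
    deleted = purge(conn, "xxx")
    left = conn.execute("select count(*) from a where xxid = 'xxx'").fetchone()[0]
    total = conn.execute("select count(*) from a").fetchone()[0]
    conn.close()
    return deleted, left, total


if __name__ == "__main__":
    print(demo())
```

Because each segment is defined by "the next N ids after the last one", gaps in the id sequence do not produce empty or oversized segments, which a fixed-width id window would.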
Python script to automate the process
The script tracks the last processed id in a file, fetches the next min_id / max_id range, deletes the matching rows, updates the file, and optionally sleeps to control the deletion rate.
import os
import sys
import time
import traceback

import pymysql

# Connection settings. These are placeholders; replace them with real values.
IP = '127.0.0.1'
PORT = 3306
USER = 'root'


def connect():
    """Open a short-lived autocommit connection to the `test` database."""
    return pymysql.connect(host=IP, port=int(PORT), user=USER, read_timeout=5,
                           write_timeout=5, charset='utf8', database='test',
                           autocommit=True)


def get_current_max_id():
    """Return the current maximum id in table a."""
    get_max_id = "select max(a.id) max_id from a"
    mydb = None
    try:
        mydb = connect()
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        cursor.execute(get_max_id)
        data = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        sys.exit(1)
    finally:
        if mydb is not None:
            mydb.close()
    print("we get max id of table : %s" % data[0]['max_id'])
    return data[0]['max_id']


def get_min_max_id(min_id):
    """Return the min and max id of the next segment starting after <min_id>.

    Both values are None when no rows remain after <min_id>.
    """
    get_ids = """select min(a.id) min_id, max(a.id) max_id
                 from (select id from a where id>%s order by id limit 2000) a"""
    mydb = None
    try:
        mydb = connect()
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        cursor.execute(get_ids, (min_id,))
        data = cursor.fetchall()
    except Exception:
        print(traceback.format_exc())
        sys.exit(1)
    finally:
        if mydb is not None:
            mydb.close()
    return data[0]['min_id'], data[0]['max_id']


def del_tokens(min_id, max_id):
    """Delete rows in the given id range where xxid matches the target value."""
    del_token = """delete from a where xxid='xxx' and id>=%s and id<=%s"""
    mydb = None
    try:
        mydb = connect()
        cursor = mydb.cursor(pymysql.cursors.DictCursor)
        # execute() returns the number of rows affected by the delete.
        rows = cursor.execute(del_token, (min_id, max_id))
    except Exception:
        print(traceback.format_exc())
        sys.exit(1)
    finally:
        if mydb is not None:
            mydb.close()
    return rows


def get_last_del_id(file_name):
    """Read the last processed id from the progress file."""
    if not os.path.exists(file_name):
        print("{file} does not exist, exit.".format(file=file_name))
        sys.exit(1)
    with open(file_name, 'r') as fh:
        del_id = fh.readline().strip()
    if not del_id.isdigit():
        print("it is '{delid}', not a num, exit".format(delid=del_id))
        sys.exit(1)
    return int(del_id)


def save_last_del_id(file_name, del_id):
    """Persist the last fully processed id so a restart resumes after it."""
    with open(file_name, 'w') as f:
        f.write(str(del_id))


def main():
    file_name = '/tmp/del_aid.id'
    rows_deleted = 0
    maxid = get_current_max_id()
    init_id = get_last_del_id(file_name)
    while True:
        min_id, max_id = get_min_max_id(init_id)
        # Stop when no ids remain, or when the segment reaches past the max id
        # seen at startup; that last segment is left for a later run.
        if max_id is None or max_id > maxid:
            save_last_del_id(file_name, init_id)
            print("delete end at : {end_id}".format(end_id=init_id))
            sys.exit(0)
        rows = del_tokens(min_id, max_id)
        init_id = max_id
        rows_deleted += rows
        save_last_del_id(file_name, init_id)
        print("delete at %d, and we have deleted %d rows" % (max_id, rows_deleted))
        time.sleep(0.3)  # control deletion speed


if __name__ == '__main__':
    main()

The script records the last processed id in /tmp/del_aid.id so it can resume after interruptions. To start the process, initialise this file with 0 or with a value just below the smallest relevant primary key, because each segment begins strictly after the recorded id.
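Since the progress file is what makes the job resumable, it may be worth writing it so that an interruption in the middle of a write cannot leave it empty or truncated. The following is a hedged sketch of one common approach (write to a temporary file, then rename over the target); the function names save_checkpoint and load_checkpoint are hypothetical and this is not part of the original script.

```python
import os
import tempfile


def save_checkpoint(path, last_id):
    """Write last_id to `path` via a temp file and an atomic rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        fh.write(str(last_id))
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default=0):
    """Return the stored id, or `default` if the file does not exist yet."""
    try:
        with open(path) as fh:
            text = fh.read().strip()
    except FileNotFoundError:
        return default
    if not text.isdigit():
        raise ValueError("checkpoint %r is not a number: %r" % (path, text))
    return int(text)
```

A reader of the file then sees either the previous id or the new one, never a partial write. Falling back to a default when the file is missing is a design choice; exiting instead is the more cautious option if starting from the beginning by accident would be costly.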
Further discussion
Readers are invited to suggest faster deletion ideas, ignoring replica lag, and to share their own strategies.
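As a starting point for that discussion, a rough lower bound on the run time follows from the figures above: about 4.7 billion rows, 2000 rows per segment, and a 0.3 second pause after each segment. The sketch below only counts the pauses and ignores query time, so the real duration would be longer; the function name pause_floor is hypothetical.

```python
import math


def pause_floor(total_rows, segment_rows, pause_seconds):
    """Return (segment count, days spent sleeping) for a full pass."""
    segments = math.ceil(total_rows / segment_rows)
    days = segments * pause_seconds / 86400
    return segments, days


segments, days = pause_floor(4_700_000_000, 2000, 0.3)
print(f"{segments} segments, about {days:.2f} days of sleep")
# prints: 2350000 segments, about 8.16 days of sleep
```

Under these assumptions, segment size and pause length are the two main levers: larger segments or a shorter pause reduce the total time, at the cost of heavier individual statements and more load on the server.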
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"