How pg_recovery Retrieves Deleted Rows in PostgreSQL: Design & Code Walkthrough
This article explains the implementation principles, design ideas, and source‑code analysis of PostgreSQL's pg_recovery tool, detailing how it reads dead tuples, controls memory, and returns data through a custom function with configurable parameters.
Author: Zhang Lianzhuang, Qingyun Technology PostgreSQL R&D Engineer
In the previous issue we introduced the PostgreSQL data recovery tool pg_recovery . This article explains the implementation principle, design ideas, and provides a source‑code walkthrough of the pg_recovery tool.
Data Recovery Implementation Principle
A normal data read in PostgreSQL starts with a query such as select * from pg_recovery, which creates a transaction snapshot. The function GetActiveSnapshot() reveals the currently visible data.
Design Idea
How to Read Dead Tuples?
PostgreSQL uses snapshots to decide data visibility; when a row is deleted, its physical tuple remains as a dead tuple. The special SnapshotAny snapshot can read any tuple, and pg_recovery uses this to read all data, returning only recovery rows by default.
How Many Rows Does the Function Return?
Data is returned row by row, one row per call.
How to Control Memory?
The function may be invoked multiple times; global state can be managed using the multi_call_memory_ctx memory‑context parameter.
Function Parameters
Creating the function via SQL:
CREATE FUNCTION pg_recovery(regclass, recoveryrow bool DEFAULT true) RETURNS SETOF recordregclass : PostgreSQL table type that automatically converts a table name to its OID.
recoveryrow bool DEFAULT true : When true, only recovery rows are returned; false returns all rows.
Example query to change the default parameter:
select * from pg_recovery('aa', recoveryrow => false)Source Code Analysis
Necessary Data
typedef struct {
Relation rel; -- current table
TupleDesc reltupledesc; -- table metadata
TupleConversionMap *map; -- column mapping for custom output
TableScanDesc scan; -- table scan
HTAB *active_ctid; -- visible data ctid
bool droppedcolumn; -- column dropped flag
} pg_recovery_ctx;Hidden Columns
A hidden column recoveryrow is added so that when all information is returned, this column indicates whether a row is a recovery row or a user‑visible row.
static const struct system_columns_t {
char *attname;
Oid atttypid;
int32 atttypmod;
int attnum;
} system_columns[] = {
{ "ctid", TIDOID, -1, SelfItemPointerAttributeNumber },
{ "xmin", XIDOID, -1, MinTransactionIdAttributeNumber },
{ "cmin", CIDOID, -1, MinCommandIdAttributeNumber },
{ "xmax", XIDOID, -1, MaxTransactionIdAttributeNumber },
{ "cmax", CIDOID, -1, MaxCommandIdAttributeNumber },
{ "tableoid", OIDOID, -1, TableOidAttributeNumber },
{ "recoveryrow", BOOLOID, -1, DeadFakeAttributeNumber },
{ 0 }
};pg_recovery Simplified Code
Datum
pg_recovery(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
pg_recovery_ctx *usr_ctx;
recoveryrow = PG_GETARG_BOOL(1); -- get default argument
if (SRF_IS_FIRSTCALL())
{
funcctx = SRF_FIRSTCALL_INIT();
oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
usr_ctx->rel = heap_open(relid, AccessShareLock);
usr_ctx->reltupledesc = RelationGetDescr(usr_ctx->rel);
funcctx->tuple_desc = BlessTupleDesc(tupdesc);
usr_ctx->map = recovery_convert_tuples_by_name(usr_ctx->reltupledesc,
funcctx->tuple_desc, "Error converting tuple descriptors!", &usr_ctx->droppedcolumn);
usr_ctx->scan = heap_beginscan(usr_ctx->rel, SnapshotAny, 0, NULL, NULL, 0);
active_scan = heap_beginscan(usr_ctx->rel, GetActiveSnapshot(), 0, NULL, NULL, 0);
while ((tuplein = heap_getnext(active_scan, ForwardScanDirection)) != NULL)
hash_search(usr_ctx->active_ctid, (void*)&tuplein->t_self, HASH_ENTER, NULL);
}
funcctx = SRF_PERCALL_SETUP();
usr_ctx = (pg_recovery_ctx *) funcctx->user_fctx;
get_tuple:
if ((tuplein = heap_getnext(usr_ctx->scan, ForwardScanDirection)) != NULL)
{
hash_search(usr_ctx->active_ctid, (void*)&tuplein->t_self, HASH_FIND, &alive);
tuplein = recovery_do_convert_tuple(tuplein, usr_ctx->map, alive);
SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuplein));
}
else
{
heap_endscan(usr_ctx->scan);
heap_close(usr_ctx->rel, AccessShareLock);
SRF_RETURN_DONE(funcctx);
}
}Generate Mapping Table
TupleConversionMap *
recovery_convert_tuples_by_name(TupleDesc indesc,
TupleDesc outdesc,
const char *msg, bool *droppedcolumn)
{
attrMap = recovery_convert_tuples_by_name_map(indesc, outdesc, msg, droppedcolumn);
map->indesc = indesc;
map->outdesc = outdesc;
map->attrMap = attrMap;
map->outvalues = (Datum *) palloc(n * sizeof(Datum));
map->outisnull = (bool *) palloc(n * sizeof(bool));
map->invalues = (Datum *) palloc(n * sizeof(Datum));
map->inisnull = (bool *) palloc(n * sizeof(bool));
map->invalues[0] = (Datum) 0;
map->inisnull[0] = true;
return map;
}Tuple Conversion Function
HeapTuple
recovery_do_convert_tuple(HeapTuple tuple, TupleConversionMap *map, bool alive)
{
heap_deform_tuple(tuple, map->indesc, invalues + 1, inisnull + 1);
for (i = 0; i < outnatts; i++)
{
outvalues[i] = invalues[j];
outisnull[i] = inisnull[j];
}
return heap_form_tuple(map->outdesc, outvalues, outisnull);
}Qingyun Technology Community
Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
