Databases 7 min read

How pg_recovery Retrieves Deleted Rows in PostgreSQL: Design & Code Walkthrough

This article explains the implementation principles, design ideas, and source‑code analysis of PostgreSQL's pg_recovery tool, detailing how it reads dead tuples, controls memory, and returns data through a custom function with configurable parameters.

Qingyun Technology Community
Qingyun Technology Community
Qingyun Technology Community
How pg_recovery Retrieves Deleted Rows in PostgreSQL: Design & Code Walkthrough

Author: Zhang Lianzhuang, Qingyun Technology PostgreSQL R&D Engineer

In the previous issue we introduced the PostgreSQL data recovery tool pg_recovery . This article explains the implementation principle, design ideas, and provides a source‑code walkthrough of the pg_recovery tool.

Data Recovery Implementation Principle

A normal data read in PostgreSQL starts with a query such as select * from pg_recovery, which creates a transaction snapshot. The function GetActiveSnapshot() reveals the currently visible data.

Design Idea

How to Read Dead Tuples?

PostgreSQL uses snapshots to decide data visibility; when a row is deleted, its physical tuple remains as a dead tuple. The special SnapshotAny snapshot can read any tuple, and pg_recovery uses this to read all data, returning only recovery rows by default.

How Many Rows Does the Function Return?

Data is returned row by row, one row per call.

How to Control Memory?

The function may be invoked multiple times; global state can be managed using the multi_call_memory_ctx memory‑context parameter.

Function Parameters

Creating the function via SQL:

CREATE FUNCTION pg_recovery(regclass, recoveryrow bool DEFAULT true) RETURNS SETOF record

regclass : PostgreSQL table type that automatically converts a table name to its OID.

recoveryrow bool DEFAULT true : When true, only recovery rows are returned; false returns all rows.

Example query to change the default parameter:

select * from pg_recovery('aa', recoveryrow => false)

Source Code Analysis

Necessary Data

typedef struct {
    Relation            rel;            -- current table
    TupleDesc           reltupledesc;   -- table metadata
    TupleConversionMap *map;            -- column mapping for custom output
    TableScanDesc       scan;           -- table scan
    HTAB               *active_ctid;   -- visible data ctid
    bool                droppedcolumn;  -- column dropped flag
} pg_recovery_ctx;

Hidden Columns

A hidden column recoveryrow is added so that when all information is returned, this column indicates whether a row is a recovery row or a user‑visible row.

static const struct system_columns_t {
    char *attname;
    Oid   atttypid;
    int32 atttypmod;
    int   attnum;
} system_columns[] = {
    { "ctid",       TIDOID,  -1, SelfItemPointerAttributeNumber },
    { "xmin",       XIDOID,  -1, MinTransactionIdAttributeNumber },
    { "cmin",       CIDOID,  -1, MinCommandIdAttributeNumber },
    { "xmax",       XIDOID,  -1, MaxTransactionIdAttributeNumber },
    { "cmax",       CIDOID,  -1, MaxCommandIdAttributeNumber },
    { "tableoid",   OIDOID,  -1, TableOidAttributeNumber },
    { "recoveryrow", BOOLOID, -1, DeadFakeAttributeNumber },
    { 0 }
};

pg_recovery Simplified Code

Datum
pg_recovery(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;
    pg_recovery_ctx *usr_ctx;

    recoveryrow = PG_GETARG_BOOL(1); -- get default argument

    if (SRF_IS_FIRSTCALL())
    {
        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        usr_ctx->rel = heap_open(relid, AccessShareLock);
        usr_ctx->reltupledesc = RelationGetDescr(usr_ctx->rel);
        funcctx->tuple_desc = BlessTupleDesc(tupdesc);
        usr_ctx->map = recovery_convert_tuples_by_name(usr_ctx->reltupledesc,
                funcctx->tuple_desc, "Error converting tuple descriptors!", &usr_ctx->droppedcolumn);
        usr_ctx->scan = heap_beginscan(usr_ctx->rel, SnapshotAny, 0, NULL, NULL, 0);
        active_scan = heap_beginscan(usr_ctx->rel, GetActiveSnapshot(), 0, NULL, NULL, 0);
        while ((tuplein = heap_getnext(active_scan, ForwardScanDirection)) != NULL)
            hash_search(usr_ctx->active_ctid, (void*)&tuplein->t_self, HASH_ENTER, NULL);
    }

    funcctx = SRF_PERCALL_SETUP();
    usr_ctx = (pg_recovery_ctx *) funcctx->user_fctx;

get_tuple:
    if ((tuplein = heap_getnext(usr_ctx->scan, ForwardScanDirection)) != NULL)
    {
        hash_search(usr_ctx->active_ctid, (void*)&tuplein->t_self, HASH_FIND, &alive);
        tuplein = recovery_do_convert_tuple(tuplein, usr_ctx->map, alive);
        SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuplein));
    }
    else
    {
        heap_endscan(usr_ctx->scan);
        heap_close(usr_ctx->rel, AccessShareLock);
        SRF_RETURN_DONE(funcctx);
    }
}

Generate Mapping Table

TupleConversionMap *
recovery_convert_tuples_by_name(TupleDesc indesc,
                                TupleDesc outdesc,
                                const char *msg, bool *droppedcolumn)
{
    attrMap = recovery_convert_tuples_by_name_map(indesc, outdesc, msg, droppedcolumn);
    map->indesc = indesc;
    map->outdesc = outdesc;
    map->attrMap = attrMap;
    map->outvalues = (Datum *) palloc(n * sizeof(Datum));
    map->outisnull = (bool *) palloc(n * sizeof(bool));
    map->invalues = (Datum *) palloc(n * sizeof(Datum));
    map->inisnull = (bool *) palloc(n * sizeof(bool));
    map->invalues[0] = (Datum) 0;
    map->inisnull[0] = true;
    return map;
}

Tuple Conversion Function

HeapTuple
recovery_do_convert_tuple(HeapTuple tuple, TupleConversionMap *map, bool alive)
{
    heap_deform_tuple(tuple, map->indesc, invalues + 1, inisnull + 1);
    for (i = 0; i < outnatts; i++)
    {
        outvalues[i] = invalues[j];
        outisnull[i] = inisnull[j];
    }
    return heap_form_tuple(map->outdesc, outvalues, outisnull);
}
PostgreSQLsnapshotDatabase Recoverydead tuplesPG recovery
Qingyun Technology Community
Written by

Qingyun Technology Community

Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.