Why do I get an invalid checkpoint record during stolon restore? #927

@voipp

I have a stolon cluster that works well.
It has 3 active sentinels, proxies, and keepers. stolonctl status shows that everything is healthy.
In my pgParameters I have archive_command: wal-g wal-push %p
The minio env variables are set in the keeper container.
I take backups with wal-g backup-push, then check them in minio with wal-g backup-list, and I can see my backups. I also trigger WAL archiving with SELECT pg_switch_wal()

After some hours I decided to restore to that state:

stolonctl init -y '{"initMode":"pitr","pitrConfig":{"dataRestoreCommand":"wal-g backup-fetch \"%d\" LATEST","archiveRecoverySettings":{"restoreCommand": "wal-g wal-fetch \"%f\" \"%p\"" }}}'
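
For readability, the same cluster spec can be kept in a file, which makes the nested quoting around %d, %f and %p easier to review. This is only a sketch: SPEC_FILE is a hypothetical path chosen for illustration, and the final step assumes that stolonctl init accepts a spec file through its -f/--file flag, so it is left commented out.

```shell
# Hypothetical location for the spec file; not part of the original setup.
SPEC_FILE="${SPEC_FILE:-/tmp/stolon-pitr-spec.json}"

# Quoted heredoc delimiter: the shell leaves backslashes and % placeholders untouched,
# so the file contains exactly the JSON passed inline in the command above.
cat > "$SPEC_FILE" <<'EOF'
{
  "initMode": "pitr",
  "pitrConfig": {
    "dataRestoreCommand": "wal-g backup-fetch \"%d\" LATEST",
    "archiveRecoverySettings": {
      "restoreCommand": "wal-g wal-fetch \"%f\" \"%p\""
    }
  }
}
EOF

# Print the spec back so it can be checked before it is applied.
cat "$SPEC_FILE"

# Assumed next step (not run here): pass the file instead of the inline JSON.
# stolonctl init -y -f "$SPEC_FILE"
```

The content is identical to the inline JSON; only the layout changes, so it does not by itself alter the restore behaviour.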

After that, my database stops with the following error:

2025-08-18 16:56:24.099 UTC [2359] LOG:  database system was interrupted; last known up at 2025-08-18 15:31:08 UTC
2025-08-18 16:56:24.100 UTC [2359] LOG:  creating missing WAL directory "pg_wal/archive_status"
2025-08-18 16:56:24.234 UTC [2359] LOG:  restored log file "00000002.history" from archive
2025-08-18 16:56:24.299 UTC [2359] LOG:  restored log file "00000003.history" from archive
ERROR: 2025/08/18 16:56:24.389385 Archive '00000004.history' does not exist.
2025-08-18 16:56:24.390 UTC [2359] LOG:  starting archive recovery
2025-08-18 16:56:24.409 UTC [2359] LOG:  restored log file "00000003.history" from archive
2025-08-18 16:56:24.820 UTC [2359] LOG:  restored log file "000000010000000000000005" from archive
2025-08-18 16:56:24.887 UTC [2359] LOG:  invalid checkpoint record
2025-08-18 16:56:24.887 UTC [2359] FATAL:  could not locate required checkpoint record
2025-08-18 16:56:24.887 UTC [2359] HINT:  If you are restoring from a backup, touch "/var/lib/postgresql/data/postgres/recovery.signal" and add required recovery options.
        If you are not restoring from a backup, try removing the file "/var/lib/postgresql/data/postgres/backup_label".
        Be careful: removing "/var/lib/postgresql/data/postgres/backup_label" will result in a corrupt cluster if restoring from a backup.
2025-08-18 16:56:24.888 UTC [2357] LOG:  startup process (PID 2359) exited with exit code 1
2025-08-18 16:56:24.888 UTC [2357] LOG:  aborting startup due to startup process failure
2025-08-18 16:56:24.891 UTC [2357] LOG:  database system is shut down

Why the hell does the restore go wrong?