Your Backups Are Useless If You've Never Tested a Restore

You have backups. Great. They run every night. The cron job hasn't sent you an error email in months. Everything's green in your monitoring dashboard.

Now answer this: when was the last time you actually restored one?

If the answer is "never" or "I'm not sure," your backups are a hypothesis. You're assuming they work because nothing has told you otherwise. That's not a backup strategy. That's optimism.

Industry numbers back this up. Studies consistently show that a significant percentage of restore attempts fail when they're actually needed, with some reports putting the failure rate as high as 50%. Not because the backup tool is broken, but because nobody tested the full restore path until the moment it mattered.

Why backups fail at restore time

The backup completed successfully. The log says so. But "backup completed" and "data is recoverable" are not the same statement.

Here's what actually goes wrong.

Corruption you didn't detect

Your backup tool wrote the files. It didn't verify them. Disk-level corruption, interrupted writes, or storage hardware degradation can produce backup files that look fine on the surface and fail silently during restore.

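Database backups are especially vulnerable. A logical dump can complete without error and still be inconsistent or incomplete if tables were being written to during the dump and it wasn't taken from a single consistent snapshot (for example, mariadb-dump without --single-transaction, or any dump of non-transactional tables). You won't know until you try to import it and find that your orders table is missing three hours of transactions.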
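One way to catch files that change or rot after the backup tool has written them is a checksum manifest. This is a minimal sketch, assuming GNU coreutils; write_manifest, verify_manifest, and the MANIFEST.sha256 filename are names made up for illustration. It only proves the files are the same bytes that were written. It says nothing about whether the dump was logically complete in the first place.

```shell
# Hypothetical helpers for illustration; the manifest filename is an assumption.

# Record a SHA-256 checksum for every file in a backup directory.
write_manifest() {
    dir=$1
    ( cd "$dir" && find . -type f ! -name MANIFEST.sha256 \
        -exec sha256sum {} + > MANIFEST.sha256 )
}

# Re-read every file and compare it against the recorded checksum.
# Returns non-zero if any file is missing or its content has changed.
verify_manifest() {
    dir=$1
    ( cd "$dir" && sha256sum --check --quiet MANIFEST.sha256 )
}
```

Run write_manifest right after the backup finishes, and verify_manifest on the copy you are about to restore from (including the offsite one).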

The physical backup preparation trap

If you're using mariabackup or xtrabackup for physical database backups, there's a critical step that trips people up: the --prepare phase.

Physical backups capture InnoDB data files at different points in time during the backup process. The files are internally inconsistent until you run --prepare, which replays the transaction log and brings everything to a consistent state.

Skip this step and try to restore? InnoDB will crash on startup to protect itself from data corruption. This isn't a bug. It's by design. But if you've never done a test restore, you might not know this step exists until 3 AM when your production database just died.

# Take the backup
mariabackup --backup --target-dir=/backup/full

# Prepare it (this is the step people forget)
mariabackup --prepare --target-dir=/backup/full

# Only now can you safely restore.
# --copy-back expects MariaDB to be stopped and the data directory to be empty.
mariabackup --copy-back --target-dir=/backup/full
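A restore script can refuse to continue if the backup still looks unprepared. The sketch below rests on an assumption you should verify against your own backups: that mariabackup leaves a small metadata file in the target directory (xtrabackup_checkpoints, or mariadb_backup_checkpoints on newer releases) containing a line like "backup_type = full-backuped" until --prepare has run. needs_prepare is a hypothetical helper name.

```shell
# Assumption: the metadata file names and the "full-backuped" value below match
# what your mariabackup version writes. Check a real backup directory first.
needs_prepare() {
    dir=$1
    for f in "$dir/xtrabackup_checkpoints" "$dir/mariadb_backup_checkpoints"; do
        if [ -f "$f" ]; then
            # Succeeds (exit status 0) while the backup is still in its raw state.
            grep -Eq '^backup_type *= *full-backuped' "$f"
            return
        fi
    done
    # No metadata file found: treat the directory as not safe to restore.
    return 0
}
```

A restore wrapper could then start with something like: needs_prepare /backup/full && mariabackup --prepare --target-dir=/backup/full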

Permissions and ownership

You restored the files. They're on disk. MariaDB won't start because the data directory is owned by root instead of mysql. Or the file permissions are 600 instead of 660. Or AppArmor is blocking access to the restored files.

These aren't hard to fix. But at 3 AM, with a production database down and customers unable to check out, debugging file permissions is not where you want to spend your time.
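Ownership problems are cheap to check for before you try to start the server. A minimal sketch, where check_owner is a hypothetical helper and the paths in the usage note are assumptions about a default install:

```shell
# check_owner DIR USER
# Prints every path under DIR that is NOT owned by USER and returns non-zero
# if it finds any. USER can be a login name or a numeric user ID.
check_owner() {
    dir=$1
    user=$2
    bad=$(find "$dir" ! -user "$user" -print)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
}
```

For example, assuming the data directory is /var/lib/mysql and the server runs as mysql: check_owner /var/lib/mysql mysql || chown -R mysql:mysql /var/lib/mysql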

Schema and version drift

Your backup is from MariaDB 10.11.6. Your new server is running 10.11.8. Most of the time, this is fine. Occasionally, though, even a point release changes an internal on-disk detail, such as the format of full-text search indexes or the way InnoDB handles tablespace metadata, and your restore fails with errors that look like corruption but are actually version incompatibility.
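A restore test can flag version drift up front instead of leaving you to guess from error messages. This sketch only compares two version strings. compare_versions is a hypothetical helper, and it assumes you record the source server's version somewhere at backup time (for example the output of SELECT VERSION() saved next to the backup).

```shell
# compare_versions A B
# Compares two MariaDB version strings such as "10.11.6-MariaDB" and
# "10.11.8-MariaDB-log". Prints one of: same, patch-differs, series-differs.
compare_versions() {
    a=${1%%-*}   # strip the suffix: "10.11.6-MariaDB" -> "10.11.6"
    b=${2%%-*}
    if [ "$a" = "$b" ]; then
        echo same
    elif [ "${a%.*}" = "${b%.*}" ]; then
        echo patch-differs    # same release series, different point release
    else
        echo series-differs
    fi
}
```

Treating patch-differs as a warning and series-differs as a hard failure is one reasonable policy. The right threshold depends on how you handle upgrades.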

How we test restores automatically

Testing restores manually is better than not testing at all, but it doesn't scale. If you have to remember to test, you'll forget. If it takes an hour of manual work, you'll put it off. The answer is automation.

Here's the pattern we use across client environments.

The setup

A scheduled job (cron or systemd timer) runs once a week. It spins up a temporary environment, restores from the latest backup, runs validation checks, and tears it down. The whole thing runs unattended. You only hear about it when something fails.

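#!/bin/bash
# restore-test.sh - Run weekly via cron
# Assumes it runs as root and that MariaDB runs as the mysql user.
set -euo pipefail

BACKUP_DIR="/backup/latest"
RESTORE_DIR="/tmp/restore-test-$(date +%Y%m%d)"
SOCKET="/tmp/mariadb-restore-test.sock"
LOG="/var/log/restore-test.log"
MARIADB_PID=""

# Log the failure, send an alert, and exit. With set -e, a bare failing
# command would end the script before any alert went out, so every step
# below routes its failure through this function.
fail() {
    echo "$(date): RESTORE TEST FAILED: $1" >> "$LOG"
    # Send alert (email, monitoring, whatever you use)
    echo "Restore test FAILED: $1" | mail -s "RESTORE TEST FAILED" ops@yourcompany.com
    exit 1
}

# Always runs on exit: shut the test instance down cleanly, then remove the copy.
# Killing the mariadbd-safe wrapper would leave mariadbd itself running.
cleanup() {
    if mariadb-admin -S "$SOCKET" shutdown >> "$LOG" 2>&1; then
        wait "$MARIADB_PID" 2>/dev/null || true
    fi
    rm -rf "$RESTORE_DIR"
}
trap cleanup EXIT

echo "$(date): Starting restore test" >> "$LOG"

# Create a temporary directory for the restore
mkdir -p "$RESTORE_DIR"

# Copy and prepare the backup, then hand the files to the mysql user
cp -r "$BACKUP_DIR" "$RESTORE_DIR/data" || fail "could not copy backup"
mariabackup --prepare --target-dir="$RESTORE_DIR/data" >> "$LOG" 2>&1 || fail "prepare failed"
chown -R mysql:mysql "$RESTORE_DIR/data" || fail "could not fix ownership"

# Start a temporary MariaDB instance on the restored data
mariadbd-safe --datadir="$RESTORE_DIR/data" --port=3307 --socket="$SOCKET" >> "$LOG" 2>&1 &
MARIADB_PID=$!

# Wait up to 60 seconds for the instance to accept connections
for _ in $(seq 1 60); do
    mariadb-admin -S "$SOCKET" ping > /dev/null 2>&1 && break
    sleep 1
done
mariadb-admin -S "$SOCKET" ping > /dev/null 2>&1 || fail "test instance did not start"

# Run validation queries: the restore must contain orders from the last 2 days
COUNT=$(mariadb -S "$SOCKET" -N -B -e "SELECT COUNT(*) FROM production_db.orders WHERE created_at > DATE_SUB(NOW(), INTERVAL 2 DAY);" 2>> "$LOG") || fail "validation query failed"
echo "$(date): Recent orders in restore: $COUNT" >> "$LOG"
[ "$COUNT" -gt 0 ] || fail "no recent orders in restored data"

echo "$(date): Restore test passed" >> "$LOG"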

This is a simplified version. In production, we add checks for specific table row counts (is the orders table within 1% of expected size?), schema validation (do all expected tables and indexes exist?), and timing (did the restore complete within the expected RTO window?).
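The "within 1% of expected size" check is plain integer arithmetic. A minimal sketch, where within_pct is a hypothetical helper and the baseline is whatever row count you last recorded from production:

```shell
# within_pct ACTUAL EXPECTED PCT
# Succeeds if ACTUAL is within PCT percent of EXPECTED (integers only).
within_pct() {
    actual=$1
    expected=$2
    pct=$3
    diff=$((actual - expected))
    if [ "$diff" -lt 0 ]; then
        diff=$((-diff))
    fi
    # diff / expected <= pct / 100  is the same as  diff * 100 <= expected * pct
    [ $((diff * 100)) -le $((expected * pct)) ]
}
```

With a baseline of 50,000 rows and a 1% tolerance, a restored count of 49,600 passes and 48,000 fails.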

What to validate

A successful restore isn't just "the database started." You need to verify:

Row counts. Compare critical table sizes against known baselines. If your users table should have 50,000 rows and the restore has 48,000, something went wrong.

Recent data. Check that data from the last backup window exists. If you back up at midnight and the newest order in the restore is from 6 PM, you lost six hours of data. That might be within your RPO, or it might not.

Schema integrity. Run mariadb-check --all-databases on the restored instance. Check for missing indexes, broken foreign keys, or table corruption.

Application-level validation. If possible, point a test instance of your application at the restored database and hit a few key endpoints. If the API returns 200 and the expected data, the restore is good. This catches problems that database-level checks miss (missing config, incompatible schema changes, etc.).
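For the schema check, comparing a list of expected tables against what the restored instance actually contains catches a table that silently went missing. A minimal sketch; missing_tables is a hypothetical helper, and both input files are assumed to hold one schema.table name per line.

```shell
# missing_tables EXPECTED_FILE ACTUAL_FILE
# Prints every name listed in EXPECTED_FILE that is absent from ACTUAL_FILE.
missing_tables() {
    exp_sorted=$(mktemp)
    act_sorted=$(mktemp)
    sort -u "$1" > "$exp_sorted"
    sort -u "$2" > "$act_sorted"
    # comm -23 keeps only the lines that are unique to the first file
    comm -23 "$exp_sorted" "$act_sorted"
    rm -f "$exp_sorted" "$act_sorted"
}
```

The actual list can come from the restored instance with a query against information_schema.tables, for example SELECT CONCAT(table_schema, '.', table_name) run through mariadb -N -B. Any output from missing_tables means the restore test should fail.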

RTO and RPO: the numbers that actually matter

These acronyms get thrown around a lot. Here's what they mean in practice and how to define yours honestly.

RPO (Recovery Point Objective): How much data can you afford to lose? If you back up every 12 hours, a failure just before the next run costs you up to 12 hours of data, so the RPO you can actually meet is no better than 12 hours. If that's unacceptable for your business (e-commerce orders, financial transactions, user-generated content), you need more frequent backups, binary log shipping, or replication.

RTO (Recovery Time Objective): How long can you be down? This isn't the time it takes to restore a backup. It's the time from "something broke" to "customers can use the product again." That includes detection time, diagnosis, the actual restore, application startup, DNS propagation, cache warming, and whatever else your specific stack needs.

Most teams dramatically underestimate their RTO. They think "20 minutes to restore the database" and forget the 10 minutes to detect the problem, 15 minutes to diagnose it, 5 minutes to start the application, and 10 minutes for health checks and cache warming. Your real RTO is probably 60-90 minutes, not 20.
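Adding up the phases makes the gap obvious. The figures below are the ones from the paragraph above; total_rto is a hypothetical helper.

```shell
# total_rto MINUTES...
# Sums the phases of an outage (in minutes) into an end-to-end estimate.
total_rto() {
    total=0
    for minutes in "$@"; do
        total=$((total + minutes))
    done
    echo "$total"
}

# detect, diagnose, restore, app start, health checks and cache warming
total_rto 10 15 20 5 10   # prints 60
```

That is already 60 minutes with nothing going wrong, three times the "20 minutes to restore" estimate.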

The only way to know your actual numbers is to test. Run a drill. Time it. Whatever number you get is your real RTO. The number in your incident response document is fiction until you've verified it.
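Timing a drill doesn't need tooling. A small wrapper around whatever command performs the restore is enough to start collecting real numbers. This is a sketch; timed is a hypothetical helper name.

```shell
# timed CMD [ARGS...]
# Runs the command, prints the elapsed wall-clock seconds to stderr,
# and returns the command's own exit status.
timed() {
    start=$(date +%s)
    status=0
    "$@" || status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s (exit status $status)" >&2
    return "$status"
}
```

For example: timed ./restore-test.sh. Remember that this measures only the restore phase, not detection or diagnosis.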

Our backup stack

For context, here's what we run across client environments. The specifics vary by client (different retention requirements, different compliance needs), but the building blocks are the same.

File backups: restic. It's fast, tiny, portable, handles encryption internally, and supports dozens of storage providers out of the box. We've used borg in the past and it's a fine tool, but restic's portability and built-in support for remote backends won out. It just works everywhere without fuss.

Database backups: mydumper for logical backups. It's multi-threaded, which matters when you're dumping databases with hundreds of tables and tens of gigabytes of data. For environments where we need physical backups (large databases where logical dumps take too long), we fall back to mariabackup (named mariadb-backup in newer releases). Most client environments also have a live secondary off-site for real-time replication, so the backup is never the only copy.

Backup frequency: Multiple times a day for most clients. The exact schedule depends on the data: an e-commerce database with constant writes gets backed up more frequently than a content site that changes once a day.

Offsite replication: Every backup goes to at least two locations. Our own backup infrastructure for fast restores, plus an external S3-compatible provider for disaster recovery. Which provider depends on the client's region and compliance requirements. If your backups live on the same server as your data, a disk failure takes both.

Automated restore testing: This varies by client and environment. Some setups have automated restore scripts running on a schedule, others get periodic manual drills. The point isn't any specific cadence. It's that restore testing happens at all, before you need it for real. Alerts go to our monitoring stack and email when something fails.

Retention: At least 14 dailies (or more, depending on backup frequency), with longer retention based on client and legal requirements. restic handles deduplication, so the storage cost of keeping weeks of restore points is minimal.

The checklist

If you take one thing from this post, do these:

Run a test restore this week. Pick your most critical database. Restore it to a temporary instance. See if it works. Time how long it takes. That's your real RTO.

Set up automated restore testing. Even a basic script that restores and checks row counts, running weekly, catches most of the problems that would bite you during an actual incident.

Verify your offsite copies. Your backup is on the same server? That's not a backup, that's a copy. Your backup is in the same data center? That's better, but a fire or a network failure takes both.

Check your RPO honestly. How old is the data in your most recent backup? Is that acceptable for your business? If not, increase your backup frequency or set up binary log replication.
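The RPO question in the checklist above can be checked mechanically by comparing the age of the newest backup file with the loss you are willing to accept. A minimal sketch, assuming GNU stat; backup_within_rpo is a hypothetical helper. File modification time reflects when the backup finished writing, so the data inside can be somewhat older than the age reported here.

```shell
# backup_within_rpo FILE MAX_AGE_SECONDS
# Succeeds if FILE was last modified no more than MAX_AGE_SECONDS ago.
backup_within_rpo() {
    file=$1
    max_age=$2
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat; BSD and macOS use: stat -f %m
    age=$((now - mtime))
    echo "backup age: ${age}s"
    [ "$age" -le "$max_age" ]
}
```

With a 12-hour RPO, the call would be backup_within_rpo against your newest backup file with 43200 as the limit, alerting when it fails.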

We handle this for clients

Backup infrastructure, restore testing, disaster recovery planning. This is a core part of what we do under our server management and monthly retainer packages. If you'd rather not think about this stuff and just know it's handled, talk to us. We'll audit your current backup setup in the first call and tell you where the gaps are.

Written by

Blendbyte

Blendbyte Team

We run what we write about. Production experience only, no theory.
