This is internal documentation. There is a good chance you’re looking for something else. See Disclaimer.

S3 Backups and Object Removal

Terminology

_nice_binary

Database table in Nice in which all objects are tracked.

Dead object

An object is considered dead once it has been removed in Nice. It may, however, still exist in S3.

Live object

An object is considered live as long as it is still used in any production or test installation of Nice.

Objects with _nice_binary.reference_count = 0 are considered alive even though they are unused by Nice. Reference counts are not fully reliable and should not be trusted blindly. See also the warning below.

N days

Time between the removal of an object from all installations belonging to a bucket and its removal in S3.

State DB

DB used to track when an object was last used by an installation.

Implementation

The implementation of this specification can be found in the tocco-s3 repository.

_nice_binary and Reference Counting

Table _nice_binary tracks objects in Nice:

Column           Type                      Comment

size             bigint

mime_type        character varying(255)

file_extension   character varying(255)

reference_count  integer                   Values:

                                           -1 = reference count uninitialized
                                            0 = object is unused
                                           >0 = object is used

                                           Do not rely on an object with a
                                           reference_count of zero being
                                           unused. See warning below.

created_at       timestamp with time zone  Time the object was created.

                                           Careful, created_at may be NULL.

hash             character varying(255)    SHA2 hash of object as
                                           lower-case hex.

Objects are deduplicated by means of reference counting: identical content is stored only once and reference_count tracks how often it is used. Once the counter reaches zero, an object is considered unused and its row is removed from _nice_binary. The corresponding S3 object, however, is not removed immediately.

There are several reasons for this behavior:

  1. Removal of S3 objects is slow. Doing so synchronously would lead to delays.

  2. In case a backup of the database needs to be restored, there is no need to restore a backup of the objects too.

  3. There is no way to create atomic and thus consistent backups, i.e. backups where the database and S3 storage are guaranteed to contain the same set of objects.

  4. Buckets are shared by all installations of a customer and there is no reliable way to determine whether an object is still in use by any other installation.

  5. Objects cannot be removed atomically as part of a DB transaction. An object would need to be removed after its removal was committed to the DB.

Warning

Reliability of reference_count

reference_count is not fully reliable and can reach zero while there are still active references. Do not rely on the counter to determine if an object is still used.

Only once an object has been removed from _nice_binary should you trust that it is no longer used (in the corresponding DB). Foreign key constraints prevent a row in _nice_binary from being removed while there are still active references.
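
For example, a consumer wanting to know whether an object is still used should test for the presence of the row itself. A minimal sketch using the postgres crate (illustrative, not the actual implementation):

use postgres::Client;

/// Returns true if the object is still referenced in this Nice DB.
/// Presence of the row is the reliable signal, not reference_count.
fn is_object_used(client: &mut Client, hash: &str) -> Result<bool, postgres::Error> {
    let row = client.query_one(
        "SELECT EXISTS(SELECT 1 FROM _nice_binary WHERE hash = $1)",
        &[&hash],
    )?;
    Ok(row.get(0))
}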

Object Removal

Only installations are taken into consideration when deciding whether an object is retained. More precisely, only DBs on the production and staging clusters are taken into consideration. Other users of these objects are not taken into account.

To avoid removing objects still needed by other users, a delay of N days is added between an object being removed in Nice and its removal from S3 (see graph below). This ensures that, for N days, an S3 object referenced by a DB backup, a development database or a DB stored elsewhere remains available.

digraph {
  label="Subjects accessing a single bucket"
  rankdir=LR

  subgraph cluster_inst {
      label="Installations"
      fontcolor=blue
      color=blue

      Prod
      Test
  }

  subgraph cluster_dev {
      label="Developers"

      Peter
      Zarah
      Arya
  }

  subgraph cluster_backup {
      label="Backup"

      backup [ label="Bucket backups" ]
      mirror [ label="Mirror", style="dotted" ]
  }

  subgraph cluster_restore {
      label="Restore"

      test_restore [ label="Restore of prod on test" ]
      prod_restore [ label="Restore of prod" ]
  }

  subgraph cluster_s3 {
      label="S3 storage"

      bucket_abc [ label="Bucket of\ncustomer \"abc\"" ]
  }

  { Peter Zarah Arya Prod Test test_restore prod_restore
    backup mirror } -> bucket_abc
}

There are broadly four categories of users that access a bucket:

Category       Description

Installations  At least one prod and one test system need access.

Developers     Developers have to be able to run Tocco locally with a copy
               of prod, test or a restore thereof from a backup.

Backup         Our backup tool needs to be able to sync all objects from a
               bucket and needs to be able to remove objects.

               Mirroring to a different bucket is not currently implemented
               but is listed here as a future option. A mirror would need to
               sync removals too (possibly with some delay).

Restore        For a restore, full access to a bucket is needed.

Danger of Early Removal

When it comes to avoiding early removal of objects, close attention needs to be paid to database backups and restores. To ease restores, an S3 object is kept as long as there is a possibility that it is referenced in any DB backup (plus some safety margin). This is referred to as N days elsewhere in this document.

If we want to ensure S3 objects don’t have to be restored from backup when a DB is restored, we have to ensure a sufficient delay before removing an object from S3:

digraph {
  label="Backup and Restore"

  object_created [ label="Object Created" ]
  object_alive [ shape=none, label="Object is Alive" ]
  object_dead [ shape=none, label="Object is Dead" ]
  object_removed_nice [ label="Object removed in Nice" ]
  object_removed_s3 [ label="Object removed in S3" ]

  db_backup [ label="DB Backed Up", color=green ]
  db_restore [ label="DB Restored", color=green ]

  object_created -> object_alive -> object_removed_nice -> object_dead -> object_removed_s3
  db_backup -> db_restore [ color=green, fontcolor=green, label="N days" ]
  { object_alive db_backup rank=same }
  { object_dead db_restore rank=same }
}

Implementation Details

List of Servers

Objects can be used on various DB servers:

digraph {
  label="Fetch List of Used Objects"
  rankdir=LR

  backup_server -> {
      "db1.prod"
      "db3.prod"
      "db5.prod"
      "db1.stage"
      "db3.stage"
  }
}

Our backup tool needs to know which DB servers exist so it can retrieve a list of used objects from every server. The list of servers is updated by Ansible based on the servers configured in config.yml.

An Up-To-Date Server List

It is crucial that all DB servers are scanned, and it is particularly important that newly added servers are picked up. If a server is skipped, objects only referenced on that machine will be considered unreferenced and, ultimately, removed.

To ensure this is noticed in a timely manner, a new endpoint is added to fetch the list of configured servers:

$ ssh ${S3_BACKUP_SERVER} tocco-s3 list-servers
db1.prod.tocco.cust.vsh.net
db1.stage.tocco.cust.vsh.net
…

The command= option in authorized_keys(5) is used to bind the command to the connecting key, allowing only the one specified command to be executed.
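
For illustration, such an entry in authorized_keys could look like this (key and comment shortened; the actual entry may differ):

command="tocco-s3 list-servers",restrict ssh-ed25519 AAAA… backup@…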

Whenever operating on an installation, including creating or modifying one, Ansible connects to the endpoint and verifies that the server currently configured for the installation is in the set of servers returned by the endpoint. Ansible aborts with an error if this is not the case.

Note

Current Implementation

DB servers are configured in config.toml, which is generated by Ansible.

See Ansible role s3backup.

List of Objects

A list of live objects is fetched via ssh from every DB server:

ssh $DB_SERVER nice2_list_live_objects

Returned is a list of hashes, one object per line. Hashes are hex-encoded and lower-case, each followed by ‘,’ and the name of the corresponding DB:

72ce10090d03f49430469c641f45ed3aa02d99cb766079aa231a6c19b52af2ee,nice_sbk
ca45c2de6d54502a061cdce066ee43a4211d5b3b772b6e36c83a84fca01fb50a,nice_bbg
a9a361f131b5301c40a404b8580f6ea777022ade8e1d704f826cc0e56693bde8,nice_bbg

The same hash may be included in the list multiple times. Reference counting can be inaccurate and, historically, various objects have reached a reference_count of zero while still being referenced. Thus, all objects are considered, even those with a reference count of zero. Foreign key constraints prevent a referenced object from being removed from the Nice DB. Hence, the absence of a row is a reliable indicator that an object is unused within a database.

Returned hashes should be verified to be valid SHA2 hashes, and invalid hashes should be rejected with a hard error to ensure objects are not silently ignored.

The database name is not used later on, other than to print better diagnostic messages.
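
A sketch of how a returned line could be parsed and validated in Rust (the helper name is hypothetical; hashes are assumed to be SHA-256, matching the 64-character examples above):

/// Parses one "hash,db_name" line, rejecting anything that is not a
/// lower-case hex SHA-256 hash followed by a DB name.
fn parse_line(line: &str) -> Result<(&str, &str), String> {
    let (hash, db) = line
        .split_once(',')
        .ok_or_else(|| format!("malformed line: {line:?}"))?;
    // Hard error on invalid hashes so that no object is silently ignored.
    let is_valid = hash.len() == 64
        && hash.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    if !is_valid {
        return Err(format!("invalid hash in line: {line:?}"));
    }
    Ok((hash, db))
}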

As with the server-list endpoint, the command= option in authorized_keys(5) is used to bind the command to the connecting key, allowing only the one specified command to be executed.

Any database containing a _nice_binary table is scanned for objects.

Important

It is crucial that no objects are removed when fetching live objects from any server fails. Continuing could mean that some objects are not correctly accounted for as alive.

An --created-before <ISO_8601_TIMESTAMP> option is available to restrict the returned objects to those created before the given ISO 8601 timestamp. See Integrity Checks.

Warning

The created_at column may contain NULL values. Be sure that --created-before also returns rows where created_at is NULL.
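
A sketch of a filter satisfying this requirement, as it might be embedded in the implementation (the actual query may differ):

// Rows with an unknown creation time must still be returned.
const LIVE_OBJECTS_BEFORE: &str =
    "SELECT hash FROM _nice_binary
     WHERE created_at < $1 OR created_at IS NULL";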

Important

The client calling the SSH endpoint is considered untrusted and care needs to be taken when parsing supplied parameters.

Tracking Live Objects

In order to track which objects have been removed in Nice, the following metadata about every object is stored in a state DB:

Name       Description

bucket     Bucket name

hash       SHA2 hash of an object

last seen  Last time we’ve seen this object
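
A minimal sketch of such a state table, as it might be created with rusqlite (names and schema are illustrative; see state::Db for the actual implementation):

use rusqlite::Connection;

/// Creates the state table tracking when an object was last seen alive.
fn init_state_db(conn: &Connection) -> rusqlite::Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS objects (
             bucket    TEXT NOT NULL,
             hash      TEXT NOT NULL,
             last_seen TEXT NOT NULL, -- ISO 8601 timestamp
             PRIMARY KEY (bucket, hash)
         )",
    )
}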

Note

Objects are tracked globally rather than per bucket in order to reduce complexity. Tracking per bucket would require associating every DB with its bucket, and this information currently only exists in incomplete form in Ansible’s config.yml.

Objects are deleted once N days have passed since their last seen timestamp or, in other words, once an object has been removed/dead in Nice for N days.
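
Against the state table sketched above, the expired objects could be selected like this (illustrative only; ?1 would be bound to '-N days'):

// Everything not seen alive for N days is due for removal in S3.
const EXPIRED_OBJECTS: &str =
    "SELECT bucket, hash FROM objects
     WHERE last_seen < datetime('now', ?1)";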

Warning

The current design does not handle the following race condition:

  1. An object is removed in Nice.

  2. Scans indicate the object has been removed and is dead.

  3. N days pass.

  4. Scans still indicate the object is dead.

  5. The object gets recreated.

  6. The object gets removed in S3.

There is no easy way to avoid this race but it’s unlikely to happen in a real-world scenario and the integrity checks will detect it.

Last seen is updated when an object is observed to be alive. An object is considered alive when a) it still exists within any Nice DB (see List of Objects), or b) it has just been created in the S3 bucket (i.e. it exists in S3 but not yet in the state DB).

Note

The additional scanning of the S3 buckets ensures objects are observed even when a DB scan cannot detect them. This can happen when a) an object is removed (or the whole DB dropped) before it is observed in a DB scan, or b) an object is created during local development.

While there is no need to back up objects created during development, it’s still desirable to track them to ensure they’ll get removed again.
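
Updating the last seen timestamp could then be a simple upsert into the state table sketched earlier (hypothetical helper; see state::Db for the real implementation):

use rusqlite::{params, Connection};

/// Marks an object as seen alive right now.
fn mark_seen(conn: &Connection, bucket: &str, hash: &str) -> rusqlite::Result<()> {
    conn.execute(
        "INSERT INTO objects (bucket, hash, last_seen)
         VALUES (?1, ?2, datetime('now'))
         ON CONFLICT (bucket, hash) DO UPDATE SET last_seen = excluded.last_seen",
        params![bucket, hash],
    )?;
    Ok(())
}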

There are potentially millions of objects. Care needs to be taken that memory use is kept at a reasonable level while hash lookups remain fast.
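
One way to keep memory bounded, sketched here as an assumption rather than the actual implementation, is to store each hash as its raw 32 bytes instead of a 64-character hex string:

use std::collections::HashSet;

/// Decodes a lower-case hex SHA-256 hash into its raw 32 bytes, halving
/// memory use compared to keeping the hex string.
fn decode_hash(hex: &str) -> Option<[u8; 32]> {
    if hex.len() != 64 {
        return None;
    }
    let mut out = [0u8; 32];
    for (i, chunk) in hex.as_bytes().chunks(2).enumerate() {
        let hi = (chunk[0] as char).to_digit(16)?;
        let lo = (chunk[1] as char).to_digit(16)?;
        out[i] = (hi * 16 + lo) as u8;
    }
    Some(out)
}

// A HashSet<[u8; 32]> then provides fast membership tests for millions of
// objects at roughly 32 bytes of payload per entry.
type LiveSet = HashSet<[u8; 32]>;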

Objects in S3 whose key is not a valid SHA2 hash, as encoded by Nice, are backed up as a precautionary measure and updated if the modification time changes. However, there is no reliable way to do a partial restore as the modification time on S3 cannot be adjusted. See tocco-s3 restore --help.

Current implementation

Implemented as an SQLite DB. See state::Db.

Actual Removal

When removing objects, it is important that the state DB is up-to-date, that is, that there has recently been a successful and complete update of the state DB.

Recover from Accidental Removal

All objects in S3 are synced to disk on s3backup2.tocco.cust.vshn.net, daily BTRFS snapshots are taken and, additionally, everything is archived via Burp and Borg. When an object is removed in S3, the on-disk copy is removed too. The on-disk BTRFS snapshots can be used for a restore and, in the worst case, data can be restored from the Burp or Borg archives.

Current implementation

Implemented in tocco-s3 as tocco-s3 prune.

In order to ensure the state DB is up-to-date, the timestamp of the last successful update is stored in the state DB:

$ sudo -u s3backup sqlite3 /var/lib/s3backup/state.sqlite
sqlite> select * from key_value where key = 'last_db_scan';
last_db_scan|2023-09-14 07:27:15

last_db_scan is when the DBs were last scanned successfully and the last seen timestamps were updated. tocco-s3 will refuse to remove any objects if the last successful scan was more than 8 days ago; this prevents objects from being removed based on an outdated state DB.
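
The age check itself could look roughly like this (using chrono; the timestamp format matches the sqlite3 output shown above, the helper is hypothetical):

use chrono::{NaiveDateTime, Utc};

/// Returns true if the last successful DB scan is recent enough for object
/// removal to be safe.
fn scan_is_fresh(last_db_scan: &str) -> bool {
    let Ok(scanned) =
        NaiveDateTime::parse_from_str(last_db_scan, "%Y-%m-%d %H:%M:%S")
    else {
        return false; // unparseable timestamp: refuse to remove anything
    };
    let age = Utc::now().naive_utc() - scanned;
    age.num_days() <= 8
}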

Integrity Checks

Check if Objects are in Backup Archive

A weekly integrity check fetches all objects older than 28h and asserts their existence in the backups. This is done using the --created-before option in List of Objects.

The 28h delay is there because syncing from S3 to the backups is done daily and, thus, newer objects may not be part of the backup archive yet.

This is used to help detect consistency issues and an alert is sent out when this integrity check fails.
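
A sketch of the core of such a check, assuming the local backup keeps one file per object named after its hash (the actual on-disk layout may differ):

use std::path::Path;

/// Returns the hashes of live objects missing from the local backup;
/// an alert is warranted whenever this is non-empty.
fn missing_from_backup(backup_dir: &Path, live_hashes: &[String]) -> Vec<String> {
    live_hashes
        .iter()
        .filter(|hash| !backup_dir.join(hash.as_str()).exists())
        .cloned()
        .collect()
}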

Note

One important case that is not detected is when a new server is not scanned for objects because we were not informed about it.

See An Up-To-Date Server List

Current implementation

This is implemented in tocco-s3 as tocco-s3 check-availability.

There is also tocco-s3 check-content, which checks each object’s content against its hash (which is its key). This check, however, is not connected to object removal.