Archive and restore media with Python

This tutorial will guide you through archiving and restoring media using tator-py. We assume you already have a project with media in it.

Determine which media to archive

In this example, we will archive all media objects from a section. First, get the list of media IDs that will be archived:

import tator

# Substitute your own host, token, project ID, and section ID.
api = tator.get_api(host="https://cloud.tator.io", token="YOUR_TOKEN")

# Fetch all media in the section and collect their IDs.
media_list = api.get_media_list(project_id, section=section_id)
media_ids = [m.id for m in media_list]
note

The section ID can be found in the web UI when you select the section from the list on the left side of the project detail page.
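
You can also look up section IDs programmatically. The sketch below uses tator-py's get_section_list to print the ID and name of every section in the project:

# Print the ID and name of each section to find the one you want.
for section in api.get_section_list(project_id):
    print(section.id, section.name)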

Archive the media

Users don't perform the archive operation directly. Instead, they mark media objects as ready to be archived, and a nightly cron job tags the corresponding objects in the bucket to trigger a lifecycle rule. To mark objects as ready to archive, perform a bulk update that sets the archive_state property to to_archive:

bulk_update = {"archive_state": "to_archive"}
# Scope the update to the media gathered above; without a filter the bulk
# update applies to every media object in the project. The media_id filter
# is assumed to be supported by your tator-py version.
response = api.update_media_list(project_id, media_bulk_update=bulk_update, media_id=media_ids)
print(response)
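
To confirm the flag took effect, one option is to re-fetch the same media and check archive_state (this assumes the media_id filter shown above):

# Re-fetch the media and confirm each one is now marked to_archive.
for media in api.get_media_list(project_id, media_id=media_ids):
    assert media.archive_state == "to_archive"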

What happens next

After the user has set the archive_state flag to to_archive, the following happens:

  1. The next time the nightly cron job runs, the archive tag is added to the object in the bucket (e.g. S3) and is given the value true.
  2. The next time the bucket's lifecycle rule evaluates objects tagged archive: true, it will run on those objects. Amazon S3 evaluates lifecycle rules roughly once per day, but the exact timing is undocumented.
note

The actual result of the bucket's lifecycle rule depends on the project configuration. If the project does not have a backup bucket defined (either a deployment-wide default or a project-specific backup bucket), then the rule transitions the object from the live storage class to the archive storage class (e.g. DEEP_ARCHIVE for S3). If the project does have a backup bucket defined, then the rule deletes the object from the live bucket, leaving only the copy in the backup bucket (which usually defaults to a high-latency, low-cost storage class like DEEP_ARCHIVE).
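
Tator sets up these lifecycle rules itself, but for illustration, the sketch below shows roughly what an equivalent rule could look like when applied with boto3 to an S3 live bucket that has no backup bucket. The bucket name and rule ID are hypothetical:

import boto3

s3 = boto3.client("s3")

# Hypothetical rule: transition objects tagged archive=true to DEEP_ARCHIVE.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-tator-live-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-tagged-media",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
            }
        ]
    },
)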

Restore the media

Once media have been archived, they can be restored to the live state. This is done the same way the media were archived, by performing a bulk update:

bulk_update = {"archive_state": "to_live"}
# As with archiving, scope the update with the media_id filter (assumed to be
# supported by your tator-py version) so only the intended media are restored.
response = api.update_media_list(project_id, media_bulk_update=bulk_update, media_id=media_ids)
print(response)

This value is read by a cron job that requests that the object store temporarily move the object in question into the live storage class (e.g. STANDARD for S3). This request is asynchronous and can take up to 48 hours, so a second job looks for the completion of the request and performs the final step of permanently restoring the object to the live bucket and the live storage class. After a user performs the bulk update to to_live, the order and rough timing of these steps are as follows:

  1. The next time the nightly request restoration cron job runs, it sends a request to temporarily restore the object to the live storage class in its current bucket (in the backup bucket, if the project has one, otherwise in the live bucket). It also sets the restoration_requested flag on the media object to True, signaling the finish restoration cron job to run on this media. The restoration request may take up to 48 hours to complete, so it might take more than one day before the next step runs.
  2. Once the object is restored to the live (i.e. accessible) storage class, the next run of the finish restoration cron job permanently restores the object to the live storage class. If the project has a backup bucket, this means the object is copied from the backup bucket to the live bucket, and the object in the backup bucket "expires" and drops back into the archived storage class, leaving the backup intact.
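
Because restoration is asynchronous, a simple way to track progress is to poll the media until every archive_state reads live again. A minimal sketch, reusing the media_ids list from earlier:

import time

# Poll until every media object reports a live archive_state.
while True:
    states = {m.archive_state for m in api.get_media_list(project_id, media_id=media_ids)}
    if states == {"live"}:
        break
    print(f"Still waiting; current states: {states}")
    time.sleep(3600)  # restoration can take a day or more, so poll infrequently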