Storage, cloning, and archiving

Storage

Tator uses S3-compatible object storage for digital assets, including videos, images, audio, and generic files. Buckets may be provisioned using any of the officially supported services: AWS Simple Storage Service (S3), Google Cloud Storage (GCS), or MinIO. For any given project, Tator uses up to three buckets: a required live bucket for serving stored media, an optional upload bucket for storing uploaded media prior to transcoding, and an optional backup bucket for storing copies of media in the live bucket. Having three buckets lets deployment administrators configure the storage tier, lifecycle policies, and provider of each bucket independently. Tator allows project-specific live, upload, and backup buckets to be specified, and falls back to the default buckets defined for the deployment if no project-specific bucket is given. Default buckets are specified via Helm chart configuration; project-specific buckets are stored as database objects.
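
The fallback from project-specific buckets to deployment defaults can be sketched as follows. This is a minimal illustration, not Tator's implementation: the `Bucket` and `Project` classes, the `DEFAULT_BUCKETS` table, and the `resolve_bucket` function are all hypothetical names, and the defaults are hard-coded here rather than read from Helm chart configuration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Bucket:
    """Hypothetical stand-in for a bucket database object."""
    name: str


@dataclass
class Project:
    """Hypothetical project with optional project-specific buckets."""
    name: str
    live_bucket: Optional[Bucket] = None
    upload_bucket: Optional[Bucket] = None
    backup_bucket: Optional[Bucket] = None


# Assumed deployment defaults. Only the live bucket is required;
# upload and backup buckets may be left unset.
DEFAULT_BUCKETS: Dict[str, Optional[Bucket]] = {
    "live": Bucket("default-live"),
    "upload": None,
    "backup": Bucket("default-backup"),
}


def resolve_bucket(project: Project, role: str) -> Optional[Bucket]:
    """Return the project-specific bucket for a role, else the default."""
    if role not in DEFAULT_BUCKETS:
        raise ValueError(f"unknown bucket role: {role}")
    specific = getattr(project, f"{role}_bucket")
    return specific if specific is not None else DEFAULT_BUCKETS[role]
```

In this sketch a project that defines only a live bucket still resolves its backup bucket to the deployment default, and resolving a role with no bucket at either level yields `None`.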

When an object is uploaded to the live bucket, Tator stores its object storage key in an internal database object called a Resource. The Resource includes a foreign key to the project-specific bucket where the file is located, if applicable, as well as foreign keys to any Media objects that use the Resource. Media objects keep track of which Resources they use by including the object storage keys in a field called media_files. This is a JSON field that includes keys for thumbnails, animated thumbnails, images, streaming videos, download-optimized videos, and attachments. In addition, Media objects store the URL of the originally uploaded file in a field called source_url. For media uploaded to the upload bucket, this is simply a presigned URL generated for that bucket. Tator also supports import from externally hosted media, so copying to the upload bucket is not required; in that case source_url is set to the external URL.
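
To make the role of media_files concrete, the sketch below walks a media_files-style JSON value and gathers every object storage key it references. The layout is an assumption made for illustration: each role maps to a list of entries, and each entry stores its object key under a field named `path`. The role names and example keys are likewise hypothetical.

```python
from typing import Any, Dict, List


def collect_object_keys(media_files: Dict[str, List[Dict[str, Any]]]) -> List[str]:
    """Gather every object storage key referenced by a media_files value.

    Assumes each role maps to a list of entries and that each entry
    holds its object key under "path" (an illustrative field name).
    """
    keys: List[str] = []
    for entries in media_files.values():
        for entry in entries or []:
            path = entry.get("path")
            if path:
                keys.append(path)
    return keys


# Hypothetical media_files value for a single video.
example_media_files = {
    "thumbnail": [{"path": "org1/proj1/media1/thumb.jpg"}],
    "thumbnail_gif": [{"path": "org1/proj1/media1/thumb.gif"}],
    "streaming": [
        {"path": "org1/proj1/media1/720.mp4", "resolution": [720, 1280]},
        {"path": "org1/proj1/media1/360.mp4", "resolution": [360, 640]},
    ],
    "archival": [{"path": "org1/proj1/media1/original.mp4"}],
    "attachment": [],
}
```

A list like the one returned by `collect_object_keys(example_media_files)` is what would need to be matched against Resource records when media is cloned, archived, or deleted.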

Objects in the live bucket may be accessed by requesting a presigned URL for a given object key through the REST API. A presigned URL is effectively a time-limited token that allows a download or upload of a file or piece of a file. Presigned URLs allow heavy network traffic to go directly to/from the object storage service without being proxied by Tator, facilitating lower latency, faster downloads and uploads, and lower load on Tator's web hosting services.
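
The idea of a presigned URL as a time-limited token can be illustrated with a toy signing scheme. This is deliberately not the signature algorithm used by S3-compatible services, and it is not how Tator generates URLs; it only shows the principle that a URL carries an expiry time and a signature that the storage service can verify without a round trip through the web application. The secret, host name, and function names are all assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # hypothetical key shared by signer and verifier
HOST = "https://storage.example.com"  # placeholder host


def _sign(key: str, expires: int) -> str:
    message = f"GET\n{key}\n{expires}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def presign(key: str, expires_in: int, now: Optional[int] = None) -> str:
    """Build a URL for an object key that is valid for expires_in seconds."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    query = urlencode({"expires": expires, "signature": _sign(key, expires)})
    return f"{HOST}/{key}?{query}"


def verify(url: str, now: Optional[int] = None) -> bool:
    """Check the signature and expiry, as a storage service might."""
    now = int(time.time()) if now is None else now
    parsed = urlparse(url)
    key = parsed.path.lstrip("/")
    query = parse_qs(parsed.query)
    try:
        expires = int(query["expires"][0])
        signature = query["signature"][0]
    except (KeyError, ValueError):
        return False
    expected = _sign(key, expires)
    return hmac.compare_digest(signature.encode(), expected.encode()) and now <= expires
```

Because the object key and expiry are both covered by the signature, changing either one in the URL invalidates it, and the URL stops working once the expiry time has passed.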

Cloning

A common use case for uploaded media is to reuse it in a new project, for example to include it in a larger dataset, to share the data with a different organization, or for education or training purposes. Tator includes a media cloning feature that allows this without copying the underlying media files. Media is cloned using the REST API, and the user making the clone request must have media transfer permissions for both the source and destination projects. Internally, Tator simply creates a new Media object in the destination project that has the same object keys in its media_files field, then adds the new Media to the list of parent media tracked by each object's Resource. If one of the clones is deleted, the Media database object is deleted and removed from the Resource's media list, but the actual files used by the media are not deleted until the number of media using the Resource is zero.
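
The bookkeeping described above amounts to reference counting on Resources. The following in-memory sketch models it under stated assumptions: `Store`, `Media`, and `Resource` are simplified, hypothetical stand-ins for the database objects, permissions checks are omitted, and file deletion is recorded in a list instead of being sent to object storage.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Resource:
    path: str
    media_ids: Set[int] = field(default_factory=set)  # media using this file


@dataclass
class Media:
    id: int
    project: str
    object_keys: List[str]


class Store:
    """Hypothetical in-memory model of Media and Resource records."""

    def __init__(self) -> None:
        self.media: Dict[int, Media] = {}
        self.resources: Dict[str, Resource] = {}
        self.deleted_files: List[str] = []  # keys whose files were removed
        self._next_id = 1

    def _add(self, project: str, object_keys: List[str]) -> Media:
        media = Media(self._next_id, project, list(object_keys))
        self._next_id += 1
        self.media[media.id] = media
        for key in media.object_keys:
            self.resources.setdefault(key, Resource(key)).media_ids.add(media.id)
        return media

    def upload(self, project: str, object_keys: List[str]) -> Media:
        return self._add(project, object_keys)

    def clone(self, media_id: int, dest_project: str) -> Media:
        # Reuse the same object keys; no file is copied.
        return self._add(dest_project, self.media[media_id].object_keys)

    def delete(self, media_id: int) -> None:
        media = self.media.pop(media_id)
        for key in media.object_keys:
            resource = self.resources[key]
            resource.media_ids.discard(media_id)
            if not resource.media_ids:
                # Last user is gone, so the file itself can be removed.
                del self.resources[key]
                self.deleted_files.append(key)
```

Deleting the original while a clone still exists leaves `deleted_files` empty in this model; the underlying file is only released when the last Media that references it is deleted.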

Archiving

After media is uploaded, it is commonly viewed and analyzed heavily for a time, and then rarely if ever accessed again. For various reasons, it may be important to keep the media and its annotations, but keeping media in live storage is expensive and unnecessary. Tator provides a media archiving feature that allows lifecycle management policies to transition designated media to a cheaper storage tier. Media is archived through the REST API by patching the Media objects that no longer need to be streamed or downloaded on a regular basis. If the media has been cloned, an archive or restoration request can only be made on media owned by the organization that also owns the originally uploaded file. A cron job then searches for media that has been marked for archive and sets the object storage tag archive to true on all child object storage keys. If a backup bucket has been specified, the tag is not set until the media in the live bucket has been backed up. From there, a lifecycle management policy specified by the deployment administrator can move the files into a different storage class. To restore archived Media, the same procedure is followed, except that the archived Media objects are patched to be moved back to live storage. For object restoration, a restoration request is first made to the object storage service for each object storage key. Once an object has been moved out of the archival storage class, it is copied to the live storage class, and the Media object state is updated. Note that there are costs associated with restoration, so care must be taken when archiving media.
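
The archive half of this workflow can be sketched as a single pass of the periodic job. This is a simplified model rather than Tator's code: the state names `to_archive` and `archived`, the `archive_pass` function, and the in-memory `tags` and `backed_up` collections are assumptions standing in for database fields and object storage calls.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Media:
    object_keys: List[str]
    archive_state: str = "live"  # assumed states: live, to_archive, archived


def archive_pass(
    media_list: List[Media],
    tags: Dict[str, Dict[str, str]],
    backed_up: Set[str],
    backup_enabled: bool,
) -> None:
    """One pass of a periodic job that tags media marked for archive.

    tags maps object key -> object storage tags; backed_up holds the
    keys already copied to the backup bucket. Both are in-memory
    stand-ins for calls to the object storage service.
    """
    for media in media_list:
        if media.archive_state != "to_archive":
            continue
        if backup_enabled and not all(k in backed_up for k in media.object_keys):
            # Wait for a later pass; the backup is not complete yet.
            continue
        for key in media.object_keys:
            tags.setdefault(key, {})["archive"] = "true"
        media.archive_state = "archived"
```

Once the tag is set, moving the objects to a cheaper storage class is left to the lifecycle policy configured on the bucket, so the job itself never changes the storage class directly.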
