
Media, localizations, and states

This document describes media, localizations, and states in Tator as well as the relationship between them.


There are three top-level data types for media objects in Tator: images, videos, and multis. Image and video media objects each represent a single uploaded file, but Tator's ingestion workflows may store several files on the platform for each media object. These files may include thumbnails, animated thumbnails, streaming formats of various resolutions, download-optimized formats, attachments, and fragment metadata. A multi is a grouping of multiple synchronized videos; it contains the IDs of its child media along with data that defines the UI layout of the child videos. Tator supports input files with resolutions up to 8K. Download-optimized formats can retain the native resolution, but streaming formats are limited to 1080p.


A localization is a geometric primitive attached to a specific video frame or image. There are four types of localization in Tator: dots, lines, boxes, and polygons. Each localization object includes a common set of geometry fields, but only a subset is used for each geometry type. Specifically, dots use only the x and y fields; lines use x, y, u, and v; boxes use x, y, width, and height; and polygons use a points field that is a list of x/y pairs. The origin of the coordinate system is the top left of the image or frame. x, u, and width correspond to the horizontal direction; y, v, and height correspond to the vertical direction. Horizontal values are normalized by the width of the image or frame and vertical values by its height, so the maximum for any value is 1.0. The normalized coordinate system gives consistent behavior across varied resolutions.
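
As a rough illustration of the normalization described above, the sketch below converts a box from pixel coordinates to normalized values and back. The `NormalizedBox` class and both helper functions are hypothetical names made up for this example and are not part of Tator's API; only the field names x, y, width, and height come from the text.

```python
from dataclasses import dataclass


@dataclass
class NormalizedBox:
    """Hypothetical container for the box geometry fields (not a Tator class)."""
    x: float       # left edge, as a fraction of frame width
    y: float       # top edge, as a fraction of frame height
    width: float   # box width, as a fraction of frame width
    height: float  # box height, as a fraction of frame height


def box_to_normalized(left_px, top_px, width_px, height_px, frame_w, frame_h):
    """Normalize a pixel-space box; the origin is the top left of the frame."""
    return NormalizedBox(
        x=left_px / frame_w,
        y=top_px / frame_h,
        width=width_px / frame_w,
        height=height_px / frame_h,
    )


def box_to_pixels(box, frame_w, frame_h):
    """Scale a normalized box to a frame of the given pixel size."""
    return (
        round(box.x * frame_w),
        round(box.y * frame_h),
        round(box.width * frame_w),
        round(box.height * frame_h),
    )


# A box drawn on a 1920x1080 frame...
box = box_to_normalized(480, 270, 960, 540, frame_w=1920, frame_h=1080)
# ...covers the same relative region when scaled to a 3840x2160 frame.
pixels_4k = box_to_pixels(box, frame_w=3840, frame_h=2160)
```

Because the stored values are fractions of the frame size, the same box can be drawn on any rendition of the media without knowing the resolution at which it was annotated.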


A state is a collection of other objects. There are three types of state association: media, localizations, and frames. A media-associated state can be thought of as a media collection or playlist. A localization-associated state is synonymous with a "track", and is referred to as such in the annotation view. Tracks typically consist of at most one localization per video frame, but this is not a requirement. A frame-associated state can represent a single discrete event (in which case the state is associated with only one frame) or a segment of time corresponding to an activity in a video. The interpretation of frame-associated states is controlled by the state type definition's interpolation field: no interpolation means every frame must be listed explicitly to be part of the state, latest means only the first frame of an activity must be specified, and range means the first and last frames of an activity must be specified.
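
The three interpolation behaviors can be sketched as a function that expands the frames listed on a single frame-associated state into the frames it covers. This is an illustrative reading of the description above, not Tator's implementation: the function name and the mode strings `"none"`, `"latest"`, and `"range"` are labels chosen for this example, and for the latest mode the sketch assumes the state holds until the end of the video, ignoring any later state that might supersede it.

```python
def covered_frames(mode, frames, num_frames):
    """Return the set of frames a single state covers (hypothetical helper).

    mode       -- "none", "latest", or "range" (illustrative labels only)
    frames     -- frame numbers explicitly listed on the state
    num_frames -- total number of frames in the video
    """
    if not frames:
        return set()
    if mode == "none":
        # Only explicitly listed frames belong to the state.
        return set(frames)
    if mode == "latest":
        # Assumption: the state starts at its first frame and persists
        # to the end of the video.
        return set(range(min(frames), num_frames))
    if mode == "range":
        # The first and last listed frames bound the activity, inclusive.
        return set(range(min(frames), max(frames) + 1))
    raise ValueError(f"unknown interpolation mode: {mode}")


explicit = covered_frames("none", [3, 7], num_frames=10)
held = covered_frames("latest", [3], num_frames=6)
bounded = covered_frames("range", [3, 7], num_frames=10)
```

Under these assumptions, a discrete event is a state listing one frame with no interpolation, while an activity needs one listed frame under latest and two under range.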
