Data
Flow manages a large collection of data files - those uploaded, those produced by pipeline executions, and those imported from elsewhere on the server. Each of these files on disk is represented in the database.
The Data Model
In Flow, the data model represents a file or directory. When you search data, you are searching the corresponding records in the database, not the actual files on disk.
Generally, data in Flow exists in one of two places: uploaded data is stored in the UPLOADS_ROOT, and data generated by pipeline executions is stored in the work directories of those executions, which are stored in the EXECUTIONS_ROOT. Some instances of Flow allow you to import data from elsewhere on the server, in which case data may also exist outside these two locations.
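As a rough illustration of the three kinds of location described above, the following sketch classifies a path by which root it falls under. The two root values and the `classify_location` helper are hypothetical; a real instance of Flow defines its own roots, and this is not how Flow itself necessarily decides where data lives.

```python
from pathlib import PurePosixPath

# Hypothetical root values, chosen only for illustration.
UPLOADS_ROOT = PurePosixPath("/srv/flow/uploads")
EXECUTIONS_ROOT = PurePosixPath("/srv/flow/executions")


def classify_location(path: str) -> str:
    """Report which of the three kinds of location a path falls under."""
    p = PurePosixPath(path)
    if UPLOADS_ROOT in p.parents:
        return "uploaded"
    if EXECUTIONS_ROOT in p.parents:
        return "execution"
    # Anything else can only have been imported from elsewhere on the server.
    return "imported"


print(classify_location("/srv/flow/uploads/123/reads.fastq.gz"))   # uploaded
print(classify_location("/srv/flow/executions/456/work/out.bam"))  # execution
print(classify_location("/data/shared/genomes/hg38.fa"))           # imported
```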
Data has a many-to-one relationship with process executions, recording the process execution that created it, and a separate many-to-many relationship with process executions, recording those that took it as input: a piece of data can be used by many process executions, but can only be created by one. Likewise, it has a many-to-many relationship with executions, denoting the data inputs to an overall pipeline execution.
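These relationships can be sketched with plain Python objects. This is a simplified in-memory model, not Flow's database schema: the class layout and the `produce` and `consume` helpers are assumptions made for illustration, and only the field names `upstream_process_execution` and `downstream_process_executions` come from the data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare by identity; the objects reference each other
class ProcessExecution:
    name: str
    inputs: List["Data"] = field(default_factory=list)   # many-to-many side
    outputs: List["Data"] = field(default_factory=list)  # reverse of many-to-one


@dataclass(eq=False)
class Data:
    filename: str
    # At most one creator: None for uploaded or imported data.
    upstream_process_execution: Optional[ProcessExecution] = None
    # Any number of consumers.
    downstream_process_executions: List[ProcessExecution] = field(default_factory=list)


def produce(execution: ProcessExecution, filename: str) -> Data:
    """Hypothetical helper: record that an execution created a file."""
    data = Data(filename, upstream_process_execution=execution)
    execution.outputs.append(data)
    return data


def consume(execution: ProcessExecution, data: Data) -> None:
    """Hypothetical helper: record that an execution used a file as input."""
    execution.inputs.append(data)
    data.downstream_process_executions.append(execution)


reads = Data("reads.fastq.gz")  # uploaded, so it has no upstream execution
trim, align, qc = (ProcessExecution(n) for n in ("trim", "align", "qc"))

consume(trim, reads)
trimmed = produce(trim, "trimmed.fastq.gz")
consume(align, trimmed)  # the same data feeds two different executions
consume(qc, trimmed)
```

Here `trimmed` has exactly one creator (`trim`) but two consumers (`align` and `qc`), which is the asymmetry the two relationships capture.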
The key properties of the data model are:
- Name
filename
- Type
- string
- Description
The name of the file or directory as it exists on disk.
- Name
filetype
- Type
- string
- Description
The file's file extension, such as "txt" or "fastq.gz". This will be in the filename field too, but separating this out is useful for searching and filtering.
- Name
size
- Type
- int
- Description
The size of the data in bytes.
- Name
category
- Type
- int
- Description
The type of data - 2 is annotation sheets, 3 is multiplexed data, 4 is demultiplexed data, and 1 is everything else.
- Name
created
- Type
- int
- Description
The timestamp for the creation of the data.
- Name
is_ready
- Type
- boolean
- Description
Uploaded data is uploaded in chunks - the data object is created on the first chunk upload with this set to False, and then set to True on the final chunk upload. For most purposes, data is not accessible when this is False.
- Name
is_removed
- Type
- boolean
- Description
When data is deleted, if it hasn't been used as the input to anything then the data object is removed from the database. If it has, it is retained and this is set to True, rendering it inaccessible but preserving the record of its use.
- Name
is_directory
- Type
- boolean
- Description
Identifies the data as a directory rather than a simple file.
- Name
is_binary
- Type
- boolean
- Description
Identifies that the data is not plain text and shouldn't be previewable.
- Name
private
- Type
- boolean
- Description
If False, anybody will be able to view the data, even users not signed in (providing they can access the instance of Flow).
- Name
relative_process_execution_path
- Type
- string
- Description
For data produced by executions, Flow generally assumes the data is at the root of the producing process execution's directory - if it isn't, this field gives its location within that directory.
- Name
absolute_path
- Type
- string
- Description
For data that has been brought into Flow from elsewhere on the file system, this is the full absolute path to that data.
- Name
upstream_process_execution
- Type
- ID
- Description
The process execution that produced the data, if any.
- Name
downstream_process_executions
- Type
- [ID]
- Description
A many-to-many field identifying what process executions have used this data as input.
- Name
owner
- Type
- ID
- Description
The user who owns the data.
- Name
creator
- Type
- ID
- Description
The user who originally created the data.
- Name
group_owner
- Type
- ID
- Description
The group who owns the data.
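The category, is_ready and is_removed fields interact in ways that are easier to see in code. The sketch below is an in-memory approximation, not Flow's implementation: the `Category` member names, the `is_accessible` and `delete` helpers, and the dictionary standing in for the database table are all hypothetical. It also assumes that a retained record is flagged by setting is_removed to True, and it only considers process executions when deciding whether data has been used as an input.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, List


class Category(IntEnum):
    # Member names are illustrative; only the integer values come from the docs.
    OTHER = 1
    ANNOTATION = 2
    MULTIPLEXED = 3
    DEMULTIPLEXED = 4


@dataclass
class Data:
    id: int
    filename: str
    category: Category = Category.OTHER
    is_ready: bool = False    # becomes True once the final chunk is uploaded
    is_removed: bool = False
    downstream_process_executions: List[int] = field(default_factory=list)


def is_accessible(data: Data) -> bool:
    """Data is usable only when fully uploaded and not flagged as removed."""
    return data.is_ready and not data.is_removed


def delete(data: Data, table: Dict[int, Data]) -> None:
    """Drop unused data entirely; keep used data as an inaccessible record."""
    if data.downstream_process_executions:
        data.is_removed = True  # preserve the record of its use
    else:
        del table[data.id]


table = {
    1: Data(1, "notes.txt", is_ready=True),
    2: Data(2, "reads.fastq.gz", Category.DEMULTIPLEXED, is_ready=True,
            downstream_process_executions=[10]),
}
used = table[2]
delete(table[1], table)  # never used as input: the record disappears
delete(used, table)      # used by process execution 10: the record is kept
```

After these calls, only record 2 remains in `table`, and `is_accessible(used)` is False.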
Data Types
In addition to the built-in data attributes, pipelines may wish to define their own data types for the purposes of filtering. For example, it may not be enough to simply specify that an input is a FASTA file: a pipeline may need to require a specific kind of FASTA file, and restrict the available data to files that have been tagged with that custom data type.
The data type model which allows for this has the following attributes:
- Name
id
- Type
- string
- Description
A unique string, which is how the type will be referred to in pipeline schema.
- Name
name
- Type
- string
- Description
The name of the data type.
- Name
description
- Type
- string
- Description
A free text description of what the data type represents.
These are global objects, managed by admins.
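To make the filtering use case concrete, here is a minimal sketch of selecting candidate inputs by both file extension and custom data type. How data is actually tagged with a data type is not described here, so the `data_type_ids` field, the `matching_data` helper and the "genome-fasta" type are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class DataType:
    id: str           # unique string used to refer to the type
    name: str
    description: str


@dataclass
class Data:
    filename: str
    filetype: str
    # Hypothetical field: the IDs of the data types this file is tagged with.
    data_type_ids: Set[str] = field(default_factory=set)


def matching_data(available: List[Data], filetype: str, data_type_id: str) -> List[Data]:
    """Keep only data with the right extension AND the required custom type."""
    return [
        d for d in available
        if d.filetype == filetype and data_type_id in d.data_type_ids
    ]


genome_fasta = DataType("genome-fasta", "Genome FASTA", "A whole-genome assembly.")

available = [
    Data("hg38.fa", "fa", {genome_fasta.id}),
    Data("primers.fa", "fa"),  # a FASTA file, but not tagged as a genome
    Data("hg38.gtf", "gtf"),
]

candidates = matching_data(available, "fa", genome_fasta.id)
print([d.filename for d in candidates])  # ['hg38.fa']
```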
Filesets
A fileset is any ordered grouping of data files.
The two main use cases for filesets are associating the reads files of a paired-end sample or multiplexed reads pair, and representing the files of a genome release.
The model has the following attributes:
- Name
name
- Type
- string
- Description
The name of the fileset - often this will be auto-generated from the associated data filenames.
- Name
long_name
- Type
- string
- Description
Sometimes a fileset will have a longer version of its common name - useful when representing genome releases.
- Name
url
- Type
- string
- Description
A URL to any online resource associated with this fileset - again useful when representing a genome release.
Data objects are associated with a fileset through their fileset and fileset_order attributes.
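The ordering can be illustrated with a small sketch that recovers a fileset's members in order. The classes are simplified stand-ins rather than Flow's models, the `files_in` helper is hypothetical, and the assumption that fileset_order is an integer sorted in ascending order is not confirmed by the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # compare filesets by identity
class Fileset:
    name: str
    long_name: str = ""
    url: str = ""


@dataclass
class Data:
    filename: str
    fileset: Optional[Fileset] = None
    fileset_order: Optional[int] = None  # assumed to be set for every member


def files_in(fileset: Fileset, all_data: List[Data]) -> List[Data]:
    """Return the members of a fileset, sorted by their fileset_order."""
    members = [d for d in all_data if d.fileset is fileset]
    return sorted(members, key=lambda d: d.fileset_order)


pair = Fileset("sample1")
all_data = [
    Data("sample1_R2.fastq.gz", pair, 2),
    Data("unrelated.txt"),
    Data("sample1_R1.fastq.gz", pair, 1),
]

print([d.filename for d in files_in(pair, all_data)])
# ['sample1_R1.fastq.gz', 'sample1_R2.fastq.gz']
```

Storing the order on the data object rather than on the fileset means a paired-end sample's first and second reads files stay distinguishable even though both point at the same fileset.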