Data

Flow manages a large collection of data files - those uploaded, those produced by pipeline executions, and those imported from elsewhere on the server. Each of these files on disk is represented in the database.

The Data Model

In Flow, each instance of the data model represents a single file or directory. When you search data, you are searching the corresponding records in the database, not the actual files on disk.

Generally, data in Flow exists in one of two places: uploaded data is stored in the UPLOADS_ROOT, and data generated by pipeline executions is stored in the work directories of those executions, which are themselves stored in the EXECUTIONS_ROOT. Some instances of Flow allow you to import data from elsewhere on the server, in which case data may also exist outside these two locations.

Data has a many-to-one relationship with process executions, recording which process execution created it, and a separate many-to-many relationship with process executions, recording which of them used it as input. Data can be used as input by many process executions, but can only be created by one. Likewise, data has a many-to-many relationship with executions, denoting the data inputs to an overall pipeline execution.
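As an illustration of these two relationships, here is a minimal in-memory sketch. It is not Flow's actual implementation: the classes and the record_output and record_input helpers are hypothetical, and only the field names upstream_process_execution and downstream_process_executions come from the data model itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ProcessExecution:
    id: str


@dataclass
class Data:
    id: str
    filename: str
    # Many-to-one: at most one process execution created this data.
    upstream_process_execution: Optional[ProcessExecution] = None
    # Many-to-many: any number of process executions may use it as input.
    downstream_process_executions: list[ProcessExecution] = field(default_factory=list)


def record_output(process_execution: ProcessExecution, data: Data) -> None:
    """Hypothetical helper: mark the creator of data (allowed only once)."""
    if data.upstream_process_execution is not None:
        raise ValueError(f"{data.filename} already has a creating process execution")
    data.upstream_process_execution = process_execution


def record_input(process_execution: ProcessExecution, data: Data) -> None:
    """Hypothetical helper: mark data as an input (allowed many times)."""
    if process_execution not in data.downstream_process_executions:
        data.downstream_process_executions.append(process_execution)


trim = ProcessExecution("pe-trim")
align = ProcessExecution("pe-align")
qc = ProcessExecution("pe-qc")

reads = Data("d1", "trimmed.fastq.gz")
record_output(trim, reads)   # created once
record_input(align, reads)   # used as input here...
record_input(qc, reads)      # ...and here
```

Calling record_output a second time for the same data raises an error, mirroring the rule that data can only be created by one process execution.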

The key properties of the data model are:

| Name | Type | Description |
| --- | --- | --- |
| filename | string | The name of the file or directory as it exists on disk. |
| filetype | string | The file's file extension, such as "txt" or "fastq.gz". This will be in the filename field too, but separating it out is useful for searching and filtering. |
| size | int | The size of the data in bytes. |
| category | int | The type of data - 2 is annotation sheets, 3 is multiplexed data, 4 is demultiplexed data, and 1 is everything else. |
| created | int | The timestamp for the creation of the data. |
| is_ready | boolean | Uploaded data is uploaded in chunks - the data object is created on the first chunk upload with this set to False, and then set to True on the final chunk upload. For most purposes, data is not accessible when this is False. |
| is_removed | boolean | When data is deleted, if it hasn't been used as the input to anything then the data object is removed from the database. If it has, it is retained and this is set to True, rendering it inaccessible but preserving the record of its use. |
| is_directory | boolean | Identifies the data as a directory rather than a simple file. |
| is_binary | boolean | Identifies that the data is not plain text and shouldn't be previewable. |
| private | boolean | If false, anybody will be able to view the data, even users not signed in (providing they can access the instance of Flow). |
| relative_process_execution_path | string | For data produced by executions, Flow generally assumes the data is at the root of the producing process execution's directory - if it isn't, this field gives its location within that directory. |
| absolute_path | string | For data that has been brought into Flow from elsewhere on the file system, this is the full absolute path to that data. |
| upstream_process_execution | ID | The process execution that produced the data, if any. |
| downstream_process_executions | [ID] | A many-to-many field identifying which process executions have used this data as input. |
| owner | ID | The user who owns the data. |
| creator | ID | The user who originally created the data. |
| group_owner | ID | The group that owns the data. |
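The interaction between is_ready, is_removed and downstream_process_executions can be sketched as follows. This is a simplified in-memory illustration, not Flow's code: the DataRecord class, the dictionary standing in for the database, and the is_accessible and delete_data functions are all hypothetical. It also assumes that a retained record is flagged by setting is_removed to True.

```python
from dataclasses import dataclass, field


@dataclass
class DataRecord:
    filename: str
    is_ready: bool = False    # set to True once the final chunk has been uploaded
    is_removed: bool = False  # assumed True for deleted-but-retained records
    downstream_process_executions: list = field(default_factory=list)


def is_accessible(data: DataRecord) -> bool:
    """Data is usable only when fully uploaded and not flagged as removed."""
    return data.is_ready and not data.is_removed


def delete_data(store: dict, data_id: str) -> None:
    """Delete a record, keeping a flagged copy if it was ever used as input."""
    data = store[data_id]
    if data.downstream_process_executions:
        # Used as input somewhere: retain the record to preserve that history.
        data.is_removed = True
    else:
        # Never used as input: remove the record entirely.
        del store[data_id]


# The dict is a stand-in for the database table of data records.
store = {
    "d1": DataRecord("unused.txt", is_ready=True),
    "d2": DataRecord("reads.fastq.gz", is_ready=True,
                     downstream_process_executions=["pe-align"]),
}
delete_data(store, "d1")  # record disappears
delete_data(store, "d2")  # record stays, flagged as removed
```

After these two calls, only the record for reads.fastq.gz remains in the store, and is_accessible returns False for it.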

Data Types

In addition to the built-in data attributes, pipelines may define their own data types for the purposes of filtering. For example, it may not be enough to specify that an input is a FASTA file: a pipeline may need to define a specific kind of FASTA file, and offer only data that has been tagged with that custom data type.

The data type model which allows for this has the following attributes:

| Name | Type | Description |
| --- | --- | --- |
| id | string | A unique string, which is how the type will be referred to in pipeline schema. |
| name | string | The name of the data type. |
| description | string | A free text description of what the data type represents. |

These are global objects, managed by admins.
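To make the filtering idea concrete, here is a small sketch of narrowing candidate inputs by both file extension and custom data type. How Flow actually attaches data types to data is not described here, so the data_type_ids attribute, the matching_inputs function and the example type are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DataType:
    id: str           # unique string used to refer to the type
    name: str
    description: str


@dataclass
class DataRecord:
    filename: str
    filetype: str
    # Hypothetical: ids of the custom data types this data is tagged with.
    data_type_ids: set[str] = field(default_factory=set)


def matching_inputs(records: list[DataRecord], filetype: str,
                    data_type_id: Optional[str] = None) -> list[DataRecord]:
    """Return records with the given extension and, if requested, the given tag."""
    return [
        r for r in records
        if r.filetype == filetype
        and (data_type_id is None or data_type_id in r.data_type_ids)
    ]


# An invented example type; real types would be created by admins.
transcriptome = DataType("transcriptome_fasta", "Transcriptome FASTA",
                         "A FASTA file of transcript sequences.")

records = [
    DataRecord("genome.fa", "fa"),
    DataRecord("transcripts.fa", "fa", {transcriptome.id}),
    DataRecord("reads.fastq.gz", "fastq.gz"),
]

all_fasta = matching_inputs(records, "fa")
tagged_fasta = matching_inputs(records, "fa", transcriptome.id)
```

Here all_fasta contains both FASTA files, while tagged_fasta contains only transcripts.fa.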

Filesets

A fileset is any ordered grouping of data files.

The two main use cases for filesets are associating the reads files of a paired-end sample or multiplexed reads pair, and representing the files of a genome release.

The model has the following attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | string | The name of the fileset - often this will be auto-generated from the associated data filenames. |
| long_name | string | Sometimes a fileset will have a longer version of its common name - useful when representing genome releases. |
| url | string | A URL to any online resource associated with this fileset - again useful when representing a genome release. |

Data objects are associated with a fileset through their fileset and fileset_order attributes.
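As a rough sketch of how an ordered fileset could be read back from these two attributes, the example below collects the data belonging to one fileset and sorts it by fileset_order. The DataRecord class and the fileset_members function are hypothetical, and the fileset is identified by a plain string purely to keep the example short.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class DataRecord:
    filename: str
    fileset: Optional[str] = None        # hypothetical identifier of the fileset
    fileset_order: Optional[int] = None  # position of the data within the fileset


def fileset_members(records: list[DataRecord], fileset: str) -> list[DataRecord]:
    """Return the data in a fileset, in the order given by fileset_order."""
    members = [r for r in records if r.fileset == fileset]
    # Assumes every member of a fileset has fileset_order set.
    return sorted(members, key=lambda r: r.fileset_order)


records = [
    DataRecord("sample_R2.fastq.gz", fileset="sample", fileset_order=2),
    DataRecord("notes.txt"),  # not part of any fileset
    DataRecord("sample_R1.fastq.gz", fileset="sample", fileset_order=1),
]

paired = [r.filename for r in fileset_members(records, "sample")]
```

For a paired-end sample like this one, paired lists the R1 file before the R2 file regardless of the order in which the records were stored.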