Data

Flow manages a large collection of data files - those uploaded, those produced by pipeline executions, and those imported from elsewhere on the server. Each of these files on disk is represented in the database.

The Data Model

In Flow, the data model represents a file or directory. When you search data, you are searching the corresponding records in the database, not the actual files on disk.

Generally, data in Flow exists in one of two places: uploaded data is stored in the UPLOADS_ROOT, and data generated by pipeline executions is stored in the work directory of those executions, which are stored in the EXECUTIONS_ROOT. Some instances of Flow allow you to import data from elsewhere on the server, in which case data may also exist outside these two locations.
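
The three possible locations can be sketched as a small lookup. This is only an illustration: the root values, the per-upload subdirectory, the `locate` helper and its parameters are all hypothetical, and the sketch assumes that relative_process_execution_path names a subdirectory rather than the full file path. The real on-disk layout may differ.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed values; a real instance configures its own roots.
UPLOADS_ROOT = PurePosixPath("/srv/flow/uploads")
EXECUTIONS_ROOT = PurePosixPath("/srv/flow/executions")


def locate(
    filename: str,
    data_id: str,
    absolute_path: Optional[str] = None,
    process_execution_dir: Optional[str] = None,
    relative_process_execution_path: str = "",
) -> PurePosixPath:
    """Hypothetical helper: work out where a data record's file lives."""
    if absolute_path:
        # Imported data: stored outside both roots.
        return PurePosixPath(absolute_path)
    if process_execution_dir:
        # Execution output: inside the producing process execution's directory,
        # at its root unless a relative location is given.
        base = EXECUTIONS_ROOT / process_execution_dir
        return base / relative_process_execution_path / filename
    # Uploaded data (assumed layout: one subdirectory per data record).
    return UPLOADS_ROOT / data_id / filename


uploaded = locate("reads.fastq.gz", "42")
produced = locate("out.txt", "43", process_execution_dir="run1/work/ab12")
nested = locate("out.txt", "44", process_execution_dir="run1/work/ab12",
                relative_process_execution_path="results")
imported = locate("genome.fa", "45", absolute_path="/data/ref/genome.fa")
```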

Data has a many-to-one relationship with process executions, recording the process execution that created it, and a separate many-to-many relationship with process executions, recording which process executions took it as input (data can be used by many process executions, but can only be created by one). Likewise, it has a many-to-many relationship with executions, denoting the data inputs to an overall pipeline execution.
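
As a rough illustration of these relationships, the sketch below models them with plain Python dataclasses rather than Flow's actual database models. The `inputs` attribute and the `use_as_input` helper are hypothetical names introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ProcessExecution:
    id: str
    # Hypothetical reverse side of the many-to-many input relationship.
    inputs: List["Data"] = field(default_factory=list)


@dataclass(eq=False)
class Data:
    filename: str
    # Many-to-one: at most one process execution created this data.
    upstream_process_execution: Optional[ProcessExecution] = None
    # Many-to-many: any number of process executions may use it as input.
    downstream_process_executions: List[ProcessExecution] = field(default_factory=list)


def use_as_input(data: Data, process_execution: ProcessExecution) -> None:
    """Record that a process execution took the data as an input."""
    data.downstream_process_executions.append(process_execution)
    process_execution.inputs.append(data)


align = ProcessExecution(id="align")
count = ProcessExecution(id="count")
qc = ProcessExecution(id="qc")

# Created once, by a single process execution...
bam = Data(filename="sample.bam", upstream_process_execution=align)
# ...but used as input by several others.
use_as_input(bam, count)
use_as_input(bam, qc)
```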

The key properties of the data model are:

  • Name
    filename
    Type
    string
    Description

    The name of the file or directory as it exists on disk.

  • Name
    filetype
    Type
    string
    Description

    The file's file extension, such as "txt" or "fastq.gz". This will be in the filename field too, but separating this out is useful for searching and filtering.

  • Name
    size
    Type
    int
    Description

    The size of the data in bytes.

  • Name
    category
    Type
    int
    Description

    The type of data - 2 is annotation sheets, 3 is multiplexed data, 4 is demultiplexed data, and 1 is everything else.

  • Name
    created
    Type
    int
    Description

    The timestamp for the creation of the data.

  • Name
    is_ready
    Type
    boolean
    Description

    Uploaded data is uploaded in chunks - the data object is created on the first chunk upload with this set to False, and then set to True on the final chunk upload. For most purposes, data is not accessible when this is False.

  • Name
    is_removed
    Type
    boolean
    Description

    When data is deleted, if it hasn't been used as the input to anything then the data object is removed from the database. If it has, it is retained and this is set to True, rendering it inaccessible but preserving the record of its use.

  • Name
    is_directory
    Type
    boolean
    Description

    Identifies the data as a directory rather than a simple file.

  • Name
    is_binary
    Type
    boolean
    Description

    Identifies that the data is not plain text and shouldn't be previewable.

  • Name
    private
    Type
    boolean
    Description

    If false, anybody will be able to view the data, even users not signed in (providing they can access the instance of Flow).

  • Name
    relative_process_execution_path
    Type
    string
    Description

    For data produced by executions, Flow generally assumes the data is at the root of the producing process execution's directory - if it isn't, this field gives its location within that directory.

  • Name
    absolute_path
    Type
    string
    Description

    For data that has been brought into Flow from elsewhere on the file system, this is the full absolute path to that data.

  • Name
    upstream_process_execution
    Type
    ID
    Description

    The process execution that produced the data, if any.

  • Name
    downstream_process_executions
    Type
    [ID]
    Description

    A many-to-many field identifying what process executions have used this data as input.

  • Name
    owner
    Type
    ID
    Description

    The user who owns the data.

  • Name
    creator
    Type
    ID
    Description

    The user who originally created the data.

  • Name
    group_owner
    Type
    ID
    Description

    The group that owns the data.
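
The interplay of is_ready, is_removed and downstream_process_executions can be sketched as follows. This is a simplified in-memory model, not Flow's implementation: the `store` dictionary and both helper functions are hypothetical, the sketch assumes that a retained record is flagged by setting is_removed to True, and it only checks process executions as consumers of the data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Data:
    id: str
    filename: str
    is_ready: bool = True
    is_removed: bool = False
    # IDs of process executions that used this data as input.
    downstream_process_executions: List[str] = field(default_factory=list)


def is_accessible(data: Data) -> bool:
    """Data is usable only once fully uploaded and not flagged as removed."""
    return data.is_ready and not data.is_removed


def delete_data(store: Dict[str, Data], data_id: str) -> None:
    """Drop unused data entirely; keep used data as a flagged record."""
    data = store[data_id]
    if data.downstream_process_executions:
        # Used as an input: keep the record so its history is preserved.
        data.is_removed = True
    else:
        # Never used: the record itself can go.
        del store[data_id]


store = {
    "a": Data("a", "notes.txt"),
    "b": Data("b", "reads.fastq.gz", downstream_process_executions=["pe1"]),
    "c": Data("c", "partial.bam", is_ready=False),  # upload still in progress
}
delete_data(store, "a")
delete_data(store, "b")
```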

Data Types

In addition to the built-in data attributes, different pipelines may wish to define their own data types for the purposes of filtering. For example, it may not be enough for a pipeline to specify that an input is a FASTA file: it may need to require a specific kind of FASTA file, and limit the available data to files that have been tagged with that custom data type.

The data type model which allows for this has the following attributes:

  • Name
    id
    Type
    string
    Description

    A unique string, which is how the type will be referred to in pipeline schema.

  • Name
    name
    Type
    string
    Description

    The name of the data type.

  • Name
    description
    Type
    string
    Description

    A free text description of what the data type represents.

These are global objects, managed by admins.
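
A minimal sketch of this kind of filtering is shown below. How a data record is tagged with a data type is not described here, so the `data_type_ids` field, the `filter_by_data_type` helper and the example type IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class DataType:
    id: str
    name: str
    description: str = ""


@dataclass
class Data:
    filename: str
    filetype: str
    # Hypothetical field: IDs of the data types this record is tagged with.
    data_type_ids: Set[str] = field(default_factory=set)


def filter_by_data_type(data: List[Data], filetype: str, data_type_id: str) -> List[Data]:
    """Keep data with the given file extension that carries the given tag."""
    return [
        d for d in data
        if d.filetype == filetype and data_type_id in d.data_type_ids
    ]


transcriptome = DataType(
    id="transcriptome_fasta",
    name="Transcriptome FASTA",
    description="FASTA file of transcript sequences.",
)

data = [
    Data("genome.fa", "fa", {"genome_fasta"}),
    Data("transcripts.fa", "fa", {"transcriptome_fasta"}),
    Data("notes.txt", "txt"),
]

# Only FASTA files tagged with the custom type are offered as inputs.
matches = [d.filename for d in filter_by_data_type(data, "fa", transcriptome.id)]
```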

Filesets

A fileset is any ordered grouping of data files.

The two main use cases for filesets are associating the reads files of a paired-end sample or multiplexed reads pair, and representing the files of a genome release.

The model has the following attributes:

  • Name
    name
    Type
    string
    Description

    The name of the fileset - often this will be auto-generated from the associated data filenames.

  • Name
    long_name
    Type
    string
    Description

    Sometimes a fileset will have a longer version of its common name - useful when representing genome releases.

  • Name
    url
    Type
    string
    Description

    A URL to any online resource associated with this fileset - again useful when representing a genome release.

Data objects are associated with a fileset through their fileset and fileset_order attributes.
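
For example, recovering the files of a paired-end sample in order might look like the sketch below. It uses plain dataclasses rather than Flow's database models, the `ordered_files` helper is hypothetical, and numbering fileset_order from 1 is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)
class Fileset:
    name: str
    long_name: str = ""
    url: str = ""


@dataclass(eq=False)
class Data:
    filename: str
    fileset: Optional[Fileset] = None
    fileset_order: Optional[int] = None


def ordered_files(fileset: Fileset, data: List[Data]) -> List[Data]:
    """Return the fileset's members sorted by their position in it."""
    members = [d for d in data if d.fileset is fileset]
    return sorted(members, key=lambda d: d.fileset_order)


sample = Fileset(name="sample1")
data = [
    Data("sample1_R2.fastq.gz", fileset=sample, fileset_order=2),
    Data("unrelated.txt"),  # not part of any fileset
    Data("sample1_R1.fastq.gz", fileset=sample, fileset_order=1),
]

paired = [d.filename for d in ordered_files(sample, data)]
```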
