Organisms and Genomes
One of the most important ways of organising data in Flow is by the biological species it is associated with and the genome version it is aligned to.
Organisms
An organism is a specific biological species that a Flow instance has data for. Different instances will have different organisms defined. Its key properties are:
- id (string): A two-character identifier for the organism, such as "Hs". All Flow objects have IDs, but organisms are unique in having a meaningful string ID, rather than a random integer.
- name (string): The short, everyday name of the organism, such as "Human" or "Mouse".
- latin_name (string): The full Latin name of the organism, such as "Homo sapiens" or "Mus musculus".
Organisms are Flow-wide and public: they are created by admins and are available to every user.
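As a rough illustration, an organism record with these three properties could be modelled as in the sketch below. The `Organism` class and the `ORGANISMS` lookup are hypothetical names used only here and are not part of Flow's API. The `"Mm"` ID for mouse is also an assumption, chosen by analogy with `"Hs"`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Organism:
    """Hypothetical stand-in for a Flow organism record."""
    id: str          # meaningful string ID, e.g. "Hs"
    name: str        # short, everyday name
    latin_name: str  # full Latin name


# Illustrative records only; "Mm" is an assumed ID, not taken from Flow.
ORGANISMS = {
    organism.id: organism
    for organism in [
        Organism(id="Hs", name="Human", latin_name="Homo sapiens"),
        Organism(id="Mm", name="Mouse", latin_name="Mus musculus"),
    ]
}

# Because the ID is a meaningful string, it can be used directly as a lookup key.
human = ORGANISMS["Hs"]
print(human.name, human.latin_name)
```

Keying the lookup on `id` reflects the point made above: an organism ID is readable on its own, unlike the random integer IDs of other objects.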
Genomes
In Flow, a genome refers to a specific assembly of an organism's genome released by some authoritative body. Every Flow genome must have a FASTA file and a GTF file, each of which is a one-to-one relationship with the data model. It can also have any number of other files, which are linked via a genome field on the data model, creating a one-to-many relationship between genome and data. Its key properties are:
- name (string): The name of the genome release.
- long_name (string): If the genome has a longer, more formal name, that can be represented with this field.
- created (int): The timestamp for when the genome was released.
- url (string): A URL to the release's official page, if any.
- fasta (ID): The data object for the genome's FASTA file.
- gtf (ID): The data object for the genome's GTF file.
- organism (ID): The organism the genome is for.
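The two kinds of relationship between genomes and data can be sketched in a few lines. This is a minimal model under stated assumptions, not Flow's implementation: the `Genome` and `Data` classes, the `extra_files` helper, the integer IDs, the filenames and the timestamp are all hypothetical placeholders, and the timestamp is assumed to be Unix seconds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Data:
    """Hypothetical data object; `genome` is the optional link back to a genome."""
    id: int
    filename: str
    genome: Optional[int] = None  # ID of the genome this file is attached to, if any


@dataclass
class Genome:
    """Hypothetical genome record mirroring the properties listed above."""
    id: int
    name: str
    organism: str   # organism ID, e.g. "Hs"
    fasta: int      # ID of the single FASTA data object (one-to-one)
    gtf: int        # ID of the single GTF data object (one-to-one)
    created: int    # release timestamp (assumed to be Unix seconds)
    long_name: Optional[str] = None
    url: Optional[str] = None


# All IDs, filenames and the timestamp below are placeholders.
genome = Genome(
    id=1, name="GRCh38", organism="Hs", fasta=10, gtf=11, created=0,
    long_name="Genome Reference Consortium Human Build 38",
)

data = [
    Data(id=10, filename="genome.fa"),
    Data(id=11, filename="genes.gtf"),
    Data(id=12, filename="annotation.bed", genome=1),
    Data(id=13, filename="regions.bed", genome=1),
    Data(id=14, filename="unrelated.txt"),
]


def extra_files(genome: Genome, data: List[Data]) -> List[Data]:
    """Return the data objects attached to a genome via their `genome` field."""
    return [d for d in data if d.genome == genome.id]


print([d.filename for d in extra_files(genome, data)])
```

The FASTA and GTF files are reached from the genome through a single field each, whereas the additional files are found by following the `genome` field from the data side, which is what allows any number of them.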
Some pipelines are tagged as genome preparation pipelines: they take the data of a genome and generate useful indexes from it. As we have seen, these are pipelines where prepares_genome is set to True. Executions have a genome attribute, which in this case indicates the genome whose files they used, creating a one-to-many relationship between genomes and executions. Other pipelines use the files produced by these genome preparation executions, and their executions are likewise tagged with the original genome.
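The tagging described above can be illustrated with a small in-memory model. Everything here is a hypothetical sketch: the `Pipeline` and `Execution` classes, the pipeline names, the helper functions and the integer IDs are assumptions made for illustration, and only the `prepares_genome` and `genome` attribute names come from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pipeline:
    """Hypothetical pipeline record."""
    name: str
    prepares_genome: bool = False  # True for genome preparation pipelines


@dataclass
class Execution:
    """Hypothetical execution record."""
    id: int
    pipeline: Pipeline
    genome: Optional[int] = None  # ID of the genome whose files were used, if any


# Placeholder pipelines and executions.
index_pipeline = Pipeline(name="Build Index", prepares_genome=True)
align_pipeline = Pipeline(name="Align Reads")

executions = [
    Execution(id=1, pipeline=index_pipeline, genome=1),  # prepares genome 1
    Execution(id=2, pipeline=align_pipeline, genome=1),  # uses the output of execution 1
    Execution(id=3, pipeline=index_pipeline, genome=2),  # prepares genome 2
    Execution(id=4, pipeline=align_pipeline),            # not tied to any genome
]


def executions_for_genome(executions: List[Execution], genome_id: int) -> List[Execution]:
    """All executions tagged with the given genome (the one-to-many relationship)."""
    return [e for e in executions if e.genome == genome_id]


def genome_prep_executions(executions: List[Execution], genome_id: int) -> List[Execution]:
    """Only those tagged executions whose pipeline prepares the genome."""
    return [e for e in executions_for_genome(executions, genome_id)
            if e.pipeline.prepares_genome]


print([e.id for e in executions_for_genome(executions, 1)])
print([e.id for e in genome_prep_executions(executions, 1)])
```

In this sketch both the preparation execution and the downstream execution carry the same genome ID, so all work derived from a genome can be found with a single filter.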