Samples
Samples are one of the most important ways of organising data in Flow. Conceptually, a sample represents a biological sample - or more specifically, the data associated with that biological sample.
The sample model
The only required data for a sample is its initial raw data - the reads files obtained from the sequencer. These are associated with a sample by being part of a fileset, which is itself associated with a sample. Each sample can therefore have multiple filesets associated with it, which can represent resequencing of the original biological sample.
An unanalysed sample will therefore only have this raw data. When an execution is run, Flow looks at the samples that the input data is associated with (either because it is the raw data of that sample, or because it is associated through one of the mechanisms outlined here) and, if they are all identified with one sample (or no sample), the execution is made 'part of' that sample in a many-to-one relationship. All the output data of that execution is then also considered associated with the sample.
Individual process executions can become associated with a sample by the same rule. So in an execution that takes multiple samples as inputs, the execution as a whole won't be associated with any one sample, but most of the data produced will be associated with one of the input samples.
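The association rule can be sketched in a few lines of Python. This is only an illustration of one reading of the rule, not Flow's own code: the function name and the use of integer IDs are hypothetical, and it assumes that inputs with no sample are simply ignored when deciding whether a single sample is identified.

```python
from typing import Iterable, Optional


def sample_for_execution(input_sample_ids: Iterable[Optional[int]]) -> Optional[int]:
    """Return the one sample an execution would belong to, or None.

    Each element is the sample ID of one input file, or None if that
    input is not associated with any sample (an assumption of this sketch).
    """
    # Collect the distinct samples, ignoring inputs that have no sample.
    distinct = {sample_id for sample_id in input_sample_ids if sample_id is not None}
    if len(distinct) == 1:
        # Every sample-bearing input points at the same sample.
        return next(iter(distinct))
    # Either no input had a sample, or several samples were involved.
    return None
```

Under this reading, an execution whose inputs all come from sample 1 (plus, say, a shared reference file with no sample) is assigned to sample 1, while an execution mixing samples 1 and 2 is assigned to neither.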
- Name
name
- Type
- string
- Description
The sample's human readable name.
- Name
private
- Type
- boolean
- Description
If false, anybody will be able to view the sample, even users who are not signed in (provided they can access the instance of Flow). Note that Flow requires certain criteria to be met before a sample can be made public, to protect the quality of the global public dataset.
- Name
created
- Type
- int
- Description
The timestamp for the creation of the sample.
- Name
owner
- Type
- ID
- Description
The user who owns the sample.
- Name
creator
- Type
- ID
- Description
The user who originally created the sample.
- Name
group_owner
- Type
- ID
- Description
The group who owns the sample.
Samples can be created by uploading demultiplexed reads files, by running a pipeline marked as a demultiplexing pipeline, or by running a pipeline marked as a sample importing pipeline. In the latter two cases, Flow has built-in logic for extracting the metadata from the files of the execution; in the first case the metadata is provided directly by the user.
Sample metadata
Most of the attributes of sample objects are 'metadata' - distinguishing features of the original biological sample. Most of these are text, but some of them are other objects, including two that only exist as sample metadata - the sample source (a cell or tissue type that the sample was derived from) and the sample purification target (the target protein for purification). Each of these has the following attributes:
- Name
name
- Type
- string
- Description
The name of the source/target.
- Name
user
- Type
- ID
- Description
Optionally, the source/target can be associated with a specific user. If so, they are 'unvalidated' and visible only to that user - essentially a user contribution. Otherwise they are public, Flow-wide terms.
- Name
created
- Type
- int
- Description
The timestamp for the creation of the source/target.
The other sample metadata attributes are:
- Name
category
- Type
- string
- Description
The sample type - RNA-Seq, scRNA-Seq, ChIP-Seq or CLIP. The value of this attribute can determine which metadata fields are mandatory.
- Name
scientist
- Type
- string
- Description
The name of the researcher who prepared the original biological sample.
- Name
pi
- Type
- string
- Description
The PI of the lab who prepared the original biological sample.
- Name
organisation
- Type
- string
- Description
The organisation that prepared the original biological sample.
- Name
purification_agent
- Type
- string
- Description
The antibody used in sample preparation.
- Name
experimental_method
- Type
- string
- Description
This adds more specific detail to the sample category.
- Name
condition
- Type
- string
- Description
The experimental condition of the sample.
- Name
sequencer
- Type
- string
- Description
The sequencing equipment used to generate the data.
- Name
comments
- Type
- string
- Description
Any additional comments.
- Name
five_prime_barcode_sequence
- Type
- string
- Description
The 5' barcode sequence of the sample.
- Name
three_prime_barcode_sequence
- Type
- string
- Description
The 3' barcode sequence of the sample.
- Name
three_prime_adapter_name
- Type
- string
- Description
The 3' barcode adapter name of the sample.
- Name
three_prime_adapter_sequence
- Type
- string
- Description
The 3' barcode adapter sequence of the sample.
- Name
rt_primer
- Type
- string
- Description
The reverse transcription primer.
- Name
read1_primer
- Type
- string
- Description
The read 1 primer sequence.
- Name
read2_primer
- Type
- string
- Description
The read 2 primer sequence.
- Name
umi_barcode_sequence
- Type
- string
- Description
The UMI barcode sequence.
- Name
umi_separator
- Type
- string
- Description
The UMI separator string in the reads file.
- Name
strandedness
- Type
- string
- Description
Only needed for some sample categories - must be "unstranded", "forward", "reverse" or "auto".
- Name
rna_selection_method
- Type
- string
- Description
Only needed for some sample categories - must be "polya", "ribominus" or "targeted".
- Name
source_text
- Type
- string
- Description
Any qualifying text to go with the sample source.
- Name
purification_target_text
- Type
- string
- Description
Any qualifying text to go with the sample purification target.
- Name
geo
- Type
- string
- Description
The GEO accession of the sample.
- Name
ena
- Type
- string
- Description
The ENA accession of the sample.
- Name
pubmed
- Type
- string
- Description
The PubMed ID associated with the sample.
- Name
organism
- Type
- ID
- Description
The organism the sample is associated with.
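Two of the metadata fields above, strandedness and rna_selection_method, accept only a fixed set of values. A client-side check before submitting metadata might look like the sketch below. The function and constant names are hypothetical, and treating a missing or empty value as acceptable is an assumption based on the fields being needed only for some sample categories.

```python
# Allowed values, taken from the field descriptions above.
ALLOWED_VALUES = {
    "strandedness": {"unstranded", "forward", "reverse", "auto"},
    "rna_selection_method": {"polya", "ribominus", "targeted"},
}


def invalid_metadata_fields(metadata: dict) -> list:
    """Return the names of constrained fields that hold a disallowed value."""
    invalid = []
    for field_name, allowed in ALLOWED_VALUES.items():
        value = metadata.get(field_name)
        # Missing or empty values are skipped here; whether a field is
        # mandatory depends on the sample category, which this sketch ignores.
        if value and value not in allowed:
            invalid.append(field_name)
    return invalid
```

For example, calling it with {"strandedness": "both"} reports the strandedness field, while an empty dictionary reports nothing.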
Projects
Samples are organised into projects. What a project represents for a given group or organisation varies, but typically it represents a single research question. A project has a one-to-many relationship with samples.
Just as an execution is assigned to a sample if all its input data belongs to a single sample, it can also be assigned to a project if all the input data belongs to a single project. The executions of a project are therefore all of its samples' executions, plus its directly contained executions.
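To make the last point concrete, here is a minimal sketch of how a project's full set of executions could be assembled from the two sources. The classes and the use of plain strings as execution identifiers are hypothetical stand-ins for illustration, not Flow's data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    name: str
    executions: List[str] = field(default_factory=list)


@dataclass
class Project:
    name: str
    samples: List[Sample] = field(default_factory=list)
    # Executions assigned to the project itself, not to any one sample.
    executions: List[str] = field(default_factory=list)


def all_project_executions(project: Project) -> List[str]:
    """Combine directly contained executions with those of every sample."""
    combined = list(project.executions)
    for sample in project.samples:
        for execution in sample.executions:
            # Skip anything already listed so each execution appears once.
            if execution not in combined:
                combined.append(execution)
    return combined
```

An execution that takes two samples from the same project as input would sit in the project's own list, since it cannot be assigned to either sample individually.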
- Name
name
- Type
- string
- Description
The project's name.
- Name
description
- Type
- string
- Description
The project's description - this should explain what the research aim is/was, and any other contextual information.
- Name
private
- Type
- boolean
- Description
If false, anybody will be able to view the project, even users who are not signed in (provided they can access the instance of Flow). All of its samples must be eligible to be public for a project to be made public.
- Name
created
- Type
- int
- Description
The timestamp for the creation of the project.
- Name
owner
- Type
- ID
- Description
The user who owns the project.
- Name
creator
- Type
- ID
- Description
The user who originally created the project.
- Name
group_owner
- Type
- ID
- Description
The group who owns the project.
Papers
Projects can have zero or more papers associated with them. These are not set directly, but are determined from the PubMed IDs of the associated samples. Their attributes are:
- Name
id
- Type
- string
- Description
The PubMed ID.
- Name
title
- Type
- string
- Description
The full title of the paper.
- Name
year
- Type
- int
- Description
The year of publication.
- Name
journal
- Type
- string
- Description
The name of the journal the paper was published in.
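Because papers are derived from samples rather than set directly, the set of papers for a project follows from the pubmed values of its samples. The sketch below shows only that derivation step, under the assumption that samples without a PubMed ID are skipped and duplicates are collapsed. The function name is hypothetical, and how Flow fills in the title, year and journal is not covered here.

```python
from typing import Iterable, List, Optional


def project_pubmed_ids(sample_pubmed_ids: Iterable[Optional[str]]) -> List[str]:
    """Return the distinct PubMed IDs across a project's samples.

    IDs are returned in first-seen order; samples whose pubmed
    attribute is empty or missing are ignored.
    """
    distinct: List[str] = []
    for pubmed_id in sample_pubmed_ids:
        if pubmed_id and pubmed_id not in distinct:
            distinct.append(pubmed_id)
    return distinct
```

A project whose samples all lack a PubMed ID would therefore have zero papers, and several samples sharing one ID would contribute a single paper.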