Analysis

One of the core offerings of Flow is its ability to manage and run a suite of Nextflow pipelines. Here we will explore how those pipelines are represented, and how they are executed.

Pipelines

The pipeline model represents a single, distinct, named Nextflow pipeline. Different instances of Flow will have different pipelines installed. The key properties of pipelines are:

name (string): The overall name for the pipeline, as it is presented to the user.
is_nfcore (boolean): Whether or not this should be presented as an nf-core pipeline.
is_demultiplex (boolean): Whether or not the pipeline demultiplexes samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.
imports_samples (boolean): Whether or not the pipeline imports samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.

There are two ways that pipelines are organised. The first is via pipeline categories and pipeline subcategories. Each pipeline model has a many-to-one relationship with a pipeline subcategory model, which in turn has a many-to-one relationship with a pipeline category model. These determine how the pipelines are presented in the frontend, and each has a description of what the category/subcategory represents.

The second is with the pipeline repo model. Each pipeline must be associated with a git repo, with a distinct URL that Flow can pull from. Pipeline repos and categories/subcategories are orthogonal: multiple pipelines from a single repo can be in different categories, and the pipelines in a single category or subcategory can be from different repos.
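The two organising schemes can be pictured with a small sketch. This is not Flow's actual model code; the class and field names (PipelineCategory, url, and so on) and the example values are assumptions chosen only to mirror the relationships described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCategory:
    name: str
    description: str


@dataclass(frozen=True)
class PipelineSubcategory:
    name: str
    description: str
    category: PipelineCategory  # many subcategories belong to one category


@dataclass(frozen=True)
class PipelineRepo:
    url: str  # the git URL the pipeline files are pulled from


@dataclass(frozen=True)
class Pipeline:
    name: str
    subcategory: PipelineSubcategory  # many pipelines belong to one subcategory
    repo: PipelineRepo                # independent of the category hierarchy


# Hypothetical example data: one repo supplying pipelines to two subcategories.
genomics = PipelineCategory("Genomics", "Example category")
alignment = PipelineSubcategory("Alignment", "Example subcategory", genomics)
variants = PipelineSubcategory("Variant calling", "Example subcategory", genomics)
shared_repo = PipelineRepo("https://example.com/pipelines.git")

pipelines = [
    Pipeline("Aligner", alignment, shared_repo),
    Pipeline("Caller", variants, shared_repo),
]

repo_urls = {p.repo.url for p in pipelines}
subcategory_names = {p.subcategory.name for p in pipelines}
```

Here both pipelines come from the same repo but sit in different subcategories, which is the orthogonality described above.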

Pipeline versions

A single pipeline will have one or more pipeline versions - these refer to specific commits in the repo, and contain the paths to the actual files that should be run. When you run a pipeline, you are running a specific pipeline version. The key properties of pipeline versions are:

name (string): The name of the pipeline version as presented to the user.
git (string): The git commit to use for this version - this can be a commit hash, a branch, or a tag.
private (bool): If private, only admin users will be able to see and access this pipeline. Useful for testing pipeline integrations with Flow.
active (bool): Only active pipeline versions can be run. Typically the most recent versions will be active, while older versions will be disabled by setting them to be inactive.
description (string): A brief description of what the pipeline does (which may change in different versions).
long_description (string): A more in-depth description of what the pipeline does (which also may change in different versions).
created (int): The timestamp of the pipeline version's creation, which is used to order the versions.
path (string): The path to the main pipeline .nf file (relative to the repo root). This is the file that should actually be run with nextflow run.
schema_path (string): The path to the Flow schema file (relative to the repo root). This is a JSON file which describes the inputs and outputs of the pipeline.
config_paths (string): A comma separated list of paths to any additional config files within the repo that should be applied when running. These are in addition to the global config files run for every pipeline.
copy_full (bool): For some pipelines, there may be large, non-committed files in the repo that are needed for running. When this is true, the pipeline repo will be copied over for every run, instead of just being recreated from the .git directory.
upstream_pipeline_versions ([ID]): For pipelines which can take fileset preparations as inputs, this defines the pipeline versions whose executions can be used.
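As an illustration of how a client might work with these properties, the sketch below picks the newest version a user is allowed to run and splits config_paths into individual paths. The helper names (latest_runnable, split_config_paths) and the example values are hypothetical and are not part of Flow; only the property names come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PipelineVersion:
    name: str
    git: str
    active: bool
    private: bool
    created: int
    path: str
    config_paths: str = ""


def split_config_paths(config_paths: str) -> List[str]:
    """Split the comma separated config_paths string, dropping empty entries."""
    return [part.strip() for part in config_paths.split(",") if part.strip()]


def latest_runnable(versions: List[PipelineVersion], is_admin: bool = False) -> Optional[PipelineVersion]:
    """Return the most recently created version that is active and visible."""
    candidates = [
        v for v in versions
        if v.active and (is_admin or not v.private)
    ]
    return max(candidates, key=lambda v: v.created, default=None)


# Hypothetical example data.
versions = [
    PipelineVersion("1.0", "v1.0", active=False, private=False, created=100, path="main.nf"),
    PipelineVersion("1.1", "v1.1", active=True, private=False, created=200, path="main.nf",
                    config_paths="conf/base.config, conf/flow.config"),
    PipelineVersion("2.0-beta", "dev", active=True, private=True, created=300, path="main.nf"),
]

public_choice = latest_runnable(versions)
admin_choice = latest_runnable(versions, is_admin=True)
extra_configs = split_config_paths(public_choice.config_paths)
```

With this data a regular user would get version 1.1, while an admin would also see the private 2.0-beta version.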

Executions

When you run a pipeline, you create an execution object. This represents the running of a single pipeline version.

Flow uses Celery to manage the execution of pipelines. When a run is submitted, an execution object is created and a response is returned to the user giving the ID of the new object. Meanwhile, the job is added to the Celery queue. Once Celery selects the job, it submits the Nextflow run in an install-specific way (different institutes will have different established systems for submitting such jobs) and then watches the execution output, populating the database with objects that represent what the run creates.
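The submit-then-process pattern described here can be sketched without Celery at all. The example below deliberately swaps Celery for the standard library's queue.Queue and a plain dictionary in place of the database, so it shows only the shape of the flow, not Flow's implementation. The function names, the "state" values and launch_nextflow are all hypothetical.

```python
import itertools
import queue

executions = {}             # stand-in for the database table of executions
job_queue = queue.Queue()   # stand-in for the Celery queue
_next_id = itertools.count(1)


def submit_run(pipeline_version_id, params):
    """Create the execution record, enqueue the job, and return the ID at once."""
    execution_id = next(_next_id)
    executions[execution_id] = {
        "pipeline_version": pipeline_version_id,
        "params": params,
        "state": "queued",  # illustrative value, not a real Nextflow status
    }
    job_queue.put(execution_id)
    return execution_id


def launch_nextflow(execution):
    """Placeholder for the install-specific way of submitting the Nextflow run."""
    return 0  # pretend the run exited cleanly


def worker_step():
    """Take one job off the queue and process it, as a worker would."""
    execution_id = job_queue.get()
    execution = executions[execution_id]
    execution["state"] = "running"
    execution["exit_code"] = launch_nextflow(execution)
    execution["state"] = "finished"
    job_queue.task_done()
    return execution_id
```

Calling submit_run returns an ID immediately while the record is still marked as queued; the record only changes once worker_step picks the job up, which mirrors the gap between submission and the start of the run.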

The key properties of executions are:

identifier (string): The human readable name generated by Nextflow, typically two random words joined by an underscore.
pid (string): The PID of the main Nextflow process on the server (the process executions will have their own PIDs).
dependent (bool): Whether or not permissions applying to any samples or projects the execution is in should apply to this execution.
private (bool): If false, anybody will be able to view the execution, even users not signed in (providing they can access the instance of Flow).
resequence_samples (bool): Whether or not any samples created by the execution should be merged into existing samples where possible.
command (string): The full command-line command that was used to run this execution.
params (string): A JSON string containing the simple parameters passed to this execution.
data_params (string): A JSON string containing the data IDs passed as parameters to this execution.
sample_params (string): A JSON string containing the sample IDs passed as parameters to this execution, along with any additional columns of data.
nextflow_version (string): The version of Nextflow this execution was run with.
stdout (string): The full stdout produced by the run.
stderr (string): The full stderr produced by the run.
exit_code (int): The system exit code returned - 0 generally means it ran without issue.
status (string): The Nextflow reported status of the execution.
created (int): The timestamp for the initial creation of the execution in the request/response loop that submitted it.
task_started (int): The timestamp for the start of the Celery process that submitted the execution. This may be a few seconds after created or it may be many hours, depending on the Celery queue.
started (int): The timestamp for the start of the Nextflow job itself - typically milliseconds after task_started.
finished (int): The timestamp for the end of the Nextflow job.
task_finished (int): The timestamp for the end of the Celery process that submitted the execution. This may be some time after the Nextflow process itself ended, if there is a lot of post-processing to do.
owner (ID): The user who owns the execution.
creator (ID): The user who originally ran the execution.
group_owner (ID): The group who owns the execution.
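The five timestamps split an execution's lifetime into queueing, running and post-processing. The sketch below shows one way a client could derive those intervals and decode the three JSON parameter strings. It assumes the execution is available as a plain dictionary keyed by the property names above; the helper names and the example values, including the shape of the JSON, are made up for illustration.

```python
import json


def timing_breakdown(execution):
    """Split the lifetime into phases, in whatever unit the timestamps use."""
    return {
        "queue_wait": execution["task_started"] - execution["created"],
        "nextflow_runtime": execution["finished"] - execution["started"],
        "post_processing": execution["task_finished"] - execution["finished"],
    }


def decode_params(execution):
    """Parse the three JSON strings into Python objects."""
    return {
        key: json.loads(execution[key])
        for key in ("params", "data_params", "sample_params")
    }


# Hypothetical execution record.
example = {
    "created": 1000,
    "task_started": 1030,
    "started": 1030,
    "finished": 4630,
    "task_finished": 4700,
    "params": '{"genome": "hg38"}',
    "data_params": '{"fasta": "123"}',
    "sample_params": '{}',
}

phases = timing_breakdown(example)
decoded = decode_params(example)
```

For this record the breakdown gives 30 units waiting in the queue, 3600 running Nextflow, and 70 spent on post-processing.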

Process Executions

An execution will have zero or more process executions. Nextflow pipelines work by chaining together multiple processes which typically (and in Flow, always) run in their own containerised environment. The process execution model represents each of these for a given execution. The key attributes are:

name (string): The full name of the process - typically the process name with some argument in parentheses at the end.
process_name (string): The name of the process itself, without any distinguishing arguments.
identifier (string): The Nextflow-generated identifier for the process.
started (int): The timestamp for the start of the process execution.
finished (int): The timestamp for the end of the process execution.
stdout (string): The full stdout produced by the process execution.
stderr (string): The full stderr produced by the process execution.
bash (string): The bash script generated by Nextflow for this process execution.
exit_code (int): The system exit code returned - 0 generally means it ran without issue.
status (string): The Nextflow reported status of the process execution.
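
The relationship between name and process_name can be illustrated with a short sketch that strips a trailing parenthesised argument and totals run time per process. This is an assumption about how the two fields relate, based only on the descriptions above; Flow may derive process_name differently. The helper names and example records are hypothetical.

```python
import re
from collections import defaultdict

# Matches a trailing "(...)" argument and any whitespace around it.
_TRAILING_ARGUMENT = re.compile(r"\s*\(.*\)\s*$")


def base_process_name(full_name):
    """Drop the parenthesised argument, if any, from a full process name."""
    return _TRAILING_ARGUMENT.sub("", full_name)


def runtime_by_process(process_executions):
    """Total (finished - started) per process name, in the timestamps' unit."""
    totals = defaultdict(int)
    for pe in process_executions:
        totals[base_process_name(pe["name"])] += pe["finished"] - pe["started"]
    return dict(totals)


# Hypothetical process execution records.
records = [
    {"name": "FASTQC (sample1)", "started": 10, "finished": 25},
    {"name": "FASTQC (sample2)", "started": 12, "finished": 30},
    {"name": "MULTIQC", "started": 31, "finished": 40},
]

totals = runtime_by_process(records)
```

Grouping this way collapses the two FASTQC runs into a single entry, which is the kind of view process_name makes possible without any string handling.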