Analysis
One of the core offerings of Flow is its ability to manage and run a suite of Nextflow pipelines. Here we will explore how those pipelines are represented, and how they are executed.
Pipelines
The pipeline model represents a single, distinct, named Nextflow pipeline. Different instances of Flow will have different pipelines installed. The key properties of pipelines are:
- Name
name
- Type
- string
- Description
The overall name for the pipeline, as it is presented to the user.
- Name
is_nfcore
- Type
- boolean
- Description
Whether or not this should be presented as an nf-core pipeline.
- Name
is_demultiplex
- Type
- boolean
- Description
Whether or not the pipeline demultiplexes samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.
- Name
imports_samples
- Type
- boolean
- Description
Whether or not the pipeline imports samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.
- Name
prepares_genome
- Type
- boolean
- Description
Whether or not this is a pipeline for creating genome preparations.
There are two ways that pipelines are organised. The first is via pipeline categories and pipeline subcategories. Each pipeline model has a many-to-one relationship with a pipeline subcategory model, which in turn has a many-to-one relationship with a pipeline category model. These determine how the pipelines are presented in the frontend, and each has a description of what the category/subcategory represents.
The second is with the pipeline repo model. Each pipeline must be associated with a git repo, with a distinct URL that Flow can pull from. Pipeline repos and categories/subcategories are orthogonal - multiple pipelines from a single repo can be in different categories, and the pipelines in a single category or subcategory can be from different repos.
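The two organising relationships can be pictured with a small sketch. This is illustrative only, not Flow's actual model code: the class names, the field names other than name and description, the helper names_by_category, and all example values (including the repo URL) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCategory:
    name: str
    description: str


@dataclass(frozen=True)
class PipelineSubcategory:
    name: str
    description: str
    category: PipelineCategory  # many subcategories belong to one category


@dataclass(frozen=True)
class PipelineRepo:
    url: str  # distinct git URL that can be pulled from


@dataclass(frozen=True)
class Pipeline:
    name: str
    subcategory: PipelineSubcategory  # many pipelines belong to one subcategory
    repo: PipelineRepo                # every pipeline is tied to a repo


def names_by_category(pipelines):
    """Group pipeline names under the category of their subcategory."""
    grouped = {}
    for pipeline in pipelines:
        key = pipeline.subcategory.category.name
        grouped.setdefault(key, []).append(pipeline.name)
    return grouped


# Hypothetical data: one repo supplying pipelines to two different categories.
rna = PipelineCategory("RNA", "Pipelines for RNA data")
utils = PipelineCategory("Utilities", "Supporting pipelines")
quant = PipelineSubcategory("Quantification", "Expression quantification", rna)
demux = PipelineSubcategory("Demultiplexing", "Splitting multiplexed reads", utils)
repo = PipelineRepo("https://example.com/pipelines.git")

pipelines = [
    Pipeline("RNA-Seq", quant, repo),
    Pipeline("Demultiplex", demux, repo),
]
print(names_by_category(pipelines))
```

Both pipelines share a repo yet land in different categories, which is the orthogonality described above.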
Pipeline versions
A single pipeline will have one or more pipeline versions - these refer to specific commits in the repo, and contain the paths to the actual files that should be run. When you run a pipeline, you are running a specific pipeline version. The key properties of pipeline versions are:
- Name
name
- Type
- string
- Description
The name of the pipeline version as presented to the user.
- Name
git
- Type
- string
- Description
The git commit to use for this version - this can be a commit hash, a branch, or a tag.
- Name
private
- Type
- bool
- Description
If private, only admin users will be able to see and access this pipeline. Useful for testing pipeline integrations with Flow.
- Name
active
- Type
- bool
- Description
Only active pipeline versions can be run. Typically the most recent versions will be active, while older versions are disabled by marking them inactive.
- Name
description
- Type
- string
- Description
A brief description of what the pipeline does (which may change in different versions).
- Name
long_description
- Type
- string
- Description
A more in-depth description of what the pipeline does (which also may change in different versions).
- Name
created
- Type
- int
- Description
The timestamp of the pipeline version's creation, which is used to order the versions.
- Name
path
- Type
- string
- Description
The path to the main pipeline .nf file (relative to the repo root). This is the file that should actually be run with nextflow run.
- Name
schema_path
- Type
- string
- Description
The path to the Flow schema file (relative to the repo root). This is a JSON file which describes the inputs and outputs of the pipeline.
- Name
config_paths
- Type
- string
- Description
A comma-separated list of paths to any additional config files within the repo that should be applied when running. These are in addition to the global config files applied to every pipeline.
- Name
copy_full
- Type
- bool
- Description
For some pipelines, there may be large, non-committed files in the repo that are needed for running. When this is true, the pipeline repo will be copied over for every run, instead of just being recreated from the .git directory.
- Name
genome_pipeline_versions
- Type
- [ID]
- Description
For pipelines which can take genome preparations as inputs, this defines the pipeline versions whose executions can be used.
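As a rough illustration of how some of these properties interact, the sketch below filters versions by active and private, orders them by created, and splits config_paths. It is a sketch under stated assumptions rather than Flow's implementation: the dataclass, the helper names runnable_versions and split_config_paths, the newest-first ordering, and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipelineVersion:
    name: str
    git: str          # commit hash, branch, or tag
    active: bool
    private: bool
    created: int      # timestamp used for ordering
    config_paths: str = ""


def runnable_versions(versions, is_admin=False):
    """Active versions the user may see, newest first (ordering is an assumption)."""
    visible = [v for v in versions if v.active and (is_admin or not v.private)]
    return sorted(visible, key=lambda v: v.created, reverse=True)


def split_config_paths(config_paths):
    """Turn the comma-separated config_paths string into a list of paths."""
    return [path.strip() for path in config_paths.split(",") if path.strip()]


# Hypothetical versions of a single pipeline.
versions = [
    PipelineVersion("1.0", "v1.0", active=False, private=False, created=100),
    PipelineVersion("2.0", "v2.0", active=True, private=False, created=200,
                    config_paths="conf/flow.config, conf/extra.config"),
    PipelineVersion("3.0-beta", "dev", active=True, private=True, created=300),
]

print([v.name for v in runnable_versions(versions)])
print([v.name for v in runnable_versions(versions, is_admin=True)])
print(split_config_paths(versions[1].config_paths))
```

Here the inactive 1.0 is never offered, and the private 3.0-beta only appears for an admin user.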
Executions
When you run a pipeline, you create an execution object. This represents the running of a single pipeline version.
Flow uses Celery to manage the execution of pipelines. When a run is submitted, an execution object is created and a response is returned to the user giving the ID of the new object. Meanwhile, the job is added to the Celery queue. Once Celery selects the job, it submits the Nextflow run in an install-specific way (different institutes will have different established systems for submitting such jobs) and then watches the execution output, populating the database with objects for whatever the run creates.
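The submit-then-queue flow can be sketched with the standard library alone. This is not Flow's code and does not use Celery: queue.Queue stands in for the Celery queue, a dict stands in for the database, and the function names, the state field, and the run_nextflow callback (representing the install-specific submission step) are all hypothetical.

```python
import itertools
import queue

executions = {}            # stand-in for the executions table
job_queue = queue.Queue()  # stand-in for the Celery queue
_ids = itertools.count(1)


def submit_run(pipeline_version, params):
    """Create the execution record, enqueue the job, and return the new ID at once."""
    execution_id = next(_ids)
    executions[execution_id] = {
        "pipeline_version": pipeline_version,
        "params": params,
        "state": "queued",  # hypothetical field, not the Nextflow status
    }
    job_queue.put(execution_id)
    return execution_id


def work_one(run_nextflow):
    """Take one queued job and run it via an install-specific callable."""
    execution_id = job_queue.get_nowait()
    record = executions[execution_id]
    record["state"] = "running"
    record["exit_code"] = run_nextflow(record)
    record["state"] = "finished"
    return execution_id


# The caller gets an ID immediately; the work happens later in the worker.
new_id = submit_run("example-version", {"aligner": "star"})
print(new_id, executions[new_id]["state"])
work_one(lambda record: 0)  # pretend the Nextflow run exited with code 0
print(new_id, executions[new_id]["state"], executions[new_id]["exit_code"])
```

The point of the split is that the response to the user never waits on the pipeline itself, only on creating the record and enqueueing the job.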
The key properties of executions are:
- Name
identifier
- Type
- string
- Description
The human readable name generated by Nextflow, typically two random words joined by an underscore.
- Name
pid
- Type
- string
- Description
The PID of the main Nextflow process on the server (the process executions will have their own PIDs).
- Name
dependent
- Type
- bool
- Description
Whether or not permissions on the samples or projects that the execution belongs to should also apply to the execution itself.
- Name
private
- Type
- bool
- Description
If false, anybody will be able to view the execution, even users not signed in (providing they can access the instance of Flow).
- Name
resequence_samples
- Type
- bool
- Description
Whether or not any samples created by the execution should be merged into existing samples where possible.
- Name
command
- Type
- string
- Description
The full command-line command that was used to run this execution.
- Name
params
- Type
- string
- Description
A JSON string containing the simple parameters passed to this execution.
- Name
data_params
- Type
- string
- Description
A JSON string containing the data IDs passed as parameters to this execution.
- Name
sample_params
- Type
- string
- Description
A JSON string containing the sample IDs passed as parameters to this execution, along with any additional columns of data.
- Name
nextflow_version
- Type
- string
- Description
The version of Nextflow this execution was run with.
- Name
stdout
- Type
- string
- Description
The full stdout produced by the run.
- Name
stderr
- Type
- string
- Description
The full stderr produced by the run.
- Name
exit_code
- Type
- int
- Description
The system exit code returned - 0 generally means it ran without issue.
- Name
status
- Type
- string
- Description
The Nextflow reported status of the execution.
- Name
created
- Type
- int
- Description
The timestamp for the initial creation of the execution in the request/response loop that submitted it.
- Name
task_started
- Type
- int
- Description
The timestamp for the start of the Celery process that submitted the execution. This may be a few seconds after created or it may be many hours, depending on the Celery queue.
- Name
started
- Type
- int
- Description
The timestamp for the start of the Nextflow job itself - typically milliseconds after task_started.
- Name
finished
- Type
- int
- Description
The timestamp for the end of the Nextflow job.
- Name
task_finished
- Type
- int
- Description
The timestamp for the end of the Celery process that submitted the execution. This may be some time after the Nextflow process itself ended, if there is a lot of post-processing to do.
- Name
owner
- Type
- ID
- Description
The user who owns the execution.
- Name
creator
- Type
- ID
- Description
The user who originally ran the execution.
- Name
group_owner
- Type
- ID
- Description
The group who owns the execution.
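The five timestamps above split an execution's lifetime into four phases, and the parameter fields are JSON strings that need decoding before use. The sketch below shows both. It assumes an execution is available as a plain dict keyed by the property names above; the helper name timing_breakdown, the phase labels, and the example values are hypothetical, and durations are in whatever unit the timestamps are stored in.

```python
import json


def timing_breakdown(execution):
    """Split an execution's lifetime into phases using its timestamps."""
    return {
        # time spent waiting in the Celery queue
        "queue_wait": execution["task_started"] - execution["created"],
        # gap between the Celery task starting and Nextflow starting
        "launch_delay": execution["started"] - execution["task_started"],
        # time the Nextflow job itself ran
        "run_time": execution["finished"] - execution["started"],
        # post-processing after Nextflow ended
        "post_processing": execution["task_finished"] - execution["finished"],
    }


# Hypothetical execution record.
execution = {
    "created": 1000,
    "task_started": 1030,
    "started": 1031,
    "finished": 4631,
    "task_finished": 4700,
    "params": '{"aligner": "star"}',
}

print(timing_breakdown(execution))
print(json.loads(execution["params"]))  # params is stored as a JSON string
```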
Process Executions
An execution will have zero or more process executions. Nextflow pipelines work by chaining together multiple processes which typically (and in Flow, always) run in their own containerised environment. The process execution model represents each of these for a given execution. The key attributes are:
- Name
name
- Type
- string
- Description
The full name of the process - typically the process name with some argument in parentheses at the end.
- Name
process_name
- Type
- string
- Description
The name of the process itself, without any distinguishing arguments.
- Name
identifier
- Type
- string
- Description
The Nextflow-generated identifier for the process.
- Name
started
- Type
- int
- Description
The timestamp for the start of the process execution.
- Name
finished
- Type
- int
- Description
The timestamp for the end of the process execution.
- Name
stdout
- Type
- string
- Description
The full stdout produced by the process execution.
- Name
stderr
- Type
- string
- Description
The full stderr produced by the process execution.
- Name
bash
- Type
- string
- Description
The bash script generated by Nextflow for this process execution.
- Name
exit_code
- Type
- int
- Description
The system exit code returned - 0 generally means it ran without issue.
- Name
status
- Type
- string
- Description
The Nextflow reported status of the process execution.
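The relationship between name and process_name, and the use of exit_code, can be illustrated as follows. This is a sketch only: it assumes the distinguishing argument is a single trailing parenthesised suffix, which may not match how Flow actually derives process_name, and the helper names and example records are hypothetical.

```python
import re
from collections import defaultdict


def base_process_name(name):
    """Strip a trailing parenthesised argument, e.g. 'FASTQC (sample1)' -> 'FASTQC'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", name)


def failures_by_process(process_executions):
    """Group the full names of failed process executions by base process name."""
    failed = defaultdict(list)
    for process_execution in process_executions:
        if process_execution["exit_code"] != 0:
            key = base_process_name(process_execution["name"])
            failed[key].append(process_execution["name"])
    return dict(failed)


# Hypothetical process executions belonging to one execution.
process_executions = [
    {"name": "FASTQC (sample1)", "exit_code": 0},
    {"name": "FASTQC (sample2)", "exit_code": 1},
    {"name": "MULTIQC", "exit_code": 0},
]

print(failures_by_process(process_executions))
```

Grouping by the base name makes it easy to see which step of a pipeline failed and for which inputs.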