Analysis

One of the core offerings of Flow is its ability to manage and run a suite of Nextflow pipelines. Here we will explore how those pipelines are represented, and how they are executed.

Pipelines

The pipeline model represents a single, distinct, named Nextflow pipeline. Different instances of Flow will have different pipelines installed. The key properties of pipelines are:

  • Name
    name
    Type
    string
    Description

    The overall name for the pipeline, as it is presented to the user.

  • Name
    is_nfcore
    Type
    boolean
    Description

    Whether or not this should be presented as an nf-core pipeline.

  • Name
    is_demultiplex
    Type
    boolean
    Description

    Whether or not the pipeline demultiplexes samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.

  • Name
    imports_samples
    Type
    boolean
    Description

    Whether or not the pipeline imports samples. When true, this tells Flow to perform additional steps when processing the pipeline outputs to ensure samples are created.

  • Name
    prepares_genome
    Type
    boolean
    Description

    Whether or not this is a pipeline for creating genome preparations.

There are two ways that pipelines are organised. The first is via pipeline categories and pipeline subcategories. Each pipeline model has a many-to-one relationship with a pipeline subcategory model, which in turn has a many-to-one relationship with a pipeline category model. These determine how the pipelines are presented in the frontend, and each category and subcategory has a description of what it represents.

The second is with the pipeline repo model. Each pipeline must be associated with a git repo, with a distinct URL that Flow can pull from. Pipeline repos and categories/subcategories are orthogonal - multiple pipelines from a single repo can be in different categories, and the pipelines in a single category or subcategory can be from different repos.
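
The two organising schemes can be pictured with a few plain data classes. This is only an illustrative sketch: the class names, the `url` field, the example URL, and the example categories are all assumptions made for the illustration, not Flow's actual model definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCategory:
    name: str
    description: str


@dataclass(frozen=True)
class PipelineSubcategory:
    name: str
    description: str
    category: PipelineCategory  # many subcategories belong to one category


@dataclass(frozen=True)
class PipelineRepo:
    url: str  # the git URL Flow pulls from (field name is an assumption)


@dataclass(frozen=True)
class Pipeline:
    name: str
    subcategory: PipelineSubcategory  # many pipelines belong to one subcategory
    repo: PipelineRepo                # every pipeline is tied to one repo
    is_nfcore: bool = False
    is_demultiplex: bool = False
    imports_samples: bool = False
    prepares_genome: bool = False


# Made-up example data: one repo, two categories.
repo = PipelineRepo(url="https://example.org/pipelines.git")
rna = PipelineCategory("RNA", "Pipelines for RNA data")
util = PipelineCategory("Utilities", "Supporting pipelines")
quant = PipelineSubcategory("Quantification", "Expression quantification", rna)
demux = PipelineSubcategory("Demultiplexing", "Splitting multiplexed reads", util)

rnaseq = Pipeline("RNA-Seq", quant, repo, is_nfcore=True)
demultiplex = Pipeline("Demultiplex", demux, repo, is_demultiplex=True)

# Same repo, different categories: the two groupings are independent.
same_repo = rnaseq.repo == demultiplex.repo
same_category = rnaseq.subcategory.category == demultiplex.subcategory.category
print(same_repo, same_category)  # prints: True False
```

Here the category is reached through the subcategory, mirroring the chain of many-to-one relationships, while the repo is a separate link on the pipeline itself.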

Pipeline versions

A single pipeline will have one or more pipeline versions - these refer to specific commits in the repo, and contain the paths to the actual files that should be run. When you run a pipeline, you are running a specific pipeline version. The key properties of pipeline versions are:

  • Name
    name
    Type
    string
    Description

    The name of the pipeline version as presented to the user.

  • Name
    git
    Type
    string
    Description

    The git commit to use for this version - this can be a commit hash, a branch, or a tag.

  • Name
    private
    Type
    bool
    Description

    If private, only admin users will be able to see and access this pipeline. Useful for testing pipeline integrations with Flow.

  • Name
    active
    Type
    bool
    Description

    Only active pipeline versions can be run. Typically the most recent versions will be active, while older versions will be disabled by setting them to be inactive.

  • Name
    description
    Type
    string
    Description

    A brief description of what the pipeline does (which may change in different versions).

  • Name
    long_description
    Type
    string
    Description

    A more in-depth description of what the pipeline does (which also may change in different versions).

  • Name
    created
    Type
    int
    Description

    The timestamp of the pipeline version's creation, which is used to order the versions.

  • Name
    path
    Type
    string
    Description

    The path to the main pipeline .nf file (relative to the repo root). This is the file that should actually be run with nextflow run.

  • Name
    schema_path
    Type
    string
    Description

    The path to the Flow schema file (relative to the repo root). This is a JSON file which describes the inputs and outputs of the pipeline.

  • Name
    config_paths
    Type
    string
    Description

    A comma-separated list of paths to any additional config files within the repo that should be applied when running. These are in addition to the global config files applied to every pipeline.

  • Name
    copy_full
    Type
    bool
    Description

    For some pipelines, there may be large, non-committed files in the repo that are needed for running. When this is true, the pipeline repo will be copied over for every run, instead of just being recreated from the .git directory.

  • Name
    genome_pipeline_versions
    Type
    [ID]
    Description

    For pipelines which can take genome preparations as inputs, this defines the pipeline versions whose executions can be used.
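
As a rough illustration of how some of these properties interact, the sketch below splits `config_paths` into individual paths and picks the newest version a given user could run. The helper names `split_config_paths` and `latest_runnable` are hypothetical, the example versions are made up, and it assumes that a larger `created` value means a newer version.

```python
from dataclasses import dataclass


@dataclass
class PipelineVersion:
    name: str
    git: str            # commit hash, branch, or tag
    path: str           # main .nf file, relative to the repo root
    schema_path: str    # Flow schema JSON, relative to the repo root
    config_paths: str = ""
    active: bool = True
    private: bool = False
    created: int = 0


def split_config_paths(config_paths):
    """Turn the comma-separated string into a list, ignoring blanks."""
    return [p.strip() for p in config_paths.split(",") if p.strip()]


def latest_runnable(versions, is_admin=False):
    """Newest active version visible to the user, or None if there is none."""
    candidates = [
        v for v in versions
        if v.active and (is_admin or not v.private)
    ]
    return max(candidates, key=lambda v: v.created, default=None)


# Made-up versions of a single pipeline.
versions = [
    PipelineVersion("1.0", "v1.0", "main.nf", "schema.json",
                    active=False, created=100),
    PipelineVersion("1.1", "v1.1", "main.nf", "schema.json",
                    config_paths="conf/base.config, conf/flow.config",
                    created=200),
    PipelineVersion("2.0-beta", "dev", "main.nf", "schema.json",
                    private=True, created=300),
]

print(latest_runnable(versions).name)                 # prints: 1.1
print(latest_runnable(versions, is_admin=True).name)  # prints: 2.0-beta
print(split_config_paths(versions[1].config_paths))
```

The inactive 1.0 version is never offered, and the private 2.0-beta version is only offered to an admin user.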

Executions

When you run a pipeline, you create an execution object. This represents the running of a single pipeline version.

Flow uses Celery to manage the execution of pipelines. When a run is submitted, an execution object is created and a response is returned to the user giving the ID of the new object. Meanwhile, the job is added to the Celery queue. Once Celery selects the job, it submits the Nextflow run in an install-specific way (different institutes have different established systems for submitting such jobs) and then watches the execution output, populating the database with objects for whatever the run creates.
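
The submit-then-queue flow described above can be mimicked with standard-library stand-ins. This is a simplified sketch and not Flow's code or Celery's API: the dictionary plays the role of the database, the deque plays the role of the Celery queue, and `submit_run`, `work_one`, the `launch` callback, the example parameters, and the status string are all hypothetical.

```python
import time
from collections import deque

executions = {}      # stand-in for the executions table
job_queue = deque()  # stand-in for the Celery queue


def submit_run(pipeline_version_id, params):
    """Create the execution record, queue the job, and return immediately."""
    execution_id = len(executions) + 1
    executions[execution_id] = {
        "pipeline_version": pipeline_version_id,
        "params": params,
        "created": int(time.time()),
        "task_started": None,
        "task_finished": None,
        "status": None,
    }
    job_queue.append(execution_id)
    return {"id": execution_id}  # the user gets the ID before anything runs


def work_one(launch):
    """Take the next queued job and run it via an install-specific launcher."""
    execution_id = job_queue.popleft()
    execution = executions[execution_id]
    execution["task_started"] = int(time.time())
    execution["status"] = launch(execution)  # would wrap the Nextflow run
    execution["task_finished"] = int(time.time())
    return execution_id


# The request/response loop only creates the record and queues the job.
response = submit_run(pipeline_version_id=7, params={"genome": "hg38"})
queued_before_work = len(job_queue)

# Later, a worker picks the job up; the launcher here is a placeholder.
work_one(launch=lambda execution: "COMPLETED")
```

Keeping `launch` as a parameter reflects the point that the way a Nextflow job is submitted differs between installations, while the bookkeeping around it stays the same.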

The key properties of executions are:

  • Name
    identifier
    Type
    string
    Description

    The human-readable name generated by Nextflow, typically two random words joined by an underscore.

  • Name
    pid
    Type
    string
    Description

    The PID of the main Nextflow process on the server (the process executions will have their own PIDs).

  • Name
    dependent
    Type
    bool
    Description

    Whether or not permissions applying to any samples or projects the execution is in should apply to this execution.

  • Name
    private
    Type
    bool
    Description

    If false, anybody will be able to view the execution, even users who are not signed in (provided they can access the instance of Flow).

  • Name
    resequence_samples
    Type
    bool
    Description

    Whether or not any samples created by the execution should be merged into existing samples where possible.

  • Name
    command
    Type
    string
    Description

    The full command-line command that was used to run this execution.

  • Name
    params
    Type
    string
    Description

    A JSON string containing the simple parameters passed to this execution.

  • Name
    data_params
    Type
    string
    Description

    A JSON string containing the data IDs passed as parameters to this execution.

  • Name
    sample_params
    Type
    string
    Description

    A JSON string containing the sample IDs passed as parameters to this execution, along with any additional columns of data.

  • Name
    nextflow_version
    Type
    string
    Description

    The version of Nextflow this execution was run with.

  • Name
    stdout
    Type
    string
    Description

    The full stdout produced by the run.

  • Name
    stderr
    Type
    string
    Description

    The full stderr produced by the run.

  • Name
    exit_code
    Type
    int
    Description

    The system exit code returned - 0 generally means it ran without issue.

  • Name
    status
    Type
    string
    Description

    The Nextflow reported status of the execution.

  • Name
    created
    Type
    int
    Description

    The timestamp for the initial creation of the execution in the request/response loop that submitted it.

  • Name
    task_started
    Type
    int
    Description

    The timestamp for the start of the Celery process that submitted the execution. This may be a few seconds after created or it may be many hours, depending on the Celery queue.

  • Name
    started
    Type
    int
    Description

    The timestamp for the start of the Nextflow job itself - typically milliseconds after task_started.

  • Name
    finished
    Type
    int
    Description

    The timestamp for the end of the Nextflow job.

  • Name
    task_finished
    Type
    int
    Description

    The timestamp for the end of the Celery process that submitted the execution. This may be some time after the Nextflow process itself ended, if there is a lot of post-processing to do.

  • Name
    owner
    Type
    ID
    Description

    The user who owns the execution.

  • Name
    creator
    Type
    ID
    Description

    The user who originally ran the execution.

  • Name
    group_owner
    Type
    ID
    Description

    The group who owns the execution.
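
The five timestamps and the three JSON parameter strings lend themselves to a short worked example. The sketch below assumes an execution has been fetched as a plain dictionary keyed by the property names above; the helper names, the example values, and the shapes of the decoded parameters are made up for illustration, and durations come out in whatever unit the timestamps are stored in.

```python
import json


def execution_timings(execution):
    """Break an execution's lifetime into its three phases."""
    return {
        # Time spent waiting in the Celery queue.
        "queue_wait": execution["task_started"] - execution["created"],
        # Time the Nextflow job itself was running.
        "nextflow_runtime": execution["finished"] - execution["started"],
        # Time spent post-processing outputs after Nextflow ended.
        "post_processing": execution["task_finished"] - execution["finished"],
    }


def decode_params(execution):
    """Parse the three JSON strings into Python objects."""
    keys = ("params", "data_params", "sample_params")
    return {key: json.loads(execution[key]) for key in keys}


# Made-up execution record.
execution = {
    "created": 1000,
    "task_started": 1005,
    "started": 1005,
    "finished": 1605,
    "task_finished": 1650,
    "params": '{"aligner": "star"}',
    "data_params": '{"fasta": 12}',
    "sample_params": '{"samples": [3, 4]}',
}

timings = execution_timings(execution)
decoded = decode_params(execution)
print(timings)
print(decoded["params"]["aligner"])
```

A large `queue_wait` points at a busy Celery queue, while a large `post_processing` value points at heavy output handling after the pipeline itself has finished.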

Process Executions

An execution will have zero or more process executions. Nextflow pipelines work by chaining together multiple processes which typically (and in Flow, always) run in their own containerised environment. The process execution model represents each of these for a given execution. The key attributes are:

  • Name
    name
    Type
    string
    Description

    The full name of the process - typically the process name with some argument in parentheses at the end.

  • Name
    process_name
    Type
    string
    Description

    The name of the process itself, without any distinguishing arguments.

  • Name
    identifier
    Type
    string
    Description

    The Nextflow-generated identifier for the process.

  • Name
    started
    Type
    int
    Description

    The timestamp for the start of the process execution.

  • Name
    finished
    Type
    int
    Description

    The timestamp for the end of the process execution.

  • Name
    stdout
    Type
    string
    Description

    The full stdout produced by the process execution.

  • Name
    stderr
    Type
    string
    Description

    The full stderr produced by the process execution.

  • Name
    bash
    Type
    string
    Description

    The bash script generated by Nextflow for this process execution.

  • Name
    exit_code
    Type
    int
    Description

    The system exit code returned - 0 generally means it ran without issue.

  • Name
    status
    Type
    string
    Description

    The Nextflow reported status of the process execution.
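
To make the difference between name and process_name concrete, the sketch below strips a trailing parenthesised argument from a full name and then summarises a set of process executions. It is an assumption that this simple rule matches how Flow derives process_name; the helper `base_process_name` and the example records are hypothetical.

```python
from collections import Counter


def base_process_name(name):
    """Drop a trailing ' (argument)' suffix, if the name has one."""
    head, separator, _ = name.partition(" (")
    if separator and name.endswith(")"):
        return head
    return name


# Made-up process executions for one execution.
processes = [
    {"name": "FASTQC (sample_1)", "exit_code": 0},
    {"name": "FASTQC (sample_2)", "exit_code": 1},
    {"name": "MULTIQC", "exit_code": 0},
]

# How many times each process ran.
counts = Counter(base_process_name(p["name"]) for p in processes)

# Which process executions returned a non-zero exit code.
failed = [p["name"] for p in processes if p["exit_code"] != 0]

print(dict(counts))
print(failed)
```

Grouping by the base name shows how often each process ran, while the full name identifies exactly which invocation failed.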
