subsample_by_metadata_with_focal

WORKFLOW subsample_by_metadata_with_focal

File Path	`pipes/WDL/workflows/subsample_by_metadata_with_focal.wdl`
WDL Version	1.0
Type	workflow

Imports

Namespace	Path
`nextstrain`	`../tasks/tasks_nextstrain.wdl`
`reports`	`../tasks/tasks_reports.wdl`
`utils`	`../tasks/tasks_utils.wdl`

Workflow: subsample_by_metadata_with_focal

Filter and subsample a global sequence set with a bias towards a geographic area of interest.

Inputs

Name	Type	Description	Default
`sample_metadata_tsvs`	`Array[File]+`	Tab-separated metadata files that contain binning variables and values. Must contain all samples: output will be filtered to the IDs present in these files.	-
`sequences_fasta`	`File`	Sequences in fasta format.	-
`priorities`	`File?`	-	-
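`derived_cols.lab_highlight_loc`	`String?`	-	-
`prefilter.sequences_per_group`	`Int?`	-	-
`prefilter.group_by`	`String?`	-	-
`prefilter.include`	`File?`	-	-
`prefilter.exclude`	`File?`	-	-
`prefilter.min_date`	`Float?`	-	-
`prefilter.max_date`	`Float?`	-	-
`prefilter.min_length`	`Int?`	-	-
`prefilter.priority`	`File?`	-	-
`prefilter.subsample_seed`	`Int?`	-	-
`prefilter.exclude_where`	`Array[String]?`	-	-
`prefilter.include_where`	`Array[String]?`	-	-
`subsample_focal.include`	`File?`	-	-
`subsample_focal.exclude`	`File?`	-	-
`subsample_focal.min_date`	`Float?`	-	-
`subsample_focal.max_date`	`Float?`	-	-
`subsample_focal.min_length`	`Int?`	-	-
`subsample_focal.subsample_seed`	`Int?`	-	-
`subsample_focal.include_where`	`Array[String]?`	-	-
`subsample_global.include`	`File?`	-	-
`subsample_global.exclude`	`File?`	-	-
`subsample_global.min_date`	`Float?`	-	-
`subsample_global.max_date`	`Float?`	-	-
`subsample_global.min_length`	`Int?`	-	-
`subsample_global.subsample_seed`	`Int?`	-	-
`subsample_global.include_where`	`Array[String]?`	-	-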
22 optional inputs with default values
`focal_variable`	`String`	The dataset will be bifurcated based on this column header.	"region"
`focal_value`	`String`	The dataset will be bifurcated based on whether the focal_variable column matches this value or not. Rows that match this value are considered to be part of the 'focal' set of interest; rows that do not are part of the 'global' set.	"North America"
`focal_bin_variable`	`String`	The focal subset of samples will be evenly subsampled across the discrete values of this column header.	"division"
`focal_bin_max`	`Int`	The output will contain no more than this number of focal samples from each discrete value in the focal_bin_variable column.	50
`global_bin_variable`	`String`	The global subset of samples will be evenly subsampled across the discrete values of this column header.	"country"
`global_bin_max`	`Int`	The output will contain no more than this number of global samples from each discrete value in the global_bin_variable column.	50
`tsv_join.out_suffix`	`String`	-	".txt"
`tsv_join.prefer_first`	`Boolean`	-	true
`tsv_join.machine_mem_gb`	`Int`	-	7
`derived_cols.table_map`	`Array[File]`	-	[]
`derived_cols.docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`derived_cols.disk_size`	`Int`	-	50
`prefilter.non_nucleotide`	`Boolean`	-	true
`prefilter.docker`	`String`	-	"docker.io/nextstrain/base:build-20240318T173028Z"
`prefilter.disk_size`	`Int`	-	750
`subsample_focal.non_nucleotide`	`Boolean`	-	true
`subsample_focal.docker`	`String`	-	"docker.io/nextstrain/base:build-20240318T173028Z"
`subsample_focal.disk_size`	`Int`	-	750
`subsample_global.non_nucleotide`	`Boolean`	-	true
`subsample_global.docker`	`String`	-	"docker.io/nextstrain/base:build-20240318T173028Z"
`subsample_global.disk_size`	`Int`	-	750
`cat_fasta.cpus`	`Int`	-	4
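
As a rough illustration of how the workflow-level inputs above fit together, the following Python sketch assembles an inputs dictionary that could be written out as a WDL inputs JSON. The file names are hypothetical placeholders, `build_inputs` is an illustrative helper that is not part of the pipeline, and the `<workflow>.<input>` key layout is the convention used by common WDL runners; check your runner's documentation before relying on it.

```python
import json

WORKFLOW = "subsample_by_metadata_with_focal"


def build_inputs(metadata_tsvs, sequences_fasta, **overrides):
    """Return a dict keyed as '<workflow>.<input>' (illustrative helper)."""
    if not metadata_tsvs:
        # Array[File]+ means at least one metadata file is required.
        raise ValueError("sample_metadata_tsvs must contain at least one file")
    inputs = {
        f"{WORKFLOW}.sample_metadata_tsvs": list(metadata_tsvs),
        f"{WORKFLOW}.sequences_fasta": sequences_fasta,
    }
    # Any input not overridden here keeps the default listed in the table.
    for name, value in overrides.items():
        inputs[f"{WORKFLOW}.{name}"] = value
    return inputs


# Hypothetical file names and override values, for illustration only.
inputs = build_inputs(
    ["metadata.tsv"],
    "sequences.fasta",
    focal_value="Europe",
    focal_bin_variable="country",
    focal_bin_max=100,
)
print(json.dumps(inputs, indent=2))
```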

Outputs

Name	Type	Expression
`metadata_merged`	`File`	`derived_cols.derived_metadata`
`keep_list`	`File`	`fasta_to_ids.ids_txt`
`subsampled_sequences`	`File`	`cat_fasta.combined`
`focal_kept`	`Int`	`subsample_focal.sequences_out`
`global_kept`	`Int`	`subsample_global.sequences_out`
`sequences_kept`	`Int`	`subsample_focal.sequences_out + subsample_global.sequences_out`

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS `tsv_join` ↗

Input Mappings (3)

Input	Value
`input_tsvs`	`sample_metadata_tsvs`
`id_col`	`'strain'`
`out_basename`	`"metadata-merged"`

CALL TASKS `derived_cols` ↗

Input Mappings (1)

Input	Value
`metadata_tsv`	`select_first(flatten([[tsv_join.out_tsv], sample_metadata_tsvs]))`
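
The `metadata_tsv` expression selects the merged table when `tsv_join` ran and otherwise falls back to the first input file. Because `tsv_join` sits inside a conditional, its output is undefined when only one metadata file is supplied. A minimal Python sketch of that selection follows; `None` stands in for an undefined optional, the two helpers re-implement the WDL functions of the same name for illustration, and the merged file name is a placeholder.

```python
def flatten(nested):
    """Concatenate a list of lists into one list, like WDL flatten()."""
    return [item for inner in nested for item in inner]


def select_first(values):
    """Return the first defined (non-None) value, like WDL select_first()."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("no defined value found")


def metadata_for_derived_cols(sample_metadata_tsvs):
    # tsv_join only runs when more than one metadata file is supplied;
    # "metadata-merged.txt" is a placeholder for its output path.
    joined = "metadata-merged.txt" if len(sample_metadata_tsvs) > 1 else None
    return select_first(flatten([[joined], sample_metadata_tsvs]))


print(metadata_for_derived_cols(["a.tsv"]))
print(metadata_for_derived_cols(["a.tsv", "b.tsv"]))
```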

CALL TASKS `prefilter` ↗ → filter_subsample_sequences

Input Mappings (2)

Input	Value
`sequences_fasta`	`sequences_fasta`
`sample_metadata_tsv`	`derived_cols.derived_metadata`

CALL TASKS `subsample_focal` ↗ → filter_subsample_sequences

Input Mappings (6)

Input	Value
`sequences_fasta`	`prefilter.filtered_fasta`
`sample_metadata_tsv`	`derived_cols.derived_metadata`
`exclude_where`	`["~{focal_variable}!=~{focal_value}"]`
`sequences_per_group`	`focal_bin_max`
`group_by`	`focal_bin_variable`
`priority`	`priorities`

CALL TASKS `subsample_global` ↗ → filter_subsample_sequences

Input Mappings (6)

Input	Value
`sequences_fasta`	`prefilter.filtered_fasta`
`sample_metadata_tsv`	`derived_cols.derived_metadata`
`exclude_where`	`["~{focal_variable}=~{focal_value}"]`
`sequences_per_group`	`global_bin_max`
`group_by`	`global_bin_variable`
`priority`	`priorities`
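
The `subsample_focal` and `subsample_global` calls use complementary `exclude_where` expressions, so every prefiltered record lands in exactly one of the two sets, and each set is then capped per group. The Python sketch below illustrates that logic under simplifying assumptions: rows are plain dictionaries, and the first records seen in each group are kept. The actual task delegates selection to its filtering tool and can take `priorities` and `subsample_seed` into account, so which records survive will generally differ.

```python
from collections import defaultdict


def cap_per_group(rows, group_by, max_per_group):
    """Keep at most max_per_group rows for each distinct value of group_by."""
    kept, counts = [], defaultdict(int)
    for row in rows:
        key = row.get(group_by)
        if counts[key] < max_per_group:
            counts[key] += 1
            kept.append(row)
    return kept


def subsample_with_focal(rows, focal_variable="region", focal_value="North America",
                         focal_bin_variable="division", focal_bin_max=50,
                         global_bin_variable="country", global_bin_max=50):
    # Focal set: equivalent to excluding rows where focal_variable != focal_value.
    focal = [r for r in rows if r.get(focal_variable) == focal_value]
    # Global set: equivalent to excluding rows where focal_variable == focal_value.
    global_rows = [r for r in rows if r.get(focal_variable) != focal_value]
    return (cap_per_group(focal, focal_bin_variable, focal_bin_max),
            cap_per_group(global_rows, global_bin_variable, global_bin_max))


# Made-up records, for illustration only.
rows = [
    {"strain": "A", "region": "North America", "country": "USA", "division": "Ohio"},
    {"strain": "B", "region": "North America", "country": "USA", "division": "Ohio"},
    {"strain": "C", "region": "North America", "country": "USA", "division": "Texas"},
    {"strain": "D", "region": "Europe", "country": "France", "division": "Paris"},
    {"strain": "E", "region": "Europe", "country": "France", "division": "Lyon"},
]
focal_kept, global_kept = subsample_with_focal(rows, focal_bin_max=1, global_bin_max=1)
print([r["strain"] for r in focal_kept])   # ['A', 'C']
print([r["strain"] for r in global_kept])  # ['D']
```

With a cap of 1, the focal set keeps one record per division while the global set keeps one per country, which is how the defaults bias the output towards the focal area: it is binned at a finer geographic level.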

CALL TASKS `cat_fasta` ↗ → concatenate

Input Mappings (2)

Input	Value
`infiles`	`[subsample_focal.filtered_fasta, subsample_global.filtered_fasta]`
`output_name`	`"subsampled.fasta"`

CALL TASKS `fasta_to_ids` ↗

Input Mappings (1)

Input	Value
`sequences_fasta`	`cat_fasta.combined`
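
The `keep_list` output is the list of sequence IDs in the combined FASTA. As a rough sketch of what such an extraction involves, the Python function below collects header lines and strips the leading `>`. It assumes the ID is the whole remainder of the header line; the actual `fasta_to_ids` task may treat headers differently.

```python
def fasta_ids(fasta_text):
    """Return header text (without the leading '>') for each FASTA record."""
    return [
        line[1:].strip()
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


# Made-up two-record FASTA, for illustration only.
example = ">strainA\nACGT\nACGT\n>strainB\nTTGA\n"
print(fasta_ids(example))  # ['strainA', 'strainB']
```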

Images

Container images used by tasks in this workflow:

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 2 tasks:

derived_cols
tsv_join

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 3 tasks:

prefilter
subsample_focal
subsample_global

🐳 ubuntu

ubuntu

Used by 2 tasks:

cat_fasta
fasta_to_ids


flowchart TD
    Start([subsample_by_metadata_with_focal])
    subgraph C1 ["↔️ if length(sample_metadata_tsvs) > 1"]
        direction TB
        N1["tsv_join"]
    end
    N2["derived_cols"]
    N3["prefilter
filter_subsample_sequences"]
    N4["subsample_focal
filter_subsample_sequences"]
    N5["subsample_global
filter_subsample_sequences"]
    N6["cat_fasta
concatenate"]
    N7["fasta_to_ids"]
    N1 --> N2
    N2 --> N3
    N2 --> N4
    N3 --> N4
    N2 --> N5
    N3 --> N5
    N4 --> N6
    N5 --> N6
    N6 --> N7
    Start --> N1
    N7 --> End([End])
    classDef taskNode fill:#a371f7,stroke:#8b5cf6,stroke-width:2px,color:#fff
    classDef workflowNode fill:#58a6ff,stroke:#1f6feb,stroke-width:2px,color:#fff

version 1.0

import "../tasks/tasks_nextstrain.wdl" as nextstrain
import "../tasks/tasks_reports.wdl" as reports
import "../tasks/tasks_utils.wdl" as utils

workflow subsample_by_metadata_with_focal {
    meta {
        description: "Filter and subsample a global sequence set with a bias towards a geographic area of interest."
    }

    parameter_meta {
        sample_metadata_tsvs: {
            description: "Tab-separated metadata files that contain binning variables and values. Must contain all samples: output will be filtered to the IDs present in these files.",
            patterns: ["*.txt", "*.tsv"]
        }
        sequences_fasta: {
            description: "Sequences in fasta format.",
            patterns: ["*.fasta"]
        }

        focal_variable: {
            description: "The dataset will be bifurcated based on this column header."
        }
        focal_value: {
            description: "The dataset will be bifurcated based on whether the focal_variable column matches this value or not. Rows that match this value are considered to be part of the 'focal' set of interest; rows that do not are part of the 'global' set."
        }

        focal_bin_variable: {
            description: "The focal subset of samples will be evenly subsampled across the discrete values of this column header."
        }
        focal_bin_max: {
            description: "The output will contain no more than this number of focal samples from each discrete value in the focal_bin_variable column."
        }

        global_bin_variable: {
            description: "The global subset of samples will be evenly subsampled across the discrete values of this column header."
        }
        global_bin_max: {
            description: "The output will contain no more than this number of global samples from each discrete value in the global_bin_variable column."
        }
    }

    input {
        Array[File]+ sample_metadata_tsvs
        File    sequences_fasta
        File?   priorities

        String  focal_variable      = "region"
        String  focal_value         = "North America"
        
        String  focal_bin_variable  = "division"
        Int     focal_bin_max       = 50
        
        String  global_bin_variable = "country"
        Int     global_bin_max      = 50
    }

    if(length(sample_metadata_tsvs)>1) {
        call utils.tsv_join {
            input:
                input_tsvs   = sample_metadata_tsvs,
                id_col       = 'strain',
                out_basename = "metadata-merged"
        }
    }

    call nextstrain.derived_cols {
        input:
            metadata_tsv = select_first(flatten([[tsv_join.out_tsv], sample_metadata_tsvs]))
    }

    call nextstrain.filter_subsample_sequences as prefilter {
        input:
            sequences_fasta     = sequences_fasta,
            sample_metadata_tsv = derived_cols.derived_metadata
    }

    call nextstrain.filter_subsample_sequences as subsample_focal {
        input:
            sequences_fasta     = prefilter.filtered_fasta,
            sample_metadata_tsv = derived_cols.derived_metadata,
            exclude_where       = ["${focal_variable}!=${focal_value}"],
            sequences_per_group = focal_bin_max,
            group_by            = focal_bin_variable,
            priority            = priorities
    }

    call nextstrain.filter_subsample_sequences as subsample_global {
        input:
            sequences_fasta     = prefilter.filtered_fasta,
            sample_metadata_tsv = derived_cols.derived_metadata,
            exclude_where       = ["${focal_variable}=${focal_value}"],
            sequences_per_group = global_bin_max,
            group_by            = global_bin_variable,
            priority            = priorities
    }

    call utils.concatenate as cat_fasta {
        input:
            infiles = [
                subsample_focal.filtered_fasta, subsample_global.filtered_fasta
            ],
            output_name = "subsampled.fasta"
    }

    call utils.fasta_to_ids {
        input:
            sequences_fasta = cat_fasta.combined
    }

    output {
        File metadata_merged      = derived_cols.derived_metadata
        File keep_list            = fasta_to_ids.ids_txt
        File subsampled_sequences = cat_fasta.combined
        Int  focal_kept           = subsample_focal.sequences_out
        Int  global_kept          = subsample_global.sequences_out
        Int  sequences_kept       = subsample_focal.sequences_out + subsample_global.sequences_out
    }
}

