assemble_refbased - WDL Atlas

WORKFLOW assemble_refbased

File Path	`pipes/WDL/workflows/assemble_refbased.wdl`
WDL Version	1.0
Type	workflow

Imports

Namespace	Path
`assembly`	`../tasks/tasks_assembly.wdl`
`intrahost`	`../tasks/tasks_intrahost.wdl`
`reports`	`../tasks/tasks_reports.wdl`
`read_utils`	`../tasks/tasks_read_utils.wdl`

Workflow: assemble_refbased

Reference-based microbial consensus calling. Aligns NGS reads to a singular reference genome, calls a new consensus sequence, and emits: new assembly, reads aligned to provided reference, reads aligned to new assembly, various figures of merit, plots, and QC metrics. The user may provide unaligned reads spread across multiple input files and this workflow will parallelize alignment per input file before merging results prior to consensus calling.

Author: Broad Viral Genomics

viral-ngs@broadinstitute.org

Subworkflow Usage

This workflow is called as a subworkflow by 7 other workflows:

assemble_denovo - pipes/WDL/workflows/assemble_denovo.wdl
assemble_denovo_metagenomic - pipes/WDL/workflows/assemble_denovo_metagenomic.wdl
metagenomic_denovo - pipes/WDL/workflows/metagenomic_denovo.wdl
sarscov2_illumina_full - pipes/WDL/workflows/sarscov2_illumina_full.wdl
sarscov2_sra_to_genbank - pipes/WDL/workflows/sarscov2_sra_to_genbank.wdl
scaffold_and_refine - pipes/WDL/workflows/scaffold_and_refine.wdl
scaffold_and_refine_multitaxa - pipes/WDL/workflows/scaffold_and_refine_multitaxa.wdl

Inputs

Name	Type	Description	Default
`reads_unmapped_bams`	`Array[File]+`	Unaligned reads in BAM format	-
`reference_fasta`	`File`	Reference genome to align reads to.	-
`sample_original_name`	`String?`	-	-
`novocraft_license`	`File?`	-	-
`trim_coords_bed`	`File?`	optional primers to trim in reference coordinate space (0-based BED format)	-
`machine_mem_gb`	`Int?`	-	-
`min_keep_length`	`Int?`	-	-
`sliding_window`	`Int?`	-	-
`primer_offset`	`Int?`	-	-
`machine_mem_gb`	`Int?`	-	-
`reheader_table`	`File?`	-	-
`amplicon_set`	`String?`	-	-
`max_coverage_depth`	`Int?`	-	-
`base_q_threshold`	`Int?`	-	-
`mapping_q_threshold`	`Int?`	-	-
`read_length_threshold`	`Int?`	-	-
`plotXLimits`	`String?`	-	-
`plotYLimits`	`String?`	-	-
`machine_mem_gb`	`Int?`	-	-
`reheader_table`	`File?`	-	-
`max_coverage_depth`	`Int?`	-	-
`base_q_threshold`	`Int?`	-	-
`mapping_q_threshold`	`Int?`	-	-
`read_length_threshold`	`Int?`	-	-
`plotXLimits`	`String?`	-	-
`plotYLimits`	`String?`	-	-
51 optional inputs with default values
`sample_name`	`String`	Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.	basename(reads_unmapped_bams[0],'.bam')
`aligner`	`String`	Read aligner software to use. Options: novoalign, bwa, minimap2. Minimap2 can automatically handle Illumina, PacBio, or Oxford Nanopore reads as long as the 'PL' field in the BAM read group header is set properly (novoalign and bwa are Illumina-only).	"minimap2"
`min_coverage`	`Int`	Minimum read coverage required to call a position unambiguous.	3
`major_cutoff`	`Float`	If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position.	0.75
`skip_mark_dupes`	`Boolean`	skip Picard MarkDuplicates step after alignment. This is recommended to be set to true for PCR amplicon based data. (Default: false)	false
`align_to_ref_options`	`Map[String,String]`	-	{"novoalign": "-r Random -l 40 -g 40 -x 20 -t 501 -k", "bwa": "-k 12 -B 1", "minimap2": ""}
`align_to_self_options`	`Map[String,String]`	-	{"novoalign": "-r Random -l 40 -g 40 -x 20 -t 100", "bwa": "", "minimap2": ""}
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`sample_name`	`String`	Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.	basename(basename(basename(reads_unmapped_bam,".bam"),".taxfilt"),".clean")
`min_quality`	`Int?`	-	1
`docker`	`String`	-	"andersenlabapps/ivar:1.3.1"
`bam_basename`	`String`	-	basename(aligned_bam,".bam")
`disk_size`	`Int`	-	375
`run_fastqc`	`Boolean`	-	false
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`disk_size`	`Int`	-	750
`machine_mem_gb`	`Int`	-	4
`out_basename`	`String`	-	basename(aligned_bam,'.bam')
`docker`	`String`	-	"quay.io/broadinstitute/viral-phylo:2.5.1.0"
`max_amp_len`	`Int`	-	5000
`max_amplicons`	`Int`	-	500
`machine_mem_gb`	`Int`	-	32
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`skip_mark_dupes`	`Boolean`	skip Picard MarkDuplicates step after alignment. This is recommended to be set to true for PCR amplicon based data. (Default: false)	false
`plot_only_non_duplicates`	`Boolean`	-	false
`bin_large_plots`	`Boolean`	-	false
`binning_summary_statistic`	`String?`	-	"max"
`plot_width_pixels`	`Int?`	-	1100
`plot_height_pixels`	`Int?`	-	850
`plot_pixels_per_inch`	`Int?`	-	100
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`mark_duplicates`	`Boolean`	-	false
`machine_mem_gb`	`Int`	-	15
`docker`	`String`	-	"quay.io/broadinstitute/viral-assemble:2.5.1.0"
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`sample_name`	`String`	Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.	basename(basename(basename(reads_unmapped_bam,".bam"),".taxfilt"),".clean")
`run_fastqc`	`Boolean`	-	false
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"
`disk_size`	`Int`	-	750
`machine_mem_gb`	`Int`	-	4
`out_basename`	`String`	-	basename(aligned_bam,'.bam')
`docker`	`String`	-	"quay.io/broadinstitute/viral-phylo:2.5.1.0"
`skip_mark_dupes`	`Boolean`	skip Picard MarkDuplicates step after alignment. This is recommended to be set to true for PCR amplicon based data. (Default: false)	false
`plot_only_non_duplicates`	`Boolean`	-	false
`bin_large_plots`	`Boolean`	-	false
`binning_summary_statistic`	`String?`	-	"max"
`plot_width_pixels`	`Int?`	-	1100
`plot_height_pixels`	`Int?`	-	850
`plot_pixels_per_inch`	`Int?`	-	100
`docker`	`String`	-	"quay.io/broadinstitute/viral-core:2.5.1"

Outputs

Name	Type	Expression
`assembly_fasta`	`File`	`call_consensus.refined_assembly_fasta`
`align_to_ref_variants_vcf_gz`	`File`	`call_consensus.sites_vcf_gz`
`assembly_length`	`Int`	`call_consensus.assembly_length`
`assembly_length_unambiguous`	`Int`	`call_consensus.assembly_length_unambiguous`
`reference_genome_length`	`Int`	`align_to_ref.reference_length[0]`
`assembly_mean_coverage`	`Float`	`plot_ref_coverage.mean_coverage`
`dist_to_ref_snps`	`Int`	`call_consensus.dist_to_ref_snps`
`dist_to_ref_indels`	`Int`	`call_consensus.dist_to_ref_indels`
`primer_trimmed_read_count`	`Array[Int]`	`ivar_trim.primer_trimmed_read_count`
`primer_trimmed_read_percent`	`Array[Float]`	`ivar_trim.primer_trimmed_read_percent`
`ivar_trim_stats`	`Array[Map[String,String]]`	`ivar_stats`
`ivar_trim_stats_tsv`	`Array[Array[String]]`	`ivar_stats_row`
`replicate_concordant_sites`	`Int`	`run_discordance.concordant_sites`
`replicate_discordant_snps`	`Int`	`run_discordance.discordant_snps`
`replicate_discordant_indels`	`Int`	`run_discordance.discordant_indels`
`num_read_groups`	`Int`	`run_discordance.num_read_groups`
`num_libraries`	`Int`	`run_discordance.num_libraries`
`replicate_discordant_vcf`	`File`	`run_discordance.discordant_sites_vcf`
`align_to_ref_per_input_aligned_flagstat`	`Array[File]`	`align_to_ref.aligned_bam_flagstat`
`align_to_ref_per_input_reads_provided`	`Array[Int]`	`align_to_ref.reads_provided`
`align_to_ref_per_input_reads_aligned`	`Array[Int]`	`align_to_ref.reads_aligned`
`align_to_ref_merged_aligned_trimmed_only_bam`	`File`	`aligned_trimmed_bam`
`align_to_ref_fastqc`	`File`	`select_first([merge_align_to_ref.fastqc, align_to_ref.aligned_only_reads_fastqc[0]])`
`align_to_ref_merged_coverage_plot`	`File`	`plot_ref_coverage.coverage_plot`
`align_to_ref_merged_coverage_tsv`	`File`	`plot_ref_coverage.coverage_tsv`
`align_to_ref_merged_reads_aligned`	`Int`	`plot_ref_coverage.reads_aligned`
`align_to_ref_merged_read_pairs_aligned`	`Int`	`plot_ref_coverage.read_pairs_aligned`
`align_to_ref_merged_bases_aligned`	`Int`	`floor(plot_ref_coverage.bases_aligned)`
`align_to_ref_isnvs_vcf`	`File`	`isnvs_ref.report_vcf`
`picard_metrics_wgs`	`File`	`alignment_metrics.wgs_metrics`
`picard_metrics_alignment`	`File`	`alignment_metrics.alignment_metrics`
`picard_metrics_insert_size`	`File`	`alignment_metrics.insert_size_metrics`
`samtools_ampliconstats`	`File`	`alignment_metrics.amplicon_stats`
`samtools_ampliconstats_parsed`	`File`	`alignment_metrics.amplicon_stats_parsed`
`align_to_self_merged_aligned_and_unaligned_bam`	`Array[File]`	`align_to_self.aligned_bam`
`align_to_self_merged_aligned_only_bam`	`File`	`aligned_self_bam`
`align_to_self_merged_coverage_plot`	`File`	`plot_self_coverage.coverage_plot`
`align_to_self_merged_coverage_tsv`	`File`	`plot_self_coverage.coverage_tsv`
`align_to_self_merged_reads_aligned`	`Int`	`plot_self_coverage.reads_aligned`
`align_to_self_merged_read_pairs_aligned`	`Int`	`plot_self_coverage.read_pairs_aligned`
`align_to_self_merged_bases_aligned`	`Int`	`floor(plot_self_coverage.bases_aligned)`
`align_to_self_merged_mean_coverage`	`Float`	`plot_self_coverage.mean_coverage`
`align_to_self_isnvs_vcf`	`File`	`isnvs_self.report_vcf`
`assembly_method`	`String`	`"viral-ngs/assemble_refbased"`
`align_to_ref_viral_core_version`	`String`	`align_to_ref.viralngs_version[0]`
`ivar_version`	`String`	`ivar_trim.ivar_version[0]`
`viral_assemble_version`	`String`	`call_consensus.viralngs_version`

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS `align_to_ref` ↗ → align_reads

Input Mappings (6)

Input	Value
`reference_fasta`	`reference_fasta`
`reads_unmapped_bam`	`reads_unmapped_bam`
`novocraft_license`	`novocraft_license`
`skip_mark_dupes`	`skip_mark_dupes`
`aligner`	`aligner`
`aligner_options`	`align_to_ref_options[aligner]`

CALL TASKS `ivar_trim` ↗

Input Mappings (2)

Input	Value
`aligned_bam`	`align_to_ref.aligned_only_reads_bam`
`trim_coords_bed`	`trim_coords_bed`

CALL TASKS `merge_align_to_ref` ↗ → merge_and_reheader_bams

Input Mappings (3)

Input	Value
`in_bams`	`ivar_trim.aligned_trimmed_bam`
`sample_name`	`sample_name`
`out_basename`	`"~{sample_name}.align_to_ref.trimmed"`

CALL TASKS `isnvs_ref` ↗ → lofreq

Input Mappings (2)

Input	Value
`reference_fasta`	`reference_fasta`
`aligned_bam`	`aligned_trimmed_bam`

CALL TASKS `alignment_metrics` ↗

Input Mappings (4)

Input	Value
`aligned_bam`	`aligned_trimmed_bam`
`ref_fasta`	`reference_fasta`
`primers_bed`	`trim_coords_bed`
`min_coverage`	`min_coverage`

CALL TASKS `run_discordance` ↗

Input Mappings (4)

Input	Value
`reads_aligned_bam`	`aligned_trimmed_bam`
`reference_fasta`	`reference_fasta`
`min_coverage`	`min_coverage + 1`
`out_basename`	`sample_name`

CALL TASKS `plot_ref_coverage` ↗ → plot_coverage

Input Mappings (2)

Input	Value
`aligned_reads_bam`	`aligned_trimmed_bam`
`sample_name`	`sample_name`

CALL TASKS `call_consensus` ↗ → refine_assembly_with_aligned_reads

Input Mappings (6)

Input	Value
`reference_fasta`	`reference_fasta`
`reads_aligned_bam`	`aligned_trimmed_bam`
`min_coverage`	`min_coverage`
`major_cutoff`	`major_cutoff`
`out_basename`	`sample_name`
`sample_name`	`select_first([sample_original_name, sample_name])`

CALL TASKS `align_to_self` ↗ → align_reads

Input Mappings (6)

Input	Value
`reference_fasta`	`call_consensus.refined_assembly_fasta`
`reads_unmapped_bam`	`reads_unmapped_bam`
`novocraft_license`	`novocraft_license`
`skip_mark_dupes`	`skip_mark_dupes`
`aligner`	`aligner`
`aligner_options`	`align_to_self_options[aligner]`

CALL TASKS `merge_align_to_self` ↗ → merge_and_reheader_bams

Input Mappings (3)

Input	Value
`in_bams`	`align_to_self.aligned_only_reads_bam`
`sample_name`	`sample_name`
`out_basename`	`"~{sample_name}.merge_align_to_self"`

CALL TASKS `isnvs_self` ↗ → lofreq

Input Mappings (2)

Input	Value
`reference_fasta`	`call_consensus.refined_assembly_fasta`
`aligned_bam`	`aligned_self_bam`

CALL TASKS `plot_self_coverage` ↗ → plot_coverage

Input Mappings (2)

Input	Value
`aligned_reads_bam`	`aligned_self_bam`
`sample_name`	`sample_name`

Images

Container images used by tasks in this workflow:

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 2 tasks:

isnvs_ref
isnvs_self

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 6 tasks:

alignment_metrics
run_discordance
align_to_ref
merge_align_to_ref
align_to_self
merge_align_to_self

🐳 ~{docker}

~{docker}

Used by 2 tasks:

plot_ref_coverage
plot_self_coverage

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 1 task:

call_consensus

🐳 Parameterized Image

⚙️ Parameterized

Configured via input:
docker

Used by 1 task:

ivar_trim

← Back to Index

flowchart TD
    Start([assemble_refbased])
    subgraph S1 ["🔃 scatter reads_unmapped_bam in reads_unmapped_bams"]
        direction TB
        N1["align_to_ref
align_reads"]
        N2["ivar_trim"]
    end
    subgraph C1 ["↔️ if length(reads_unmapped_bams) > 1"]
        direction TB
        N3["merge_align_to_ref
merge_and_reheader_bams"]
    end
    N4["isnvs_ref
lofreq"]
    N5["alignment_metrics"]
    N6["run_discordance"]
    N7["plot_ref_coverage
plot_coverage"]
    N8["call_consensus
refine_assembly_with_aligned_reads"]
    subgraph S2 ["🔃 scatter reads_unmapped_bam in reads_unmapped_bams"]
        direction TB
        N9["align_to_self
align_reads"]
    end
    subgraph C2 ["↔️ if length(reads_unmapped_bams) > 1"]
        direction TB
        N10["merge_align_to_self
merge_and_reheader_bams"]
    end
    N11["isnvs_self
lofreq"]
    N12["plot_self_coverage
plot_coverage"]
    N1 --> N2
    N2 --> N3
    N3 --> N4
    N2 --> N4
    N3 --> N5
    N2 --> N5
    N3 --> N6
    N2 --> N6
    N3 --> N7
    N2 --> N7
    N3 --> N8
    N2 --> N8
    N8 --> N9
    N9 --> N10
    N9 --> N11
    N8 --> N11
    N10 --> N11
    N9 --> N12
    N10 --> N12
    Start --> N1
    N7 --> End([End])
    N6 --> End([End])
    N4 --> End([End])
    N12 --> End([End])
    N11 --> End([End])
    N5 --> End([End])
    classDef taskNode fill:#a371f7,stroke:#8b5cf6,stroke-width:2px,color:#fff
    classDef workflowNode fill:#58a6ff,stroke:#1f6feb,stroke-width:2px,color:#fff

version 1.0

import "../tasks/tasks_assembly.wdl" as assembly
import "../tasks/tasks_intrahost.wdl" as intrahost
import "../tasks/tasks_reports.wdl" as reports
import "../tasks/tasks_read_utils.wdl" as read_utils

workflow assemble_refbased {

    meta {
        description: "Reference-based microbial consensus calling. Aligns NGS reads to a singular reference genome, calls a new consensus sequence, and emits: new assembly, reads aligned to provided reference, reads aligned to new assembly, various figures of merit, plots, and QC metrics. The user may provide unaligned reads spread across multiple input files and this workflow will parallelize alignment per input file before merging results prior to consensus calling."
        author: "Broad Viral Genomics"
        email:  "viral-ngs@broadinstitute.org"
        allowNestedInputs: true
        # https://wdl-aid.readthedocs.io/en/latest/usage.html -- can exclude certain inputs from documentation; attempt to do this on with docker nominated variables 
    }

    parameter_meta {
        sample_name: {
            description: "Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.",
            category: "common"
        }
        reads_unmapped_bams: {
            description: "Unaligned reads in BAM format",
            patterns: ["*.bam"]
        }
        reference_fasta: {
            description: "Reference genome to align reads to.",
            patterns: ["*.fasta"]
        }
        aligner: {
            description: "Read aligner software to use. Options: novoalign, bwa, minimap2. Minimap2 can automatically handle Illumina, PacBio, or Oxford Nanopore reads as long as the 'PL' field in the BAM read group header is set properly (novoalign and bwa are Illumina-only)."
        }
        skip_mark_dupes: {
            description: "skip Picard MarkDuplicates step after alignment. This is recommended to be set to true for PCR amplicon based data. (Default: false)"
        }
        trim_coords_bed: {
            description: "optional primers to trim in reference coordinate space (0-based BED format)",
            patterns: ["*.bed"],
            category: "common"
        }
        min_coverage: {
            description: "Minimum read coverage required to call a position unambiguous."
        }
        major_cutoff: {
            description: "If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position.  If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position."
        }
        assembly_fasta: { 
            description: "The new assembly / consensus sequence for this sample" 
        }
        align_to_ref_variants_vcf_gz: { 
            description: "All variants in the input reads against the original reference genome. This VCF file is used to create the assembly_fasta" 
        }
        assembly_length: { 
            description: "The length of the sequence described in assembly_fasta, inclusive of any uncovered regions denoted by Ns" 
        }
        assembly_length_unambiguous: { 
            description: "The number of called consensus bases in assembly_fasta (excludes regions of the genome that lack read coverage)" 
        }
    }

    input {
        Array[File]+ reads_unmapped_bams
        File         reference_fasta
        String       sample_name = basename(reads_unmapped_bams[0], '.bam')
        String?      sample_original_name

        String       aligner="minimap2"
        File?        novocraft_license
        Int          min_coverage=3
        Float        major_cutoff=0.75
        Boolean      skip_mark_dupes=false
        File?        trim_coords_bed

        Map[String,String] align_to_ref_options = {
                            "novoalign": "-r Random -l 40 -g 40 -x 20 -t 501 -k",
                            "bwa": "-k 12 -B 1",
                            "minimap2": ""
                            }
        Map[String,String] align_to_self_options = {
                            "novoalign": "-r Random -l 40 -g 40 -x 20 -t 100",
                            "bwa": "",
                            "minimap2": ""
                            }
    }

    scatter(reads_unmapped_bam in reads_unmapped_bams) {
        call assembly.align_reads as align_to_ref {
            input:
                reference_fasta    = reference_fasta,
                reads_unmapped_bam = reads_unmapped_bam,
                novocraft_license  = novocraft_license,
                skip_mark_dupes    = skip_mark_dupes,
                aligner            = aligner,
                aligner_options    = align_to_ref_options[aligner]
        }
        call assembly.ivar_trim {
            input:
                aligned_bam     = align_to_ref.aligned_only_reads_bam,
                trim_coords_bed = trim_coords_bed
        }
        Map[String,String] ivar_stats = {
            'file': basename(reads_unmapped_bam, '.bam'),
            'trim_percent': ivar_trim.primer_trimmed_read_percent,
            'trim_count':   ivar_trim.primer_trimmed_read_count
        }
        Array[String] ivar_stats_row = [ivar_stats['file'], ivar_stats['trim_percent'], ivar_stats['trim_count']]
    }

    if(length(reads_unmapped_bams)>1) {
        call read_utils.merge_and_reheader_bams as merge_align_to_ref {
            input:
                in_bams      = ivar_trim.aligned_trimmed_bam,
                sample_name  = sample_name,
                out_basename = "~{sample_name}.align_to_ref.trimmed"
        }
    }
    File aligned_trimmed_bam = select_first([merge_align_to_ref.out_bam, ivar_trim.aligned_trimmed_bam[0]])

    call intrahost.lofreq as isnvs_ref {
        input:
            reference_fasta = reference_fasta,
            aligned_bam     = aligned_trimmed_bam
    }

    call reports.alignment_metrics {
        input:
            aligned_bam = aligned_trimmed_bam,
            ref_fasta   = reference_fasta,
            primers_bed = trim_coords_bed,
            min_coverage = min_coverage
    }

    call assembly.run_discordance {
        input:
            reads_aligned_bam = aligned_trimmed_bam,
            reference_fasta   = reference_fasta,
            min_coverage      = min_coverage+1,
            out_basename      = sample_name
    }

    call reports.plot_coverage as plot_ref_coverage {
        input:
            aligned_reads_bam = aligned_trimmed_bam,
            sample_name       = sample_name
    }

    call assembly.refine_assembly_with_aligned_reads as call_consensus {
        input:
            reference_fasta   = reference_fasta,
            reads_aligned_bam = aligned_trimmed_bam,
            min_coverage      = min_coverage,
            major_cutoff      = major_cutoff,
            out_basename      = sample_name,
            sample_name       = select_first([sample_original_name, sample_name])
    }

    scatter(reads_unmapped_bam in reads_unmapped_bams) {
        call assembly.align_reads as align_to_self {
            input:
                reference_fasta    = call_consensus.refined_assembly_fasta,
                reads_unmapped_bam = reads_unmapped_bam,
                novocraft_license  = novocraft_license,
                skip_mark_dupes    = skip_mark_dupes,
                aligner            = aligner,
                aligner_options    = align_to_self_options[aligner]
        }
    }

    if(length(reads_unmapped_bams)>1) {
        call read_utils.merge_and_reheader_bams as merge_align_to_self {
            input:
                in_bams      = align_to_self.aligned_only_reads_bam,
                sample_name  = sample_name,
                out_basename = "~{sample_name}.merge_align_to_self"
        }
    }
    File aligned_self_bam = select_first([merge_align_to_self.out_bam, align_to_self.aligned_only_reads_bam[0]])

    call intrahost.lofreq as isnvs_self {
        input:
            reference_fasta = call_consensus.refined_assembly_fasta,
            aligned_bam     = aligned_self_bam
    }

    call reports.plot_coverage as plot_self_coverage {
        input:
            aligned_reads_bam = aligned_self_bam,
            sample_name       = sample_name
    }

    output {
        File        assembly_fasta                               = call_consensus.refined_assembly_fasta
        File        align_to_ref_variants_vcf_gz                 = call_consensus.sites_vcf_gz
        Int         assembly_length                              = call_consensus.assembly_length
        Int         assembly_length_unambiguous                  = call_consensus.assembly_length_unambiguous
        Int         reference_genome_length                      = align_to_ref.reference_length[0]
        Float       assembly_mean_coverage                       = plot_ref_coverage.mean_coverage
        
        Int         dist_to_ref_snps                             = call_consensus.dist_to_ref_snps
        Int         dist_to_ref_indels                           = call_consensus.dist_to_ref_indels
        
        Array[Int]                primer_trimmed_read_count      = ivar_trim.primer_trimmed_read_count
        Array[Float]              primer_trimmed_read_percent    = ivar_trim.primer_trimmed_read_percent
        Array[Map[String,String]] ivar_trim_stats                = ivar_stats
        Array[Array[String]]      ivar_trim_stats_tsv            = ivar_stats_row
        
        Int         replicate_concordant_sites                   = run_discordance.concordant_sites
        Int         replicate_discordant_snps                    = run_discordance.discordant_snps
        Int         replicate_discordant_indels                  = run_discordance.discordant_indels
        Int         num_read_groups                              = run_discordance.num_read_groups
        Int         num_libraries                                = run_discordance.num_libraries
        File        replicate_discordant_vcf                     = run_discordance.discordant_sites_vcf
        
        Array[File] align_to_ref_per_input_aligned_flagstat      = align_to_ref.aligned_bam_flagstat
        Array[Int]  align_to_ref_per_input_reads_provided        = align_to_ref.reads_provided
        Array[Int]  align_to_ref_per_input_reads_aligned         = align_to_ref.reads_aligned

        File        align_to_ref_merged_aligned_trimmed_only_bam = aligned_trimmed_bam
        File        align_to_ref_fastqc                          = select_first([merge_align_to_ref.fastqc, align_to_ref.aligned_only_reads_fastqc[0]])
        File        align_to_ref_merged_coverage_plot            = plot_ref_coverage.coverage_plot
        File        align_to_ref_merged_coverage_tsv             = plot_ref_coverage.coverage_tsv
        Int         align_to_ref_merged_reads_aligned            = plot_ref_coverage.reads_aligned
        Int         align_to_ref_merged_read_pairs_aligned       = plot_ref_coverage.read_pairs_aligned
        Int         align_to_ref_merged_bases_aligned            = floor(plot_ref_coverage.bases_aligned)
        File        align_to_ref_isnvs_vcf                       = isnvs_ref.report_vcf
        
        File        picard_metrics_wgs                           = alignment_metrics.wgs_metrics
        File        picard_metrics_alignment                     = alignment_metrics.alignment_metrics
        File        picard_metrics_insert_size                   = alignment_metrics.insert_size_metrics
        File        samtools_ampliconstats                       = alignment_metrics.amplicon_stats
        File        samtools_ampliconstats_parsed                = alignment_metrics.amplicon_stats_parsed

        Array[File] align_to_self_merged_aligned_and_unaligned_bam = align_to_self.aligned_bam

        File        align_to_self_merged_aligned_only_bam        = aligned_self_bam
        File        align_to_self_merged_coverage_plot           = plot_self_coverage.coverage_plot
        File        align_to_self_merged_coverage_tsv            = plot_self_coverage.coverage_tsv
        Int         align_to_self_merged_reads_aligned           = plot_self_coverage.reads_aligned
        Int         align_to_self_merged_read_pairs_aligned      = plot_self_coverage.read_pairs_aligned
        Int         align_to_self_merged_bases_aligned           = floor(plot_self_coverage.bases_aligned)
        Float       align_to_self_merged_mean_coverage           = plot_self_coverage.mean_coverage
        File        align_to_self_isnvs_vcf                      = isnvs_self.report_vcf
        
        String      assembly_method = "viral-ngs/assemble_refbased"
        String      align_to_ref_viral_core_version              = align_to_ref.viralngs_version[0]
        String      ivar_version                                 = ivar_trim.ivar_version[0]
        String      viral_assemble_version                       = call_consensus.viralngs_version
    }

}

version 1.0 import "../tasks/tasks_assembly.wdl" as assembly import "../tasks/tasks_intrahost.wdl" as intrahost import "../tasks/tasks_reports.wdl" as reports import "../tasks/tasks_read_utils.wdl" as read_utils workflow assemble_refbased { meta { description: "Reference-based microbial consensus calling. Aligns NGS reads to a singular reference genome, calls a new consensus sequence, and emits: new assembly, reads aligned to provided reference, reads aligned to new assembly, various figures of merit, plots, and QC metrics. The user may provide unaligned reads spread across multiple input files and this workflow will parallelize alignment per input file before merging results prior to consensus calling." author: "Broad Viral Genomics" email: "viral-ngs@broadinstitute.org" allowNestedInputs: true # https://wdl-aid.readthedocs.io/en/latest/usage.html -- can exclude certain inputs from documentation; attempt to do this on with docker nominated variables } parameter_meta { sample_name: { description: "Base name of output files. The 'SM' field in BAM read group headers are also rewritten to this value. Avoid spaces and other filename-unfriendly characters.", category: "common" } reads_unmapped_bams: { description: "Unaligned reads in BAM format", patterns: ["*.bam"] } reference_fasta: { description: "Reference genome to align reads to.", patterns: ["*.fasta"] } aligner: { description: "Read aligner software to use. Options: novoalign, bwa, minimap2. Minimap2 can automatically handle Illumina, PacBio, or Oxford Nanopore reads as long as the 'PL' field in the BAM read group header is set properly (novoalign and bwa are Illumina-only)." } skip_mark_dupes: { description: "skip Picard MarkDuplicates step after alignment. This is recommended to be set to true for PCR amplicon based data. (Default: false)" } trim_coords_bed: { description: "optional primers to trim in reference coordinate space (0-based BED format)", patterns: ["*.bed"], category: "common" } min_coverage: { description: "Minimum read coverage required to call a position unambiguous." } major_cutoff: { description: "If the major allele is present at a frequency higher than this cutoff, we will call an unambiguous base at that position. If it is equal to or below this cutoff, we will call an ambiguous base representing all possible alleles at that position." } assembly_fasta: { description: "The new assembly / consensus sequence for this sample" } align_to_ref_variants_vcf_gz: { description: "All variants in the input reads against the original reference genome. This VCF file is used to create the assembly_fasta" } assembly_length: { description: "The length of the sequence described in assembly_fasta, inclusive of any uncovered regions denoted by Ns" } assembly_length_unambiguous: { description: "The number of called consensus bases in assembly_fasta (excludes regions of the genome that lack read coverage)" } } input { Array[File]+ reads_unmapped_bams File reference_fasta String sample_name = basename(reads_unmapped_bams[0], '.bam') String? sample_original_name String aligner="minimap2" File? novocraft_license Int min_coverage=3 Float major_cutoff=0.75 Boolean skip_mark_dupes=false File? trim_coords_bed Map[String,String] align_to_ref_options = { "novoalign": "-r Random -l 40 -g 40 -x 20 -t 501 -k", "bwa": "-k 12 -B 1", "minimap2": "" } Map[String,String] align_to_self_options = { "novoalign": "-r Random -l 40 -g 40 -x 20 -t 100", "bwa": "", "minimap2": "" } } scatter(reads_unmapped_bam in reads_unmapped_bams) { call assembly.align_reads as align_to_ref { input: reference_fasta = reference_fasta, reads_unmapped_bam = reads_unmapped_bam, novocraft_license = novocraft_license, skip_mark_dupes = skip_mark_dupes, aligner = aligner, aligner_options = align_to_ref_options[aligner] } call assembly.ivar_trim { input: aligned_bam = align_to_ref.aligned_only_reads_bam, trim_coords_bed = trim_coords_bed } Map[String,String] ivar_stats = { 'file': basename(reads_unmapped_bam, '.bam'), 'trim_percent': ivar_trim.primer_trimmed_read_percent, 'trim_count': ivar_trim.primer_trimmed_read_count } Array[String] ivar_stats_row = [ivar_stats['file'], ivar_stats['trim_percent'], ivar_stats['trim_count']] } if(length(reads_unmapped_bams)>1) { call read_utils.merge_and_reheader_bams as merge_align_to_ref { input: in_bams = ivar_trim.aligned_trimmed_bam, sample_name = sample_name, out_basename = "~{sample_name}.align_to_ref.trimmed" } } File aligned_trimmed_bam = select_first([merge_align_to_ref.out_bam, ivar_trim.aligned_trimmed_bam[0]]) call intrahost.lofreq as isnvs_ref { input: reference_fasta = reference_fasta, aligned_bam = aligned_trimmed_bam } call reports.alignment_metrics { input: aligned_bam = aligned_trimmed_bam, ref_fasta = reference_fasta, primers_bed = trim_coords_bed, min_coverage = min_coverage } call assembly.run_discordance { input: reads_aligned_bam = aligned_trimmed_bam, reference_fasta = reference_fasta, min_coverage = min_coverage+1, out_basename = sample_name } call reports.plot_coverage as plot_ref_coverage { input: aligned_reads_bam = aligned_trimmed_bam, sample_name = sample_name } call assembly.refine_assembly_with_aligned_reads as call_consensus { input: reference_fasta = reference_fasta, reads_aligned_bam = aligned_trimmed_bam, min_coverage = min_coverage, major_cutoff = major_cutoff, out_basename = sample_name, sample_name = select_first([sample_original_name, sample_name]) } scatter(reads_unmapped_bam in reads_unmapped_bams) { call assembly.align_reads as align_to_self { input: reference_fasta = call_consensus.refined_assembly_fasta, reads_unmapped_bam = reads_unmapped_bam, novocraft_license = novocraft_license, skip_mark_dupes = skip_mark_dupes, aligner = aligner, aligner_options = align_to_self_options[aligner] } } if(length(reads_unmapped_bams)>1) { call read_utils.merge_and_reheader_bams as merge_align_to_self { input: in_bams = align_to_self.aligned_only_reads_bam, sample_name = sample_name, out_basename = "~{sample_name}.merge_align_to_self" } } File aligned_self_bam = select_first([merge_align_to_self.out_bam, align_to_self.aligned_only_reads_bam[0]]) call intrahost.lofreq as isnvs_self { input: reference_fasta = call_consensus.refined_assembly_fasta, aligned_bam = aligned_self_bam } call reports.plot_coverage as plot_self_coverage { input: aligned_reads_bam = aligned_self_bam, sample_name = sample_name } output { File assembly_fasta = call_consensus.refined_assembly_fasta File align_to_ref_variants_vcf_gz = call_consensus.sites_vcf_gz Int assembly_length = call_consensus.assembly_length Int assembly_length_unambiguous = call_consensus.assembly_length_unambiguous Int reference_genome_length = align_to_ref.reference_length[0] Float assembly_mean_coverage = plot_ref_coverage.mean_coverage Int dist_to_ref_snps = call_consensus.dist_to_ref_snps Int dist_to_ref_indels = call_consensus.dist_to_ref_indels Array[Int] primer_trimmed_read_count = ivar_trim.primer_trimmed_read_count Array[Float] primer_trimmed_read_percent = ivar_trim.primer_trimmed_read_percent Array[Map[String,String]] ivar_trim_stats = ivar_stats Array[Array[String]] ivar_trim_stats_tsv = ivar_stats_row Int replicate_concordant_sites = run_discordance.concordant_sites Int replicate_discordant_snps = run_discordance.discordant_snps Int replicate_discordant_indels = run_discordance.discordant_indels Int num_read_groups = run_discordance.num_read_groups Int num_libraries = run_discordance.num_libraries File replicate_discordant_vcf = run_discordance.discordant_sites_vcf Array[File] align_to_ref_per_input_aligned_flagstat = align_to_ref.aligned_bam_flagstat Array[Int] align_to_ref_per_input_reads_provided = align_to_ref.reads_provided Array[Int] align_to_ref_per_input_reads_aligned = align_to_ref.reads_aligned File align_to_ref_merged_aligned_trimmed_only_bam = aligned_trimmed_bam File align_to_ref_fastqc = select_first([merge_align_to_ref.fastqc, align_to_ref.aligned_only_reads_fastqc[0]]) File align_to_ref_merged_coverage_plot = plot_ref_coverage.coverage_plot File align_to_ref_merged_coverage_tsv = plot_ref_coverage.coverage_tsv Int align_to_ref_merged_reads_aligned = plot_ref_coverage.reads_aligned Int align_to_ref_merged_read_pairs_aligned = plot_ref_coverage.read_pairs_aligned Int align_to_ref_merged_bases_aligned = floor(plot_ref_coverage.bases_aligned) File align_to_ref_isnvs_vcf = isnvs_ref.report_vcf File picard_metrics_wgs = alignment_metrics.wgs_metrics File picard_metrics_alignment = alignment_metrics.alignment_metrics File picard_metrics_insert_size = alignment_metrics.insert_size_metrics File samtools_ampliconstats = alignment_metrics.amplicon_stats File samtools_ampliconstats_parsed = alignment_metrics.amplicon_stats_parsed Array[File] align_to_self_merged_aligned_and_unaligned_bam = align_to_self.aligned_bam File align_to_self_merged_aligned_only_bam = aligned_self_bam File align_to_self_merged_coverage_plot = plot_self_coverage.coverage_plot File align_to_self_merged_coverage_tsv = plot_self_coverage.coverage_tsv Int align_to_self_merged_reads_aligned = plot_self_coverage.reads_aligned Int align_to_self_merged_read_pairs_aligned = plot_self_coverage.read_pairs_aligned Int align_to_self_merged_bases_aligned = floor(plot_self_coverage.bases_aligned) Float align_to_self_merged_mean_coverage = plot_self_coverage.mean_coverage File align_to_self_isnvs_vcf = isnvs_self.report_vcf String assembly_method = "viral-ngs/assemble_refbased" String align_to_ref_viral_core_version = align_to_ref.viralngs_version[0] String ivar_version = ivar_trim.ivar_version[0] String viral_assemble_version = call_consensus.viralngs_version } }

WORKFLOW assemble_refbased

Imports

Workflow: assemble_refbased

Inputs

Outputs

Calls

CALL TASKS align_to_ref ↗ → align_reads

CALL TASKS ivar_trim ↗

CALL TASKS merge_align_to_ref ↗ → merge_and_reheader_bams

CALL TASKS isnvs_ref ↗ → lofreq

CALL TASKS alignment_metrics ↗

CALL TASKS run_discordance ↗

CALL TASKS plot_ref_coverage ↗ → plot_coverage

CALL TASKS call_consensus ↗ → refine_assembly_with_aligned_reads

CALL TASKS align_to_self ↗ → align_reads

CALL TASKS merge_align_to_self ↗ → merge_and_reheader_bams

CALL TASKS isnvs_self ↗ → lofreq

CALL TASKS plot_self_coverage ↗ → plot_coverage

Images

assemble_refbased - Workflow Graph

assemble_refbased - WDL Source Code

CALL TASKS `align_to_ref` ↗ → align_reads

CALL TASKS `ivar_trim` ↗

CALL TASKS `merge_align_to_ref` ↗ → merge_and_reheader_bams

CALL TASKS `isnvs_ref` ↗ → lofreq

CALL TASKS `alignment_metrics` ↗

CALL TASKS `run_discordance` ↗

CALL TASKS `plot_ref_coverage` ↗ → plot_coverage

CALL TASKS `call_consensus` ↗ → refine_assembly_with_aligned_reads

CALL TASKS `align_to_self` ↗ → align_reads

CALL TASKS `merge_align_to_self` ↗ → merge_and_reheader_bams

CALL TASKS `isnvs_self` ↗ → lofreq

CALL TASKS `plot_self_coverage` ↗ → plot_coverage