
WORKFLOW metagenomic_denovo

File Path pipes/WDL/workflows/metagenomic_denovo.wdl
WDL Version 1.0
Type workflow

Imports

Namespace Path
taxon_filter ../tasks/tasks_taxon_filter.wdl
read_utils ../tasks/tasks_read_utils.wdl
assembly ../tasks/tasks_assembly.wdl
metagenomics ../tasks/tasks_metagenomics.wdl
reports ../tasks/tasks_reports.wdl
assemble_refbased assemble_refbased.wdl

Workflow: metagenomic_denovo

Assisted de novo viral genome assembly (SPAdes, scaffolding, and polishing) from metagenomic raw reads. Raw reads are run through taxonomic classification (Kraken2), human read depletion (based on Kraken2 and optionally using BWA, BLASTN, and/or BMTAGGER databases), and FastQC/MultiQC read QC. A minimal example invocation is sketched below.

Author: Broad Viral Genomics
viral-ngs@broadinstitute.org
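
For orientation, a minimal caller sketch follows: a hypothetical wrapper workflow (not part of this repository) that imports metagenomic_denovo and supplies only its required inputs. The import path, wrapper name, and the outputs it surfaces are placeholders to adjust for your own checkout and data.

version 1.0

# Hypothetical example wrapper; the import path is a placeholder.
import "pipes/WDL/workflows/metagenomic_denovo.wdl" as denovo

workflow run_metagenomic_denovo_example {
  input {
    File  fastq_1
    File? fastq_2              # omit for single-end data
    File  reference_fasta      # all segments/chromosomes in one fasta
    File  kraken2_db_tgz
    File  krona_taxonomy_tab
    File  ncbi_taxdump_tgz
    File  trim_clip_db
  }

  call denovo.metagenomic_denovo {
    input:
      sample_name            = "sampleA",     # placeholder sample name
      sequencing_platform    = "ILLUMINA",
      fastq_1                = fastq_1,
      fastq_2                = fastq_2,
      reference_genome_fasta = [reference_fasta],
      kraken2_db_tgz         = kraken2_db_tgz,
      krona_taxonomy_tab     = krona_taxonomy_tab,
      ncbi_taxdump_tgz       = ncbi_taxdump_tgz,
      trim_clip_db           = trim_clip_db
  }

  output {
    File assembly_fasta  = metagenomic_denovo.final_assembly_fasta
    File kraken2_summary = metagenomic_denovo.kraken2_summary_report
  }
}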

Inputs

Name Type Description Default
sample_name String Sample name. This is required and will populate the 'SM' read group value and will be used as the output filename (must be filename-friendly). -
fastq_1 File Unaligned read1 file in fastq format -
fastq_2 File? Unaligned read2 file in fastq format. This should be empty for single-end read conversion and required for paired-end reads. If provided, it must match fastq_1 in length and order. -
run_date_iso String? - -
sequencing_platform String Sequencing platform. This is required and will populate the 'PL' read group value. Must be one of CAPILLARY, DNBSEQ, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT, PACBIO, or SOLID. -
reference_genome_fasta Array[File]+ After de novo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromosomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing. -
kraken2_db_tgz File Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d. -
krona_taxonomy_tab File Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab -
ncbi_taxdump_tgz File An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file. -
filter_to_taxon_db File? Optional database used to filter the read set to reads that match it by LASTAL. Sequences in fasta format will be indexed on the fly. -
spikein_db File? ERCC/SDSI spike-in sequences -
trim_clip_db File Adapter sequences to remove via trimmomatic prior to SPAdes assembly -
readgroup_name String? - -
platform_unit String? - -
sequencing_center String? - -
additional_picard_options String? - -
machine_mem_gb Int? - -
min_base_qual Int? - -
taxonomic_ids Array[Int]? - -
minimum_hit_groups Int? - -
query_chunk_size Int? - -
taxonomic_ids Array[Int]? - -
minimum_hit_groups Int? - -
spades_min_contig_len Int? - -
spades_options String? - -
machine_mem_gb Int? - -
min_length_fraction Float? - -
min_unambig Float? - -
skani_m Int? - -
skani_s Int? - -
skani_c Int? - -
nucmer_max_gap Int? - -
nucmer_min_match Int? - -
nucmer_min_cluster Int? - -
scaffold_min_contig_len Int? - -
scaffold_min_pct_contig_aligned Float? - -
machine_mem_gb Int? - -
sample_original_name String? - -
novocraft_license File? - -
trim_coords_bed File? - -
machine_mem_gb Int? - -
min_keep_length Int? - -
sliding_window Int? - -
primer_offset Int? - -
machine_mem_gb Int? - -
reheader_table File? - -
amplicon_set String? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
machine_mem_gb Int? - -
reheader_table File? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
In addition, the workflow exposes 99 optional inputs with default values; see the WDL source below and the imported task files for the full set. Input names that appear more than once in this table (for example machine_mem_gb, taxonomic_ids, minimum_hit_groups) are nested inputs belonging to different task calls.

Outputs

Name Type Expression
final_assembly_fasta File refine.assembly_fasta
aligned_only_reads_bam File refine.align_to_self_merged_aligned_only_bam
coverage_plot File refine.align_to_self_merged_coverage_plot
assembly_length Int refine.assembly_length
assembly_length_unambiguous Int refine.assembly_length_unambiguous
reads_aligned Int refine.align_to_self_merged_reads_aligned
mean_coverage Float refine.align_to_self_merged_mean_coverage
kraken2_summary_report File kraken2.kraken2_summary_report
kraken2_krona_plot File kraken2.krona_report_html
raw_unmapped_bam File reads_bam
depleted_bam File dehosted_bam
taxfilt_bam File taxfiltered_bam
dedup_bam File rmdup_ubam.dedup_bam
denovo_in_bam File assemble.subsampBam
read_counts_raw Int deplete_k2.classified_taxonomic_filter_read_count_pre
read_counts_depleted Int select_first([deplete_taxa.depletion_read_count_post, deplete_k2.classified_taxonomic_filter_read_count_post])
read_counts_taxfilt Int select_first([filter_to_taxon.filter_read_count_post, filter_acellular.classified_taxonomic_filter_read_count_post])
read_counts_dedup Int rmdup_ubam.dedup_read_count_post
read_counts_denovo_in Int assemble.subsample_read_count
raw_fastqc File fastqc_raw.fastqc_html
depleted_fastqc File fastqc_dehosted.fastqc_html
taxfilt_fastqc File fastqc_taxfilt.fastqc_html
dedup_fastqc File rmdup_ubam.dedup_fastqc
contigs_fasta File assemble.contigs_fasta
scaffold_fasta File scaffold.scaffold_fasta
intermediate_scaffold_fasta File scaffold.intermediate_scaffold_fasta
intermediate_gapfill_fasta File scaffold.intermediate_gapfill_fasta
assembly_preimpute_length Int scaffold.assembly_preimpute_length
assembly_preimpute_length_unambiguous Int scaffold.assembly_preimpute_length_unambiguous
scaffolding_chosen_ref_names Array[String] scaffold.scaffolding_chosen_ref_names
scaffolding_stats File scaffold.scaffolding_stats
scaffolding_alt_contigs File scaffold.scaffolding_alt_contigs
replicate_concordant_sites Int refine.replicate_concordant_sites
replicate_discordant_snps Int refine.replicate_discordant_snps
replicate_discordant_indels Int refine.replicate_discordant_indels
num_read_groups Int refine.num_read_groups
num_libraries Int refine.num_libraries
replicate_discordant_vcf File refine.replicate_discordant_vcf
isnvs_vcf File refine.align_to_self_isnvs_vcf
aligned_bam File refine.align_to_self_merged_aligned_only_bam
aligned_only_reads_fastqc File refine.align_to_ref_fastqc
coverage_tsv File refine.align_to_self_merged_coverage_tsv
read_pairs_aligned Int refine.align_to_self_merged_read_pairs_aligned
bases_aligned Float refine.align_to_self_merged_bases_aligned
spikein_hits File? spikein.report
spikein_tophit String? spikein.top_hit_id
spikein_pct_of_total_reads String? spikein.pct_total_reads_mapped
spikein_pct_lesser_hits String? spikein.pct_lesser_hits_of_mapped
viral_classify_version String kraken2.viralngs_version
viral_assemble_version String assemble.viralngs_version

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS FastqToUBAM

Input Mappings (6)
Input Value
fastq_1 fastq_1
fastq_2 fastq_2
sample_name sample_name
library_name library_name
run_date run_date_iso
platform_name sequencing_platform

CALL TASKS fastqc_raw → fastqc

Input Mappings (1)
Input Value
reads_bam reads_bam

CALL TASKS spikein → align_and_count

Input Mappings (2)
Input Value
reads_bam reads_bam
ref_db select_first([spikein_db])

CALL TASKS kraken2

Input Mappings (3)
Input Value
reads_bam reads_bam
kraken2_db_tgz kraken2_db_tgz
krona_taxonomy_db_tgz krona_taxonomy_tab

CALL TASKS deplete_k2 → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam reads_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_dehost
out_filename_suffix "host_depleted"

CALL TASKS deplete_taxa

Input Mappings (4)
Input Value
raw_reads_unmapped_bam deplete_k2.bam_filtered_to_taxa
bmtaggerDbs deplete_bmtaggerDbs
blastDbs deplete_blastDbs
bwaDbs deplete_bwaDbs

CALL TASKS fastqc_dehosted → fastqc

Input Mappings (1)
Input Value
reads_bam dehosted_bam

CALL TASKS filter_acellular → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam dehosted_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_avoid_assembly
out_filename_suffix "acellular"

CALL TASKS filter_to_taxon

Input Mappings (2)
Input Value
reads_unmapped_bam dehosted_bam
lastal_db_fasta select_first([filter_to_taxon_db])

CALL TASKS fastqc_taxfilt → fastqc

Input Mappings (1)
Input Value
reads_bam taxfiltered_bam

CALL TASKS rmdup_ubam

Input Mappings (1)
Input Value
reads_unmapped_bam taxfiltered_bam

CALL TASKS assemble

Input Mappings (4)
Input Value
reads_unmapped_bam rmdup_ubam.dedup_bam
trim_clip_db trim_clip_db
always_succeed true
sample_name sample_name

CALL TASKS scaffold

Input Mappings (4)
Input Value
contigs_fasta assemble.contigs_fasta
reads_bam dehosted_bam
sample_name sample_name
reference_genome_fasta reference_genome_fasta

CALL WORKFLOW refine → assemble_refbased

Input Mappings (3)
Input Value
reads_unmapped_bams [dehosted_bam]
reference_fasta scaffold.scaffold_fasta
sample_name sample_name
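
Several of the calls above are conditional: spikein runs only when spikein_db is provided, deplete_taxa only when at least one bmtagger/blast/bwa depletion database is supplied, and filter_to_taxon only when filter_to_taxon_db is set. Downstream steps then take whichever BAM exists via select_first(), falling back to the upstream BAM when the optional call was skipped. The minimal sketch below illustrates that conditional-call-plus-fallback pattern using a hypothetical task; it is not one of this workflow's tasks.

version 1.0

# Hypothetical placeholder task, used only to illustrate the pattern.
task optional_cleanup {
  input {
    File in_bam
  }
  command <<<
    cp "~{in_bam}" cleaned.bam
  >>>
  output {
    File cleaned_bam = "cleaned.bam"
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow conditional_fallback_example {
  input {
    File        reads_bam
    Array[File] cleanup_dbs = []   # in this sketch the databases only gate the call
  }

  # Run the optional step only when databases were provided...
  if (length(cleanup_dbs) > 0) {
    call optional_cleanup {
      input: in_bam = reads_bam
    }
  }

  # ...and fall back to the unmodified input when the call was skipped.
  # Outside the if-block, optional_cleanup.cleaned_bam is a File?, so
  # select_first() resolves to the first defined value.
  File effective_bam = select_first([optional_cleanup.cleaned_bam, reads_bam])

  output {
    File out_bam = effective_bam
  }
}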

Images

Container images used by tasks in this workflow:

All images are parameterized via each task's docker input (~{docker}); four distinct image groups are used:

  • FastqToUBAM, rmdup_ubam (2 tasks)
  • fastqc_raw, fastqc_dehosted, fastqc_taxfilt, spikein (4 tasks)
  • kraken2, deplete_k2, filter_acellular, deplete_taxa, filter_to_taxon (5 tasks)
  • assemble, scaffold (2 tasks)

metagenomic_denovo - Workflow Graph

(Interactive workflow graph not reproduced here.)

metagenomic_denovo - WDL Source Code

version 1.0

import "../tasks/tasks_taxon_filter.wdl" as taxon_filter
import "../tasks/tasks_read_utils.wdl" as read_utils
import "../tasks/tasks_assembly.wdl" as assembly
import "../tasks/tasks_metagenomics.wdl" as metagenomics
import "../tasks/tasks_reports.wdl" as reports
import "assemble_refbased.wdl" as assemble_refbased

workflow metagenomic_denovo {

  meta {
      description: "Assisted de novo viral genome assembly (SPAdes, scaffolding, and polishing) from metagenomic raw reads. Runs raw reads through taxonomic classification (Kraken2), human read depletion (based on Kraken2 and optionally using BWA, BLASTN, and/or BMTAGGER databases), and FASTQC/multiQC of reads."
      author: "Broad Viral Genomics"
      email:  "viral-ngs@broadinstitute.org"
      allowNestedInputs: true
  }

  input {
    String        sample_name
    File          fastq_1
    File?         fastq_2
    String        library_name="1"
    String?       run_date_iso
    String        sequencing_platform

    Array[File]+  reference_genome_fasta

    File          kraken2_db_tgz
    File          krona_taxonomy_tab
    File          ncbi_taxdump_tgz

    Array[File]   deplete_bmtaggerDbs = []
    Array[File]   deplete_blastDbs = []
    Array[File]   deplete_bwaDbs =[]

    Array[String] taxa_to_dehost = ["Vertebrata"]
    Array[String] taxa_to_avoid_assembly = ["Vertebrata", "other sequences", "Bacteria"]

    File?         filter_to_taxon_db
    File?         spikein_db

    File          trim_clip_db
  }

  parameter_meta {
    fastq_1: { description: "Unaligned read1 file in fastq format", patterns: ["*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz"] }
    fastq_2: { description: "Unaligned read2 file in fastq format. This should be empty for single-end read conversion and required for paired-end reads. If provided, it must match fastq_1 in length and order.", patterns: ["*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz"] }
    sample_name: { description: "Sample name. This is required and will populate the 'SM' read group value and will be used as the output filename (must be filename-friendly)." }
    sequencing_platform: { description: "Sequencing platform. This is required and will populate the 'PL' read group value. Must be one of CAPILLARY, DNBSEQ, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT, PACBIO, or SOLID." }

    reference_genome_fasta: {
      description: "After denovo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing.",
      patterns: ["*.fasta"]
    }
    deplete_bmtaggerDbs: {
       description: "Optional list of databases to use for bmtagger-based depletion. Sequences in fasta format will be indexed on the fly, pre-bmtagger-indexed databases may be provided as tarballs.",
       patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_blastDbs: {
      description: "Optional list of databases to use for blastn-based depletion. Sequences in fasta format will be indexed on the fly, pre-blast-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_bwaDbs: {
      description: "Optional list of databases to use for bwa mem-based depletion. Sequences in fasta format will be indexed on the fly, pre-bwa-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    filter_to_taxon_db: {
      description: "Optional database to use to filter read set to those that match by LASTAL. Sequences in fasta format will be indexed on the fly.",
      patterns: ["*.fasta"]
    }
    spikein_db: {
      description: "ERCC/SDSI spike-in sequences",
      patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
    }
    trim_clip_db: {
      description: "Adapter sequences to remove via trimmomatic prior to SPAdes assembly",
      patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
    }
    kraken2_db_tgz: {
      description: "Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d.",
      patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    krona_taxonomy_tab: {
      description: "Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab",
      patterns: ["*.tab.zst", "*.tab.gz", "*.tab", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    ncbi_taxdump_tgz: {
      description: "An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file.",
      patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
  }

  # bundle 1 or 2 input files along with their metadata
  call read_utils.FastqToUBAM {
    input:
      fastq_1 = fastq_1,
      fastq_2 = fastq_2,
      sample_name = sample_name,
      library_name = library_name,
      run_date = run_date_iso,
      platform_name = sequencing_platform
  }
  File reads_bam = FastqToUBAM.unmapped_bam

  # metagenomics, QC, etc
  call reports.fastqc as fastqc_raw {
      input: reads_bam = reads_bam
  }
  if(defined(spikein_db)) {
    call reports.align_and_count as spikein {
        input:
            reads_bam = reads_bam,
            ref_db    = select_first([spikein_db])
    }
  }
  call metagenomics.kraken2 as kraken2 {
      input:
          reads_bam             = reads_bam,
          kraken2_db_tgz        = kraken2_db_tgz,
          krona_taxonomy_db_tgz = krona_taxonomy_tab
  }

  # deplete host reads: kraken2 followed by (optional) bmtagger, blastn, and/or bwa
  # resulting read set can be published to SRA
  call metagenomics.filter_bam_to_taxa as deplete_k2 {
      input:
          classified_bam          = reads_bam,
          classified_reads_txt_gz = kraken2.kraken2_reads_report,
          ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
          exclude_taxa            = true,
          taxonomic_names         = taxa_to_dehost,
          out_filename_suffix     = "host_depleted"
  }
  if(length(deplete_bmtaggerDbs) + length(deplete_blastDbs) + length(deplete_bwaDbs) > 0) {
    call taxon_filter.deplete_taxa {
      input:
        raw_reads_unmapped_bam = deplete_k2.bam_filtered_to_taxa,
        bmtaggerDbs            = deplete_bmtaggerDbs,
        blastDbs               = deplete_blastDbs,
        bwaDbs                 = deplete_bwaDbs
    }
  }
  File dehosted_bam = select_first([deplete_taxa.cleaned_bam, deplete_k2.bam_filtered_to_taxa])
  call reports.fastqc as fastqc_dehosted {
      input: reads_bam = dehosted_bam
  }

  # taxonomically focus/filter read set: remove Bacterial and artificial taxa (kraken2) and optionally filter to desired LASTAL database
  call metagenomics.filter_bam_to_taxa as filter_acellular {
      input:
          classified_bam          = dehosted_bam,
          classified_reads_txt_gz = kraken2.kraken2_reads_report,
          ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
          exclude_taxa            = true,
          taxonomic_names         = taxa_to_avoid_assembly,
          out_filename_suffix     = "acellular"
  }
  if(defined(filter_to_taxon_db)) {
    call taxon_filter.filter_to_taxon {
      input:
        reads_unmapped_bam = dehosted_bam,
        lastal_db_fasta    = select_first([filter_to_taxon_db])
    }
  }
  File taxfiltered_bam = select_first([filter_to_taxon.taxfilt_bam, dehosted_bam])
  call reports.fastqc as fastqc_taxfilt {
      input: reads_bam = taxfiltered_bam
  }

  # alignment-free duplicate removal
  call read_utils.rmdup_ubam {
    input:
      reads_unmapped_bam = taxfiltered_bam
  }

  # denovo assembly (with taxfiltered/rmdup reads)
  call assembly.assemble {
    input:
      reads_unmapped_bam = rmdup_ubam.dedup_bam,
      trim_clip_db       = trim_clip_db,
      always_succeed     = true,
      sample_name        = sample_name
  }

  # scaffold contigs to one or more ref genomes and impute with reference
  call assembly.scaffold {
    input:
      contigs_fasta           = assemble.contigs_fasta,
      reads_bam               = dehosted_bam,
      sample_name             = sample_name,
      reference_genome_fasta  = reference_genome_fasta
  }

  # polish/refine with dehosted reads
  call assemble_refbased.assemble_refbased as refine {
      input:
          reads_unmapped_bams = [dehosted_bam],
          reference_fasta     = scaffold.scaffold_fasta,
          sample_name         = sample_name
  }

  output {
    File    final_assembly_fasta                  = refine.assembly_fasta
    File    aligned_only_reads_bam                = refine.align_to_self_merged_aligned_only_bam
    File    coverage_plot                         = refine.align_to_self_merged_coverage_plot
    Int     assembly_length                       = refine.assembly_length
    Int     assembly_length_unambiguous           = refine.assembly_length_unambiguous
    Int     reads_aligned                         = refine.align_to_self_merged_reads_aligned
    Float   mean_coverage                         = refine.align_to_self_merged_mean_coverage

    File    kraken2_summary_report                = kraken2.kraken2_summary_report
    File    kraken2_krona_plot                    = kraken2.krona_report_html

    File    raw_unmapped_bam                      = reads_bam
    File    depleted_bam                          = dehosted_bam
    File    taxfilt_bam                           = taxfiltered_bam
    File    dedup_bam                             = rmdup_ubam.dedup_bam
    File    denovo_in_bam                         = assemble.subsampBam
        
    Int     read_counts_raw                       = deplete_k2.classified_taxonomic_filter_read_count_pre
    Int     read_counts_depleted                  = select_first([deplete_taxa.depletion_read_count_post, deplete_k2.classified_taxonomic_filter_read_count_post])
    Int     read_counts_taxfilt                   = select_first([filter_to_taxon.filter_read_count_post, filter_acellular.classified_taxonomic_filter_read_count_post])
    Int     read_counts_dedup                     = rmdup_ubam.dedup_read_count_post
    Int     read_counts_denovo_in                 = assemble.subsample_read_count

    File    raw_fastqc                            = fastqc_raw.fastqc_html
    File    depleted_fastqc                       = fastqc_dehosted.fastqc_html
    File    taxfilt_fastqc                        = fastqc_taxfilt.fastqc_html
    File    dedup_fastqc                          = rmdup_ubam.dedup_fastqc
    
    File    contigs_fasta                         = assemble.contigs_fasta
    File    scaffold_fasta                        = scaffold.scaffold_fasta
    File    intermediate_scaffold_fasta           = scaffold.intermediate_scaffold_fasta
    File    intermediate_gapfill_fasta            = scaffold.intermediate_gapfill_fasta
    Int     assembly_preimpute_length             = scaffold.assembly_preimpute_length
    Int     assembly_preimpute_length_unambiguous = scaffold.assembly_preimpute_length_unambiguous
    Array[String]  scaffolding_chosen_ref_names   = scaffold.scaffolding_chosen_ref_names
    File    scaffolding_stats                     = scaffold.scaffolding_stats
    File    scaffolding_alt_contigs               = scaffold.scaffolding_alt_contigs

    Int     replicate_concordant_sites            = refine.replicate_concordant_sites
    Int     replicate_discordant_snps             = refine.replicate_discordant_snps
    Int     replicate_discordant_indels           = refine.replicate_discordant_indels
    Int     num_read_groups                       = refine.num_read_groups
    Int     num_libraries                         = refine.num_libraries
    File    replicate_discordant_vcf              = refine.replicate_discordant_vcf

    File    isnvs_vcf                             = refine.align_to_self_isnvs_vcf
    
    File    aligned_bam                           = refine.align_to_self_merged_aligned_only_bam
    File    aligned_only_reads_fastqc             = refine.align_to_ref_fastqc
    File    coverage_tsv                          = refine.align_to_self_merged_coverage_tsv
    Int     read_pairs_aligned                    = refine.align_to_self_merged_read_pairs_aligned
    Float   bases_aligned                         = refine.align_to_self_merged_bases_aligned

    File?   spikein_hits                          = spikein.report
    String? spikein_tophit                        = spikein.top_hit_id
    String? spikein_pct_of_total_reads            = spikein.pct_total_reads_mapped
    String? spikein_pct_lesser_hits               = spikein.pct_lesser_hits_of_mapped

    String  viral_classify_version                = kraken2.viralngs_version
    String  viral_assemble_version                = assemble.viralngs_version
  }
}