
WORKFLOW metagenomic_denovo

File Path pipes/WDL/workflows/metagenomic_denovo.wdl
WDL Version 1.0
Type workflow

Imports

Namespace Path
taxon_filter ../tasks/tasks_taxon_filter.wdl
read_utils ../tasks/tasks_read_utils.wdl
assembly ../tasks/tasks_assembly.wdl
metagenomics ../tasks/tasks_metagenomics.wdl
reports ../tasks/tasks_reports.wdl
assemble_refbased assemble_refbased.wdl

Workflow: metagenomic_denovo

Assisted de novo viral genome assembly (SPAdes, scaffolding, and polishing) from metagenomic raw reads. Raw reads are run through taxonomic classification (Kraken2), human read depletion (based on Kraken2 and optionally using BWA, BLASTN, and/or BMTAGGER databases), and FastQC/MultiQC read QC. A minimal example invocation is sketched below.

Author: Broad Viral Genomics
viral-ngs@broadinstitute.org
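
For orientation, a minimal caller sketch follows: a hypothetical wrapper workflow (not part of this repository) that imports metagenomic_denovo and supplies only its required inputs. The import path, wrapper name, and the outputs it surfaces are placeholders to adjust for your own checkout and data.

version 1.0

# Hypothetical example wrapper; the import path is a placeholder.
import "pipes/WDL/workflows/metagenomic_denovo.wdl" as denovo

workflow run_metagenomic_denovo_example {
  input {
    File  fastq_1
    File? fastq_2              # omit for single-end data
    File  reference_fasta      # all segments/chromosomes in one fasta
    File  kraken2_db_tgz
    File  krona_taxonomy_tab
    File  ncbi_taxdump_tgz
    File  trim_clip_db
  }

  call denovo.metagenomic_denovo {
    input:
      sample_name            = "sampleA",     # placeholder sample name
      sequencing_platform    = "ILLUMINA",
      fastq_1                = fastq_1,
      fastq_2                = fastq_2,
      reference_genome_fasta = [reference_fasta],
      kraken2_db_tgz         = kraken2_db_tgz,
      krona_taxonomy_tab     = krona_taxonomy_tab,
      ncbi_taxdump_tgz       = ncbi_taxdump_tgz,
      trim_clip_db           = trim_clip_db
  }

  output {
    File assembly_fasta  = metagenomic_denovo.final_assembly_fasta
    File kraken2_summary = metagenomic_denovo.kraken2_summary_report
  }
}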

Inputs

Name Type Description Default
sample_name String Sample name. This is required and will populate the 'SM' read group value and will be used as the output filename (must be filename-friendly). -
fastq_1 File Unaligned read1 file in fastq format -
fastq_2 File? Unaligned read2 file in fastq format. This should be empty for single-end read conversion and required for paired-end reads. If provided, it must match fastq_1 in length and order. -
run_date_iso String? - -
sequencing_platform String Sequencing platform. This is required and will populate the 'PL' read group value. Must be one of CAPILLARY, DNBSEQ, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT, PACBIO, or SOLID. -
reference_genome_fasta Array[File]+ After de novo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromosomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing. -
kraken2_db_tgz File Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d. -
krona_taxonomy_tab File Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab -
ncbi_taxdump_tgz File An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file. -
filter_to_taxon_db File? Optional database used to filter the read set to reads that match it by LASTAL. Sequences in fasta format will be indexed on the fly. -
spikein_db File? ERCC/SDSI spike-in sequences -
trim_clip_db File Adapter sequences to remove via trimmomatic prior to SPAdes assembly -
readgroup_name String? - -
platform_unit String? - -
sequencing_center String? - -
additional_picard_options String? - -
machine_mem_gb Int? - -
min_base_qual Int? - -
taxonomic_ids Array[Int]? - -
minimum_hit_groups Int? - -
query_chunk_size Int? - -
taxonomic_ids Array[Int]? - -
minimum_hit_groups Int? - -
spades_min_contig_len Int? - -
spades_options String? - -
machine_mem_gb Int? - -
min_length_fraction Float? - -
min_unambig Float? - -
skani_m Int? - -
skani_s Int? - -
skani_c Int? - -
nucmer_max_gap Int? - -
nucmer_min_match Int? - -
nucmer_min_cluster Int? - -
scaffold_min_contig_len Int? - -
scaffold_min_pct_contig_aligned Float? - -
machine_mem_gb Int? - -
sample_original_name String? - -
novocraft_license File? - -
trim_coords_bed File? - -
machine_mem_gb Int? - -
min_keep_length Int? - -
sliding_window Int? - -
primer_offset Int? - -
machine_mem_gb Int? - -
reheader_table File? - -
amplicon_set String? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
machine_mem_gb Int? - -
reheader_table File? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
In addition, the workflow exposes 99 optional inputs with default values; see the WDL source below and the imported task files for the full set. Input names that appear more than once in this table (for example machine_mem_gb, taxonomic_ids, minimum_hit_groups) are nested inputs belonging to different task calls.

Outputs

Name Type Expression
final_assembly_fasta File refine.assembly_fasta
aligned_only_reads_bam File refine.align_to_self_merged_aligned_only_bam
coverage_plot File refine.align_to_self_merged_coverage_plot
assembly_length Int refine.assembly_length
assembly_length_unambiguous Int refine.assembly_length_unambiguous
reads_aligned Int refine.align_to_self_merged_reads_aligned
mean_coverage Float refine.align_to_self_merged_mean_coverage
kraken2_summary_report File kraken2.kraken2_summary_report
kraken2_krona_plot File kraken2.krona_report_html
raw_unmapped_bam File reads_bam
depleted_bam File dehosted_bam
taxfilt_bam File taxfiltered_bam
dedup_bam File rmdup_ubam.dedup_bam
denovo_in_bam File assemble.subsampBam
read_counts_raw Int deplete_k2.classified_taxonomic_filter_read_count_pre
read_counts_depleted Int select_first([deplete_taxa.depletion_read_count_post, deplete_k2.classified_taxonomic_filter_read_count_post])
read_counts_taxfilt Int select_first([filter_to_taxon.filter_read_count_post, filter_acellular.classified_taxonomic_filter_read_count_post])
read_counts_dedup Int rmdup_ubam.dedup_read_count_post
read_counts_denovo_in Int assemble.subsample_read_count
raw_fastqc File fastqc_raw.fastqc_html
depleted_fastqc File fastqc_dehosted.fastqc_html
taxfilt_fastqc File fastqc_taxfilt.fastqc_html
dedup_fastqc File rmdup_ubam.dedup_fastqc
contigs_fasta File assemble.contigs_fasta
scaffold_fasta File scaffold.scaffold_fasta
intermediate_scaffold_fasta File scaffold.intermediate_scaffold_fasta
intermediate_gapfill_fasta File scaffold.intermediate_gapfill_fasta
assembly_preimpute_length Int scaffold.assembly_preimpute_length
assembly_preimpute_length_unambiguous Int scaffold.assembly_preimpute_length_unambiguous
scaffolding_chosen_ref_names Array[String] scaffold.scaffolding_chosen_ref_names
scaffolding_stats File scaffold.scaffolding_stats
scaffolding_alt_contigs File scaffold.scaffolding_alt_contigs
replicate_concordant_sites Int refine.replicate_concordant_sites
replicate_discordant_snps Int refine.replicate_discordant_snps
replicate_discordant_indels Int refine.replicate_discordant_indels
num_read_groups Int refine.num_read_groups
num_libraries Int refine.num_libraries
replicate_discordant_vcf File refine.replicate_discordant_vcf
isnvs_vcf File refine.align_to_self_isnvs_vcf
aligned_bam File refine.align_to_self_merged_aligned_only_bam
aligned_only_reads_fastqc File refine.align_to_ref_fastqc
coverage_tsv File refine.align_to_self_merged_coverage_tsv
read_pairs_aligned Int refine.align_to_self_merged_read_pairs_aligned
bases_aligned Float refine.align_to_self_merged_bases_aligned
spikein_hits File? spikein.report
spikein_tophit String? spikein.top_hit_id
spikein_pct_of_total_reads String? spikein.pct_total_reads_mapped
spikein_pct_lesser_hits String? spikein.pct_lesser_hits_of_mapped
viral_classify_version String kraken2.viralngs_version
viral_assemble_version String assemble.viralngs_version

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS FastqToUBAM

Input Mappings (6)
Input Value
fastq_1 fastq_1
fastq_2 fastq_2
sample_name sample_name
library_name library_name
run_date run_date_iso
platform_name sequencing_platform

CALL TASKS fastqc_raw → fastqc

Input Mappings (1)
Input Value
reads_bam reads_bam

CALL TASKS spikein → align_and_count

Input Mappings (2)
Input Value
reads_bam reads_bam
ref_db select_first([spikein_db])

CALL TASKS kraken2

Input Mappings (3)
Input Value
reads_bam reads_bam
kraken2_db_tgz kraken2_db_tgz
krona_taxonomy_db_tgz krona_taxonomy_tab

CALL TASKS deplete_k2 → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam reads_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_dehost
out_filename_suffix "host_depleted"

CALL TASKS deplete_taxa

Input Mappings (4)
Input Value
raw_reads_unmapped_bam deplete_k2.bam_filtered_to_taxa
bmtaggerDbs deplete_bmtaggerDbs
blastDbs deplete_blastDbs
bwaDbs deplete_bwaDbs

CALL TASKS fastqc_dehosted → fastqc

Input Mappings (1)
Input Value
reads_bam dehosted_bam

CALL TASKS filter_acellular → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam dehosted_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_avoid_assembly
out_filename_suffix "acellular"

CALL TASKS filter_to_taxon

Input Mappings (2)
Input Value
reads_unmapped_bam dehosted_bam
lastal_db_fasta select_first([filter_to_taxon_db])

CALL TASKS fastqc_taxfilt → fastqc

Input Mappings (1)
Input Value
reads_bam taxfiltered_bam

CALL TASKS rmdup_ubam

Input Mappings (1)
Input Value
reads_unmapped_bam taxfiltered_bam

CALL TASKS assemble

Input Mappings (4)
Input Value
reads_unmapped_bam rmdup_ubam.dedup_bam
trim_clip_db trim_clip_db
always_succeed true
sample_name sample_name

CALL TASKS scaffold

Input Mappings (4)
Input Value
contigs_fasta assemble.contigs_fasta
reads_bam dehosted_bam
sample_name sample_name
reference_genome_fasta reference_genome_fasta

CALL WORKFLOW refine → assemble_refbased

Input Mappings (3)
Input Value
reads_unmapped_bams [dehosted_bam]
reference_fasta scaffold.scaffold_fasta
sample_name sample_name
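
Several of the calls above are conditional: spikein runs only when spikein_db is provided, deplete_taxa only when at least one bmtagger/blast/bwa depletion database is supplied, and filter_to_taxon only when filter_to_taxon_db is set. Downstream steps then take whichever BAM exists via select_first(), falling back to the upstream BAM when the optional call was skipped. The minimal sketch below illustrates that conditional-call-plus-fallback pattern using a hypothetical task; it is not one of this workflow's tasks.

version 1.0

# Hypothetical placeholder task, used only to illustrate the pattern.
task optional_cleanup {
  input {
    File in_bam
  }
  command <<<
    cp "~{in_bam}" cleaned.bam
  >>>
  output {
    File cleaned_bam = "cleaned.bam"
  }
  runtime {
    docker: "ubuntu:22.04"
  }
}

workflow conditional_fallback_example {
  input {
    File        reads_bam
    Array[File] cleanup_dbs = []   # in this sketch the databases only gate the call
  }

  # Run the optional step only when databases were provided...
  if (length(cleanup_dbs) > 0) {
    call optional_cleanup {
      input: in_bam = reads_bam
    }
  }

  # ...and fall back to the unmodified input when the call was skipped.
  # Outside the if-block, optional_cleanup.cleaned_bam is a File?, so
  # select_first() resolves to the first defined value.
  File effective_bam = select_first([optional_cleanup.cleaned_bam, reads_bam])

  output {
    File out_bam = effective_bam
  }
}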

Images

Container images used by tasks in this workflow:

All images are parameterized via each task's docker input (~{docker}); four distinct image groups are used:

  • FastqToUBAM, rmdup_ubam (2 tasks)
  • fastqc_raw, fastqc_dehosted, fastqc_taxfilt, spikein (4 tasks)
  • kraken2, deplete_k2, filter_acellular, deplete_taxa, filter_to_taxon (5 tasks)
  • assemble, scaffold (2 tasks)

metagenomic_denovo - Workflow Graph

(Interactive workflow graph not reproduced here.)

metagenomic_denovo - WDL Source Code

version 1.0

import "../tasks/tasks_taxon_filter.wdl" as taxon_filter
import "../tasks/tasks_read_utils.wdl" as read_utils
import "../tasks/tasks_assembly.wdl" as assembly
import "../tasks/tasks_metagenomics.wdl" as metagenomics
import "../tasks/tasks_reports.wdl" as reports
import "assemble_refbased.wdl" as assemble_refbased

workflow metagenomic_denovo {

  meta {
      description: "Assisted de novo viral genome assembly (SPAdes, scaffolding, and polishing) from metagenomic raw reads. Runs raw reads through taxonomic classification (Kraken2), human read depletion (based on Kraken2 and optionally using BWA, BLASTN, and/or BMTAGGER databases), and FASTQC/multiQC of reads."
      author: "Broad Viral Genomics"
      email:  "viral-ngs@broadinstitute.org"
      allowNestedInputs: true
  }

  input {
    String        sample_name
    File          fastq_1
    File?         fastq_2
    String        library_name="1"
    String?       run_date_iso
    String        sequencing_platform

    Array[File]+  reference_genome_fasta

    File          kraken2_db_tgz
    File          krona_taxonomy_tab
    File          ncbi_taxdump_tgz

    Array[File]   deplete_bmtaggerDbs = []
    Array[File]   deplete_blastDbs = []
    Array[File]   deplete_bwaDbs =[]

    Array[String] taxa_to_dehost = ["Vertebrata"]
    Array[String] taxa_to_avoid_assembly = ["Vertebrata", "other sequences", "Bacteria"]

    File?         filter_to_taxon_db
    File?         spikein_db

    File          trim_clip_db
  }

  parameter_meta {
    fastq_1: { description: "Unaligned read1 file in fastq format", patterns: ["*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz"] }
    fastq_2: { description: "Unaligned read2 file in fastq format. This should be empty for single-end read conversion and required for paired-end reads. If provided, it must match fastq_1 in length and order.", patterns: ["*.fastq", "*.fastq.gz", "*.fq", "*.fq.gz"] }
    sample_name: { description: "Sample name. This is required and will populate the 'SM' read group value and will be used as the output filename (must be filename-friendly)." }
    sequencing_platform: { description: "Sequencing platform. This is required and will populate the 'PL' read group value. Must be one of CAPILLARY, DNBSEQ, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT, PACBIO, or SOLID." }

    reference_genome_fasta: {
      description: "After denovo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing.",
      patterns: ["*.fasta"]
    }
    deplete_bmtaggerDbs: {
       description: "Optional list of databases to use for bmtagger-based depletion. Sequences in fasta format will be indexed on the fly, pre-bmtagger-indexed databases may be provided as tarballs.",
       patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_blastDbs: {
      description: "Optional list of databases to use for blastn-based depletion. Sequences in fasta format will be indexed on the fly, pre-blast-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_bwaDbs: {
      description: "Optional list of databases to use for bwa mem-based depletion. Sequences in fasta format will be indexed on the fly, pre-bwa-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    filter_to_taxon_db: {
      description: "Optional database to use to filter read set to those that match by LASTAL. Sequences in fasta format will be indexed on the fly.",
      patterns: ["*.fasta"]
    }
    spikein_db: {
      description: "ERCC/SDSI spike-in sequences",
      patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
    }
    trim_clip_db: {
      description: "Adapter sequences to remove via trimmomatic prior to SPAdes assembly",
      patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
    }
    kraken2_db_tgz: {
      description: "Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d.",
      patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    krona_taxonomy_tab: {
      description: "Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab",
      patterns: ["*.tab.zst", "*.tab.gz", "*.tab", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    ncbi_taxdump_tgz: {
      description: "An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file.",
      patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
  }

  # bundle 1 or 2 input files along with their metadata
  call read_utils.FastqToUBAM {
    input:
      fastq_1 = fastq_1,
      fastq_2 = fastq_2,
      sample_name = sample_name,
      library_name = library_name,
      run_date = run_date_iso,
      platform_name = sequencing_platform
  }
  File reads_bam = FastqToUBAM.unmapped_bam

  # metagenomics, QC, etc
  call reports.fastqc as fastqc_raw {
      input: reads_bam = reads_bam
  }
  if(defined(spikein_db)) {
    call reports.align_and_count as spikein {
        input:
            reads_bam = reads_bam,
            ref_db    = select_first([spikein_db])
    }
  }
  call metagenomics.kraken2 as kraken2 {
      input:
          reads_bam             = reads_bam,
          kraken2_db_tgz        = kraken2_db_tgz,
          krona_taxonomy_db_tgz = krona_taxonomy_tab
  }

  # deplete host reads: kraken2 followed by (optional) bmtagger, blastn, and/or bwa
  # resulting read set can be published to SRA
  call metagenomics.filter_bam_to_taxa as deplete_k2 {
      input:
          classified_bam          = reads_bam,
          classified_reads_txt_gz = kraken2.kraken2_reads_report,
          ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
          exclude_taxa            = true,
          taxonomic_names         = taxa_to_dehost,
          out_filename_suffix     = "host_depleted"
  }
  if(length(deplete_bmtaggerDbs) + length(deplete_blastDbs) + length(deplete_bwaDbs) > 0) {
    call taxon_filter.deplete_taxa {
      input:
        raw_reads_unmapped_bam = deplete_k2.bam_filtered_to_taxa,
        bmtaggerDbs            = deplete_bmtaggerDbs,
        blastDbs               = deplete_blastDbs,
        bwaDbs                 = deplete_bwaDbs
    }
  }
  File dehosted_bam = select_first([deplete_taxa.cleaned_bam, deplete_k2.bam_filtered_to_taxa])
  call reports.fastqc as fastqc_dehosted {
      input: reads_bam = dehosted_bam
  }

  # taxonomically focus/filter read set: remove Bacterial and artificial taxa (kraken2) and optionally filter to desired LASTAL database
  call metagenomics.filter_bam_to_taxa as filter_acellular {
      input:
          classified_bam          = dehosted_bam,
          classified_reads_txt_gz = kraken2.kraken2_reads_report,
          ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
          exclude_taxa            = true,
          taxonomic_names         = taxa_to_avoid_assembly,
          out_filename_suffix     = "acellular"
  }
  if(defined(filter_to_taxon_db)) {
    call taxon_filter.filter_to_taxon {
      input:
        reads_unmapped_bam = dehosted_bam,
        lastal_db_fasta    = select_first([filter_to_taxon_db])
    }
  }
  File taxfiltered_bam = select_first([filter_to_taxon.taxfilt_bam, dehosted_bam])
  call reports.fastqc as fastqc_taxfilt {
      input: reads_bam = taxfiltered_bam
  }

  # alignment-free duplicate removal
  call read_utils.rmdup_ubam {
    input:
      reads_unmapped_bam = taxfiltered_bam
  }

  # denovo assembly (with taxfiltered/rmdup reads)
  call assembly.assemble {
    input:
      reads_unmapped_bam = rmdup_ubam.dedup_bam,
      trim_clip_db       = trim_clip_db,
      always_succeed     = true,
      sample_name        = sample_name
  }

  # scaffold contigs to one or more ref genomes and impute with reference
  call assembly.scaffold {
    input:
      contigs_fasta           = assemble.contigs_fasta,
      reads_bam               = dehosted_bam,
      sample_name             = sample_name,
      reference_genome_fasta  = reference_genome_fasta
  }

  # polish/refine with dehosted reads
  call assemble_refbased.assemble_refbased as refine {
      input:
          reads_unmapped_bams = [dehosted_bam],
          reference_fasta     = scaffold.scaffold_fasta,
          sample_name         = sample_name
  }

  output {
    File    final_assembly_fasta                  = refine.assembly_fasta
    File    aligned_only_reads_bam                = refine.align_to_self_merged_aligned_only_bam
    File    coverage_plot                         = refine.align_to_self_merged_coverage_plot
    Int     assembly_length                       = refine.assembly_length
    Int     assembly_length_unambiguous           = refine.assembly_length_unambiguous
    Int     reads_aligned                         = refine.align_to_self_merged_reads_aligned
    Float   mean_coverage                         = refine.align_to_self_merged_mean_coverage

    File    kraken2_summary_report                = kraken2.kraken2_summary_report
    File    kraken2_krona_plot                    = kraken2.krona_report_html

    File    raw_unmapped_bam                      = reads_bam
    File    depleted_bam                          = dehosted_bam
    File    taxfilt_bam                           = taxfiltered_bam
    File    dedup_bam                             = rmdup_ubam.dedup_bam
    File    denovo_in_bam                         = assemble.subsampBam
        
    Int     read_counts_raw                       = deplete_k2.classified_taxonomic_filter_read_count_pre
    Int     read_counts_depleted                  = select_first([deplete_taxa.depletion_read_count_post, deplete_k2.classified_taxonomic_filter_read_count_post])
    Int     read_counts_taxfilt                   = select_first([filter_to_taxon.filter_read_count_post, filter_acellular.classified_taxonomic_filter_read_count_post])
    Int     read_counts_dedup                     = rmdup_ubam.dedup_read_count_post
    Int     read_counts_denovo_in                 = assemble.subsample_read_count

    File    raw_fastqc                            = fastqc_raw.fastqc_html
    File    depleted_fastqc                       = fastqc_dehosted.fastqc_html
    File    taxfilt_fastqc                        = fastqc_taxfilt.fastqc_html
    File    dedup_fastqc                          = rmdup_ubam.dedup_fastqc
    
    File    contigs_fasta                         = assemble.contigs_fasta
    File    scaffold_fasta                        = scaffold.scaffold_fasta
    File    intermediate_scaffold_fasta           = scaffold.intermediate_scaffold_fasta
    File    intermediate_gapfill_fasta            = scaffold.intermediate_gapfill_fasta
    Int     assembly_preimpute_length             = scaffold.assembly_preimpute_length
    Int     assembly_preimpute_length_unambiguous = scaffold.assembly_preimpute_length_unambiguous
    Array[String]  scaffolding_chosen_ref_names   = scaffold.scaffolding_chosen_ref_names
    File    scaffolding_stats                     = scaffold.scaffolding_stats
    File    scaffolding_alt_contigs               = scaffold.scaffolding_alt_contigs

    Int     replicate_concordant_sites            = refine.replicate_concordant_sites
    Int     replicate_discordant_snps             = refine.replicate_discordant_snps
    Int     replicate_discordant_indels           = refine.replicate_discordant_indels
    Int     num_read_groups                       = refine.num_read_groups
    Int     num_libraries                         = refine.num_libraries
    File    replicate_discordant_vcf              = refine.replicate_discordant_vcf

    File    isnvs_vcf                             = refine.align_to_self_isnvs_vcf
    
    File    aligned_bam                           = refine.align_to_self_merged_aligned_only_bam
    File    aligned_only_reads_fastqc             = refine.align_to_ref_fastqc
    File    coverage_tsv                          = refine.align_to_self_merged_coverage_tsv
    Int     read_pairs_aligned                    = refine.align_to_self_merged_read_pairs_aligned
    Float   bases_aligned                         = refine.align_to_self_merged_bases_aligned

    File?   spikein_hits                          = spikein.report
    String? spikein_tophit                        = spikein.top_hit_id
    String? spikein_pct_of_total_reads            = spikein.pct_total_reads_mapped
    String? spikein_pct_lesser_hits               = spikein.pct_lesser_hits_of_mapped

    String  viral_classify_version                = kraken2.viralngs_version
    String  viral_assemble_version                = assemble.viralngs_version
  }
}