classify_single

WORKFLOW classify_single

File Path pipes/WDL/workflows/classify_single.wdl
WDL Version 1.0
Type workflow

Imports

Namespace Path
metagenomics ../tasks/tasks_metagenomics.wdl
read_utils ../tasks/tasks_read_utils.wdl
assembly ../tasks/tasks_assembly.wdl
ncbi ../tasks/tasks_ncbi.wdl
reports ../tasks/tasks_reports.wdl
utils ../tasks/tasks_utils.wdl

Workflow: classify_single

Runs raw reads through taxonomic classification (Kraken2), human read depletion (based on Kraken2), de novo assembly (SPAdes), and FASTQC/multiQC of reads.

Author: Broad Viral Genomics
viral-ngs@broadinstitute.org

Inputs

Name Type Description Default
reads_bams Array[File]+ Reads to classify. May be unmapped or mapped or both, paired-end or single-end. Multiple input files will be merged first. -
ncbi_taxdump_tgz File An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file. -
spikein_db File? ERCC spike-in sequences -
trim_clip_db File Adapter sequences to remove via trimmomatic prior to SPAdes assembly -
kraken2_db_tgz File Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d. -
krona_taxonomy_db_kraken2_tgz File Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab -
taxid_to_ref_accessions_tsv File? - -
sample_name String? - -
reheader_table File? - -
machine_mem_gb Int? - -
min_base_qual Int? - -
taxonomic_ids Array[Int]? - -
minimum_hit_groups Int? - -
spades_min_contig_len Int? - -
spades_options String? - -
emailAddress String - -
skani_m Int? - -
skani_s Int? - -
skani_c Int? - -
skani_n Int? - -
38 optional inputs with default values
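
A minimal sketch of invoking classify_single as a subworkflow with only the required inputs; the import path, workflow name, and file names here are placeholders:

version 1.0

import "classify_single.wdl" as classify

workflow classify_single_demo {
    input {
        Array[File]+ bams    # one or more BAMs; the subworkflow merges them first
        File taxdump_tgz     # NCBI taxdump containing nodes.dmp and names.dmp
        File trim_db         # trimmomatic adapter FASTA
        File kraken2_db      # tarball with hash.k2d, opts.k2d, taxo.k2d
        File krona_db        # Krona taxonomy.tab, possibly compressed
    }
    call classify.classify_single {
        input:
            reads_bams                    = bams,
            ncbi_taxdump_tgz              = taxdump_tgz,
            trim_clip_db                  = trim_db,
            kraken2_db_tgz                = kraken2_db,
            krona_taxonomy_db_kraken2_tgz = krona_db
    }
    output {
        File contigs = classify_single.contigs_fasta
    }
}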

Outputs

Name Type Expression
cleaned_reads_unaligned_bam File deplete.bam_filtered_to_taxa
deduplicated_reads_unaligned File rmdup_ubam.dedup_bam
contigs_fasta File spades.contigs_fasta
read_counts_raw Int deplete.classified_taxonomic_filter_read_count_pre
read_counts_depleted Int deplete.classified_taxonomic_filter_read_count_post
read_counts_dedup Int rmdup_ubam.dedup_read_count_post
read_counts_prespades_subsample Int spades.subsample_read_count
kraken2_summary_report File kraken2.kraken2_summary_report
kraken2_reads_report File kraken2.kraken2_reads_report
kraken2_krona_plot File kraken2.krona_report_html
kraken2_top_taxa_report File report_primary_kraken_taxa.ranked_focal_report
kraken2_focal_taxon_name String report_primary_kraken_taxa.focal_tax_name
kraken2_focal_total_reads Int report_primary_kraken_taxa.total_focal_reads
kraken2_top_taxon_id String report_primary_kraken_taxa.tax_id
kraken2_top_taxon_name String report_primary_kraken_taxa.tax_name
kraken2_top_taxon_rank String report_primary_kraken_taxa.tax_rank
kraken2_top_taxon_num_reads Int report_primary_kraken_taxa.num_reads
kraken2_top_taxon_pct_of_focal Float report_primary_kraken_taxa.percent_of_focal
skani_num_hits Int length(select_first([select_references.top_matches_per_cluster_basenames, []]))
skani_contigs_to_refs_dist_tsv File? select_references.skani_dist_full_tsv
skani_hits_taxids Array[Int]? skani_hit_taxid
skani_hits_taxnames Array[String]? skani_hit_taxname
raw_fastqc File? merge_raw_reads.fastqc
cleaned_fastqc File? deplete.fastqc_html_report
spikein_report File? spikein.report
spikein_tophit String? spikein.top_hit_id
spikein_pct_of_total_reads String? spikein.pct_total_reads_mapped
spikein_pct_lesser_hits String? spikein.pct_lesser_hits_of_mapped
kraken2_viral_classify_version String kraken2.viralngs_version
deplete_viral_classify_version String deplete.viralngs_version
spades_viral_assemble_version String spades.viralngs_version
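
Note that the skani_* and spikein_* outputs are optional: they are only populated when taxid_to_ref_accessions_tsv or spikein_db, respectively, is supplied. A hedged fragment (placed in the output section of the calling workflow from the sketch above) showing one way to guard them:

output {
    # falls back to an empty array when the skani branch did not run
    Array[Int] hit_taxids   = select_first([classify_single.skani_hits_taxids, []])
    # true only if spikein_db was provided, so the spikein task actually ran
    Boolean    have_spikein = defined(classify_single.spikein_report)
}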

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS merge_raw_reads → merge_and_reheader_bams

Input Mappings (1)
Input Value
in_bams reads_bams

CALL TASKS spikein → align_and_count

Input Mappings (2)
Input Value
reads_bam reads_bam
ref_db select_first([spikein_db])
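
The spikein task only runs when spikein_db is supplied; inside the if (defined(...)) block, select_first([spikein_db]) unwraps File? to File. A minimal sketch of this WDL idiom with hypothetical names:

version 1.0

workflow optional_call_demo {
    input {
        File? maybe_db    # hypothetical optional input
    }
    if (defined(maybe_db)) {
        # select_first is safe here: the if() guard guarantees maybe_db is defined
        File unwrapped_db = select_first([maybe_db])
    }
    output {
        # values declared inside if() become optional outside it
        File? db_used = unwrapped_db
    }
}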

CALL TASKS kraken2

Input Mappings (3)
Input Value
reads_bam reads_bam
kraken2_db_tgz kraken2_db_tgz
krona_taxonomy_db_tgz krona_taxonomy_db_kraken2_tgz

CALL TASKS deplete → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam reads_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_dehost
out_filename_suffix "hs_depleted"

CALL TASKS filter_acellular → filter_bam_to_taxa

Input Mappings (6)
Input Value
classified_bam reads_bam
classified_reads_txt_gz kraken2.kraken2_reads_report
ncbi_taxonomy_db_tgz ncbi_taxdump_tgz
exclude_taxa true
taxonomic_names taxa_to_avoid_assembly
out_filename_suffix "acellular"

CALL TASKS rmdup_ubam

Input Mappings (1)
Input Value
reads_unmapped_bam filter_acellular.bam_filtered_to_taxa

CALL TASKS spades → assemble

Input Mappings (3)
Input Value
reads_unmapped_bam rmdup_ubam.dedup_bam
trim_clip_db trim_clip_db
always_succeed true

CALL TASKS report_primary_kraken_taxa

Input Mappings (1)
Input Value
kraken_summary_report kraken2.kraken2_summary_report

CALL TASKS string_split

Input Mappings (2)
Input Value
joined_string taxon[3]
delimiter ":"

CALL TASKS download_annotations

Input Mappings (2)
Input Value
accessions string_split.tokens
combined_out_prefix sub(taxon[3],":","-")

CALL TASKS select_references

Input Mappings (2)
Input Value
reference_genomes_fastas download_annotations.combined_fasta
contigs_fasta spades.contigs_fasta

CALL TASKS tax_lookup → fetch_row_from_tsv

Input Mappings (4)
Input Value
tsv select_first([taxid_to_ref_accessions_tsv])
idx_col "accessions"
idx_val sub(top_match,"-",":")
add_header ["taxid", "isolate_prefix", "taxname", "accessions"]
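
Each row of taxid_to_ref_accessions_tsv is expected as [taxid, isolate_prefix, taxname, colon_delimited_accession_list] (per the comments in the WDL source below). Colons are swapped for dashes to build filesystem-safe filenames (the source notes singularity does not like colons in filenames), and the reverse substitution rebuilds the TSV lookup key. A sketch of that round-trip with a hypothetical accession list:

version 1.0

workflow accession_key_demo {
    input {
        # hypothetical value from column 4 of taxid_to_ref_accessions_tsv
        String accession_list = "NC_004296.1:NC_004297.1"
    }
    # sub() replaces every match, so this assumes accessions contain no "-"
    String file_safe = sub(accession_list, ":", "-")  # "NC_004296.1-NC_004297.1"
    # reversing the substitution recovers the key used by tax_lookup
    String tsv_key   = sub(file_safe, "-", ":")
    output {
        String prefix = file_safe
        String key    = tsv_key
    }
}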

Images

Container images used by tasks in this workflow:

Parameterized image (configured via input: docker)
  Used by 2 tasks: merge_raw_reads, rmdup_ubam

Parameterized image (configured via input: docker)
  Used by 4 tasks: kraken2, deplete, filter_acellular, report_primary_kraken_taxa

Parameterized image (configured via input: docker)
  Used by 2 tasks: spades, select_references

~{docker} (parameterized image expression)
  Used by 1 task: spikein

python:slim
  Used by 2 tasks: string_split, tax_lookup

Parameterized image (configured via input: docker)
  Used by 1 task: download_annotations

classify_single - Workflow Graph

[Interactive workflow graph not reproduced here; see the WDL source below for the call structure.]

classify_single - WDL Source Code

version 1.0

import "../tasks/tasks_metagenomics.wdl" as metagenomics
import "../tasks/tasks_read_utils.wdl" as read_utils
import "../tasks/tasks_assembly.wdl" as assembly
import "../tasks/tasks_ncbi.wdl" as ncbi
import "../tasks/tasks_reports.wdl" as reports
import "../tasks/tasks_utils.wdl" as utils

workflow classify_single {
    meta {
         description: "Runs raw reads through taxonomic classification (Kraken2), human read depletion (based on Kraken2), de novo assembly (SPAdes), and FASTQC/multiQC of reads."
         author: "Broad Viral Genomics"
         email:  "viral-ngs@broadinstitute.org"
         allowNestedInputs: true
    }

    input {
        Array[File]+ reads_bams

        File ncbi_taxdump_tgz

        File? spikein_db
        File trim_clip_db

        File kraken2_db_tgz
        File krona_taxonomy_db_kraken2_tgz

        Array[String] taxa_to_dehost         = ["Vertebrata"]
        Array[String] taxa_to_avoid_assembly = ["Vertebrata", "other sequences", "Bacteria"]

        File?         taxid_to_ref_accessions_tsv
    }

    parameter_meta {
        reads_bams: {
          description: "Reads to classify. May be unmapped or mapped or both, paired-end or single-end. Multiple input files will be merged first.",
          patterns: ["*.bam"]
        }
        spikein_db: {
          description: "ERCC spike-in sequences",
          patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
        }
        trim_clip_db: {
          description: "Adapter sequences to remove via trimmomatic prior to SPAdes assembly",
          patterns: ["*.fasta", "*.fasta.gz", "*.fasta.zst"]
        }
        kraken2_db_tgz: {
          description: "Pre-built Kraken database tarball containing three files: hash.k2d, opts.k2d, and taxo.k2d.",
          patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
        }
        krona_taxonomy_db_kraken2_tgz: {
          description: "Krona taxonomy database containing a single file: taxonomy.tab, or possibly just a compressed taxonomy.tab",
          patterns: ["*.tab.zst", "*.tab.gz", "*.tab", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
        }
        ncbi_taxdump_tgz: {
          description: "An NCBI taxdump.tar.gz file that contains, at the minimum, a nodes.dmp and names.dmp file.",
          patterns: ["*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
        }
        deduplicated_reads_unaligned: {
          description: "Deduplication on unaligned reads in BAM format using mvicuna or cdhit.",
          category: "other"
        }
        kraken2_krona_plot: {
          description:"Visualize the results of the Kraken2 analysis with Krona, which disaplys taxonmic hierarchiral data in multi-layerd pie.",
          category:"other"
        }
        kraken2_summary_report: {
          description: "Kraken report output file.",
          category: "other"
        }

    }

    call read_utils.merge_and_reheader_bams as merge_raw_reads {
        input:
            in_bams      = reads_bams
    }
    File reads_bam = merge_raw_reads.out_bam
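    # the merged BAM fans out to spike-in counting, Kraken2 classification,
    # and both filter_bam_to_taxa calls below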

    if(defined(spikein_db)) {
      call reports.align_and_count as spikein {
          input:
              reads_bam = reads_bam,
              ref_db    = select_first([spikein_db])
      }
    }
    call metagenomics.kraken2 as kraken2 {
        input:
            reads_bam             = reads_bam,
            kraken2_db_tgz        = kraken2_db_tgz,
            krona_taxonomy_db_tgz = krona_taxonomy_db_kraken2_tgz
    }
    call metagenomics.filter_bam_to_taxa as deplete {
        input:
            classified_bam          = reads_bam,
            classified_reads_txt_gz = kraken2.kraken2_reads_report,
            ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
            exclude_taxa            = true,
            taxonomic_names         = taxa_to_dehost,
            out_filename_suffix     = "hs_depleted"
    }
    call metagenomics.filter_bam_to_taxa as filter_acellular {
        input:
            classified_bam          = reads_bam,
            classified_reads_txt_gz = kraken2.kraken2_reads_report,
            ncbi_taxonomy_db_tgz    = ncbi_taxdump_tgz,
            exclude_taxa            = true,
            taxonomic_names         = taxa_to_avoid_assembly,
            out_filename_suffix     = "acellular"
    }
    call read_utils.rmdup_ubam {
       input:
            reads_unmapped_bam = filter_acellular.bam_filtered_to_taxa
    }
    call assembly.assemble as spades {
        input:
            reads_unmapped_bam = rmdup_ubam.dedup_bam,
            trim_clip_db       = trim_clip_db,
            always_succeed     = true
    }
    call metagenomics.report_primary_kraken_taxa {
        input:
            kraken_summary_report = kraken2.kraken2_summary_report
    }

    if(defined(taxid_to_ref_accessions_tsv)) {
      # download (multi-segment) genomes for each reference, fasta filename = colon-concatenated accession list
      scatter(taxon in read_tsv(select_first([taxid_to_ref_accessions_tsv]))) {
          # taxon = [taxid, isolate_prefix, taxname, colon_delim_accession_list]
          call utils.string_split {
              input:
                  joined_string = taxon[3],
                  delimiter = ":"
          }
          call ncbi.download_annotations {
              input:
                  accessions = string_split.tokens,
                  combined_out_prefix = sub(taxon[3], ":", "-")  # singularity does not like colons in filenames
          }
      }

      # subset reference genomes to those with ANI hits to contigs and cluster reference hits by any ANI similarity to each other
      call assembly.select_references {
          input:
              reference_genomes_fastas = download_annotations.combined_fasta,
              contigs_fasta = spades.contigs_fasta
      }

      # get taxid and taxname from taxid_to_ref_accessions_tsv
      scatter(top_match in select_references.top_matches_per_cluster_basenames) {
        call utils.fetch_row_from_tsv as tax_lookup {
          input:
              tsv = select_first([taxid_to_ref_accessions_tsv]),
              idx_col = "accessions",
              idx_val = sub(top_match, "-", ":"),
              add_header = ["taxid", "isolate_prefix", "taxname", "accessions"]
        }
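        # tax_lookup.map is the matched TSV row keyed by the add_header names above;
        # (assumption) its values are strings, which WDL coerces to Int/String below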
        Int skani_hit_taxid = tax_lookup.map["taxid"]
        String skani_hit_taxname = tax_lookup.map["taxname"]
      }
    }


    output {
        File   cleaned_reads_unaligned_bam     = deplete.bam_filtered_to_taxa
        File   deduplicated_reads_unaligned    = rmdup_ubam.dedup_bam
        File   contigs_fasta                   = spades.contigs_fasta
        
        Int    read_counts_raw                 = deplete.classified_taxonomic_filter_read_count_pre
        Int    read_counts_depleted            = deplete.classified_taxonomic_filter_read_count_post
        Int    read_counts_dedup               = rmdup_ubam.dedup_read_count_post
        Int    read_counts_prespades_subsample = spades.subsample_read_count
        
        File   kraken2_summary_report          = kraken2.kraken2_summary_report
        File   kraken2_reads_report            = kraken2.kraken2_reads_report
        File   kraken2_krona_plot              = kraken2.krona_report_html
        File   kraken2_top_taxa_report         = report_primary_kraken_taxa.ranked_focal_report
        String kraken2_focal_taxon_name        = report_primary_kraken_taxa.focal_tax_name
        Int    kraken2_focal_total_reads       = report_primary_kraken_taxa.total_focal_reads
        String kraken2_top_taxon_id            = report_primary_kraken_taxa.tax_id
        String kraken2_top_taxon_name          = report_primary_kraken_taxa.tax_name
        String kraken2_top_taxon_rank          = report_primary_kraken_taxa.tax_rank
        Int    kraken2_top_taxon_num_reads     = report_primary_kraken_taxa.num_reads
        Float  kraken2_top_taxon_pct_of_focal  = report_primary_kraken_taxa.percent_of_focal

        Int    skani_num_hits                  = length(select_first([select_references.top_matches_per_cluster_basenames, []]))
        File?  skani_contigs_to_refs_dist_tsv  = select_references.skani_dist_full_tsv
        Array[Int]? skani_hits_taxids          = skani_hit_taxid
        Array[String]? skani_hits_taxnames     = skani_hit_taxname

        File?   raw_fastqc                      = merge_raw_reads.fastqc
        File?   cleaned_fastqc                  = deplete.fastqc_html_report
        File?   spikein_report                  = spikein.report
        String? spikein_tophit                  = spikein.top_hit_id
        String? spikein_pct_of_total_reads      = spikein.pct_total_reads_mapped
        String? spikein_pct_lesser_hits         = spikein.pct_lesser_hits_of_mapped
        
        String kraken2_viral_classify_version  = kraken2.viralngs_version
        String deplete_viral_classify_version  = deplete.viralngs_version
        String spades_viral_assemble_version   = spades.viralngs_version
    }
}