assemble_denovo

WORKFLOW assemble_denovo

File Path pipes/WDL/workflows/assemble_denovo.wdl
WDL Version 1.0
Type workflow

Imports

Namespace Path
taxon_filter ../tasks/tasks_taxon_filter.wdl
read_utils ../tasks/tasks_read_utils.wdl
assembly ../tasks/tasks_assembly.wdl
ncbi ../tasks/tasks_ncbi.wdl
assemble_refbased assemble_refbased.wdl

Workflow: assemble_denovo

Assisted de novo viral genome assembly from raw reads.

Author: Broad Viral Genomics
viral-ngs@broadinstitute.org

Inputs

Name Type Description Default
reads_unmapped_bams Array[File]+ unaligned reads in BAM format -
reference_genome_fasta Array[File]+ After de novo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromosomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing. -
filter_to_taxon_db File? Optional database to use to filter read set to those that match by LASTAL. Sequences in fasta format will be indexed on the fly. -
trim_clip_db File - -
sample_original_name String? a (possibly filename-unfriendly) sample name for fasta and bam headers -
reheader_table File? - -
query_chunk_size Int? - -
sample_name String? - -
reheader_table File? - -
sample_name String? - -
reheader_table File? - -
sample_name String? - -
reheader_table File? - -
spades_min_contig_len Int? - -
spades_options String? - -
machine_mem_gb Int? - -
min_length_fraction Float? - -
min_unambig Float? - -
skani_m Int? - -
skani_s Int? - -
skani_c Int? - -
nucmer_max_gap Int? - -
nucmer_min_match Int? - -
nucmer_min_cluster Int? - -
scaffold_min_contig_len Int? - -
scaffold_min_pct_contig_aligned Float? - -
machine_mem_gb Int? - -
sample_original_name String? a (possibly filename-unfriendly) sample name for fasta and bam headers -
novocraft_license File? - -
trim_coords_bed File? - -
machine_mem_gb Int? - -
min_keep_length Int? - -
sliding_window Int? - -
primer_offset Int? - -
machine_mem_gb Int? - -
reheader_table File? - -
amplicon_set String? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
machine_mem_gb Int? - -
reheader_table File? - -
max_coverage_depth Int? - -
base_q_threshold Int? - -
mapping_q_threshold Int? - -
read_length_threshold Int? - -
plotXLimits String? - -
plotYLimits String? - -
92 optional inputs with default values
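
As a rough illustration of how the required inputs above might be supplied, the sketch below builds an inputs dictionary keyed by fully qualified names (workflow name, a dot, then the input name), which is the convention WDL execution engines use for inputs JSON. The helper name `build_inputs` and all file paths are hypothetical; only the input names come from the table above.

```python
import json

WORKFLOW = "assemble_denovo"


def build_inputs(reads_unmapped_bams, reference_genome_fasta, trim_clip_db,
                 sample_original_name=None, filter_to_taxon_db=None):
    """Return an inputs dict keyed by fully qualified input names (hypothetical helper)."""
    # Array[File]+ means the array must contain at least one element.
    if not reads_unmapped_bams:
        raise ValueError("reads_unmapped_bams must contain at least one BAM")
    if not reference_genome_fasta:
        raise ValueError("reference_genome_fasta must contain at least one FASTA")

    inputs = {
        f"{WORKFLOW}.reads_unmapped_bams": list(reads_unmapped_bams),
        f"{WORKFLOW}.reference_genome_fasta": list(reference_genome_fasta),
        f"{WORKFLOW}.trim_clip_db": trim_clip_db,
    }
    # Optional inputs are simply omitted when not supplied.
    if sample_original_name is not None:
        inputs[f"{WORKFLOW}.sample_original_name"] = sample_original_name
    if filter_to_taxon_db is not None:
        inputs[f"{WORKFLOW}.filter_to_taxon_db"] = filter_to_taxon_db
    return inputs


if __name__ == "__main__":
    # All paths below are placeholders for illustration only.
    example = build_inputs(
        ["sample1.cleaned.bam"],
        ["reference.fasta"],
        "trim_clip_db.fasta",
    )
    print(json.dumps(example, indent=2))
```

The printed JSON could then be saved to a file and passed to whichever WDL runner is in use.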

Outputs

Name Type Expression
final_assembly_fasta File select_first([rename_fasta_header.renamed_fasta, refine.assembly_fasta])
aligned_only_reads_bam File refine.align_to_self_merged_aligned_only_bam
coverage_plot File refine.align_to_self_merged_coverage_plot
assembly_length Int refine.assembly_length
assembly_length_unambiguous Int refine.assembly_length_unambiguous
reads_aligned Int refine.align_to_self_merged_reads_aligned
mean_coverage Float refine.align_to_self_merged_mean_coverage
cleaned_bam File merge_cleaned_reads.out_bam
cleaned_fastqc File? merge_cleaned_reads.fastqc
depletion_read_count_post Int merge_cleaned_reads.read_count
taxfilt_bam File merge_taxfilt_reads.out_bam
taxfilt_fastqc File? merge_taxfilt_reads.fastqc
filter_read_count_post Int merge_taxfilt_reads.read_count
dedup_bam File merge_dedup_reads.out_bam
dedup_fastqc File? merge_dedup_reads.fastqc
dedup_read_count_post Int merge_dedup_reads.read_count
contigs_fasta File assemble.contigs_fasta
subsampBam File assemble.subsampBam
subsample_read_count Int assemble.subsample_read_count
scaffold_fasta File scaffold.scaffold_fasta
intermediate_scaffold_fasta File scaffold.intermediate_scaffold_fasta
intermediate_gapfill_fasta File scaffold.intermediate_gapfill_fasta
assembly_preimpute_length Int scaffold.assembly_preimpute_length
assembly_preimpute_length_unambiguous Int scaffold.assembly_preimpute_length_unambiguous
scaffolding_chosen_ref_names Array[String] scaffold.scaffolding_chosen_ref_names
scaffolding_stats File scaffold.scaffolding_stats
scaffolding_alt_contigs File scaffold.scaffolding_alt_contigs
replicate_concordant_sites Int refine.replicate_concordant_sites
replicate_discordant_snps Int refine.replicate_discordant_snps
replicate_discordant_indels Int refine.replicate_discordant_indels
num_read_groups Int refine.num_read_groups
num_libraries Int refine.num_libraries
replicate_discordant_vcf File refine.replicate_discordant_vcf
isnvs_vcf File refine.align_to_self_isnvs_vcf
aligned_bam File refine.align_to_self_merged_aligned_only_bam
aligned_only_reads_fastqc File refine.align_to_ref_fastqc
coverage_tsv File refine.align_to_self_merged_coverage_tsv
read_pairs_aligned Int refine.align_to_self_merged_read_pairs_aligned
bases_aligned Int refine.align_to_self_merged_bases_aligned
assembly_method String "viral-ngs/assemble_denovo"
assemble_viral_assemble_version String assemble.viralngs_version
scaffold_viral_assemble_version String scaffold.viralngs_version
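
The `final_assembly_fasta` expression above relies on WDL's `select_first`, which returns the first defined value in an array and fails if none is defined. The following Python sketch models that behaviour, with `None` standing in for an undefined WDL value; the function names are illustrative and are not part of the workflow.

```python
def select_first(values):
    """Model of WDL select_first: first non-None value, error if there is none."""
    for value in values:
        if value is not None:
            return value
    raise ValueError("select_first: no defined value in array")


def final_assembly_fasta(renamed_fasta, refined_fasta):
    """Prefer the renamed FASTA when rename_fasta_header ran, else the refined assembly."""
    return select_first([renamed_fasta, refined_fasta])
```

When `sample_original_name` is not supplied, `rename_fasta_header` does not run, so its output is undefined and the refined assembly is used instead.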

Calls

This workflow calls the following tasks or subworkflows:

CALL TASKS renamed_reads → merge_and_reheader_bams

Input Mappings (3)
Input Value
in_bams [reads_unmapped_bam]
sample_name sample_original_name
out_basename out_basename

CALL TASKS deplete_taxa

Input Mappings (4)
Input Value
raw_reads_unmapped_bam reads_unmapped_renamed_bams
bmtaggerDbs deplete_bmtaggerDbs
blastDbs deplete_blastDbs
bwaDbs deplete_bwaDbs

CALL TASKS filter_to_taxon

Input Mappings (2)
Input Value
reads_unmapped_bam reads_depleted_bams
lastal_db_fasta select_first([filter_to_taxon_db])

CALL TASKS rmdup_ubam

Input Mappings (1)
Input Value
reads_unmapped_bam reads_taxfilt_bams

CALL TASKS merge_dedup_reads → merge_and_reheader_bams

Input Mappings (2)
Input Value
in_bams rmdup_ubam.dedup_bam
out_basename out_basename

CALL TASKS merge_cleaned_reads → merge_and_reheader_bams

Input Mappings (2)
Input Value
in_bams reads_depleted_bams
out_basename out_basename

CALL TASKS merge_taxfilt_reads → merge_and_reheader_bams

Input Mappings (2)
Input Value
in_bams reads_taxfilt_bams
out_basename out_basename

CALL TASKS assemble

Input Mappings (4)
Input Value
reads_unmapped_bam merge_dedup_reads.out_bam
trim_clip_db trim_clip_db
always_succeed true
sample_name out_basename

CALL TASKS scaffold

Input Mappings (3)
Input Value
contigs_fasta assemble.contigs_fasta
reads_bam merge_dedup_reads.out_bam
reference_genome_fasta reference_genome_fasta

CALL WORKFLOW refine → assemble_refbased

Input Mappings (3)
Input Value
reads_unmapped_bams reads_depleted_bams
reference_fasta scaffold.scaffold_fasta
sample_name out_basename

CALL TASKS rename_fasta_header

Input Mappings (2)
Input Value
genome_fasta refine.assembly_fasta
new_name select_first([sample_original_name])

Images

Container images used by tasks in this workflow:

Parameterized image (configured via input: docker)

Used by 6 tasks:
  • merge_dedup_reads
  • merge_cleaned_reads
  • merge_taxfilt_reads
  • rmdup_ubam
  • rename_fasta_header
  • renamed_reads

Parameterized image (configured via input: docker)

Used by 2 tasks:
  • assemble
  • scaffold

Parameterized image (configured via input: docker)

Used by 2 tasks:
  • deplete_taxa
  • filter_to_taxon

assemble_denovo - WDL Source Code

version 1.0

import "../tasks/tasks_taxon_filter.wdl" as taxon_filter
import "../tasks/tasks_read_utils.wdl" as read_utils
import "../tasks/tasks_assembly.wdl" as assembly
import "../tasks/tasks_ncbi.wdl" as ncbi
import "assemble_refbased.wdl" as assemble_refbased

workflow assemble_denovo {

  meta {
      description: "Assisted de novo viral genome assembly from raw reads."
      author: "Broad Viral Genomics"
      email:  "viral-ngs@broadinstitute.org"
      allowNestedInputs: true
  }

  input {
    Array[File]+ reads_unmapped_bams

    Array[File]+ reference_genome_fasta

    Array[File]  deplete_bmtaggerDbs = []
    Array[File]  deplete_blastDbs = []
    Array[File]  deplete_bwaDbs = []

    File?        filter_to_taxon_db
    File         trim_clip_db

    String       out_basename = basename(basename(reads_unmapped_bams[0], ".bam"), ".cleaned")
    String?      sample_original_name
  }

  parameter_meta {
    reads_unmapped_bams: { description: "unaligned reads in BAM format", patterns: ["*.bam"] }
    deplete_bmtaggerDbs: {
       description: "Optional list of databases to use for bmtagger-based depletion. Sequences in fasta format will be indexed on the fly; pre-bmtagger-indexed databases may be provided as tarballs.",
       patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_blastDbs: {
      description: "Optional list of databases to use for blastn-based depletion. Sequences in fasta format will be indexed on the fly; pre-blast-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    deplete_bwaDbs: {
      description: "Optional list of databases to use for bwa mem-based depletion. Sequences in fasta format will be indexed on the fly; pre-bwa-indexed databases may be provided as tarballs.",
      patterns: ["*.fasta", "*.fasta.gz", "*.tar.gz", "*.tar.lz4", "*.tar.bz2", "*.tar.zst"]
    }
    filter_to_taxon_db: {
      description: "Optional database to use to filter read set to those that match by LASTAL. Sequences in fasta format will be indexed on the fly.",
      patterns: ["*.fasta"]
    }
    reference_genome_fasta: {
      description: "After de novo assembly, large contigs are scaffolded against a reference genome to determine orientation and to join contigs together, before further polishing by reads. You must supply at least one reference genome (all segments/chromosomes in a single fasta file). If more than one reference is provided, contigs will be scaffolded against all of them and the one with the most complete assembly will be chosen for downstream polishing.",
      patterns: ["*.fasta"]
    }
    out_basename: { description: "a filename-friendly basename for output files" }
    sample_original_name: { description: "a (possibly filename-unfriendly) sample name for fasta and bam headers" }
  }

  # parallelize across provided input read files
  scatter(reads_unmapped_bam in reads_unmapped_bams) {

    # rename SM value in bam header if requested
    if(defined(sample_original_name)) {
      call read_utils.merge_and_reheader_bams as renamed_reads {
          input:
              in_bams      = [reads_unmapped_bam],
              sample_name  = sample_original_name,
              out_basename = out_basename
      }
    }
    File reads_unmapped_renamed_bams = select_first([renamed_reads.out_bam, reads_unmapped_bam])

    # deplete host if requested
    if(length(deplete_bmtaggerDbs) + length(deplete_blastDbs) + length(deplete_bwaDbs) > 0) {
      call taxon_filter.deplete_taxa {
        input:
          raw_reads_unmapped_bam = reads_unmapped_renamed_bams,
          bmtaggerDbs            = deplete_bmtaggerDbs,
          blastDbs               = deplete_blastDbs,
          bwaDbs                 = deplete_bwaDbs
      }
    }
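    # fall back to the (possibly renamed) reads so the SM rename is not dropped when depletion is skipped
    File reads_depleted_bams = select_first([deplete_taxa.cleaned_bam, reads_unmapped_renamed_bams])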

    # keep only reads matching the target taxon database, if one was provided
    if(defined(filter_to_taxon_db)) {
      call taxon_filter.filter_to_taxon {
        input:
          reads_unmapped_bam = reads_depleted_bams,
          lastal_db_fasta    = select_first([filter_to_taxon_db])
      }
    }
    File reads_taxfilt_bams = select_first([filter_to_taxon.taxfilt_bam, reads_depleted_bams])

    # alignment-free PCR duplicate removal
    call read_utils.rmdup_ubam {
      input:
        reads_unmapped_bam = reads_taxfilt_bams
    }
  }

  # merge the per-input reads from each stage (deduplicated, depleted, taxon-filtered) into one file per stage
  call read_utils.merge_and_reheader_bams as merge_dedup_reads {
      input:
          in_bams      = rmdup_ubam.dedup_bam,
          out_basename = out_basename
  }
  call read_utils.merge_and_reheader_bams as merge_cleaned_reads {
      input:
          in_bams      = reads_depleted_bams,
          out_basename = out_basename
  }
  call read_utils.merge_and_reheader_bams as merge_taxfilt_reads {
      input:
          in_bams      = reads_taxfilt_bams,
          out_basename = out_basename
  }

  # denovo assembly pipeline below
  call assembly.assemble {
    input:
      reads_unmapped_bam = merge_dedup_reads.out_bam,
      trim_clip_db       = trim_clip_db,
      always_succeed     = true,
      sample_name        = out_basename
  }

  call assembly.scaffold {
    input:
      contigs_fasta           = assemble.contigs_fasta,
      reads_bam               = merge_dedup_reads.out_bam,
      reference_genome_fasta  = reference_genome_fasta
  }

  call assemble_refbased.assemble_refbased as refine {
      input:
          reads_unmapped_bams = reads_depleted_bams, # assemble_refbased will scatter on individual bams
          reference_fasta     = scaffold.scaffold_fasta,
          sample_name         = out_basename
  }

  if (defined(sample_original_name)) {
    call ncbi.rename_fasta_header {
      input:
        genome_fasta = refine.assembly_fasta,
        new_name     = select_first([sample_original_name])
    }
  }

  output {
    File    final_assembly_fasta                  = select_first([rename_fasta_header.renamed_fasta, refine.assembly_fasta])
    File    aligned_only_reads_bam                = refine.align_to_self_merged_aligned_only_bam
    File    coverage_plot                         = refine.align_to_self_merged_coverage_plot
    Int     assembly_length                       = refine.assembly_length
    Int     assembly_length_unambiguous           = refine.assembly_length_unambiguous
    Int     reads_aligned                         = refine.align_to_self_merged_reads_aligned
    Float   mean_coverage                         = refine.align_to_self_merged_mean_coverage
    
    File    cleaned_bam                           = merge_cleaned_reads.out_bam
    File?   cleaned_fastqc                        = merge_cleaned_reads.fastqc
    Int     depletion_read_count_post             = merge_cleaned_reads.read_count
    
    File    taxfilt_bam                           = merge_taxfilt_reads.out_bam
    File?   taxfilt_fastqc                        = merge_taxfilt_reads.fastqc
    Int     filter_read_count_post                = merge_taxfilt_reads.read_count
    
    File    dedup_bam                             = merge_dedup_reads.out_bam
    File?   dedup_fastqc                          = merge_dedup_reads.fastqc
    Int     dedup_read_count_post                 = merge_dedup_reads.read_count
    
    File    contigs_fasta                         = assemble.contigs_fasta
    File    subsampBam                            = assemble.subsampBam
    Int     subsample_read_count                  = assemble.subsample_read_count
    
    File    scaffold_fasta                        = scaffold.scaffold_fasta
    File    intermediate_scaffold_fasta           = scaffold.intermediate_scaffold_fasta
    File    intermediate_gapfill_fasta            = scaffold.intermediate_gapfill_fasta
    Int     assembly_preimpute_length             = scaffold.assembly_preimpute_length
    Int     assembly_preimpute_length_unambiguous = scaffold.assembly_preimpute_length_unambiguous
    Array[String]  scaffolding_chosen_ref_names   = scaffold.scaffolding_chosen_ref_names
    File    scaffolding_stats                     = scaffold.scaffolding_stats
    File    scaffolding_alt_contigs               = scaffold.scaffolding_alt_contigs

    Int     replicate_concordant_sites            = refine.replicate_concordant_sites
    Int     replicate_discordant_snps             = refine.replicate_discordant_snps
    Int     replicate_discordant_indels           = refine.replicate_discordant_indels
    Int     num_read_groups                       = refine.num_read_groups
    Int     num_libraries                         = refine.num_libraries
    File    replicate_discordant_vcf              = refine.replicate_discordant_vcf

    File    isnvs_vcf                             = refine.align_to_self_isnvs_vcf
    
    File    aligned_bam                           = refine.align_to_self_merged_aligned_only_bam
    File    aligned_only_reads_fastqc             = refine.align_to_ref_fastqc
    File    coverage_tsv                          = refine.align_to_self_merged_coverage_tsv
    Int     read_pairs_aligned                    = refine.align_to_self_merged_read_pairs_aligned
    Int     bases_aligned                         = refine.align_to_self_merged_bases_aligned
    
    String  assembly_method                       = "viral-ngs/assemble_denovo"
    String  assemble_viral_assemble_version       = assemble.viralngs_version
    String  scaffold_viral_assemble_version       = scaffold.viralngs_version
  }
}
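
The default for `out_basename` in the source above nests two `basename` calls: the inner one drops the directory and a trailing `.bam` from the first input file, and the outer one drops a trailing `.cleaned` if present. The Python sketch below models that behaviour under the assumption that WDL's `basename(path, suffix)` strips the suffix only when the file name actually ends with it; the function names and example paths are illustrative.

```python
import posixpath


def wdl_basename(path, suffix=None):
    """Model of WDL basename: drop the directory part, then an optional trailing suffix."""
    name = posixpath.basename(path)
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def default_out_basename(reads_unmapped_bams):
    """Mirror basename(basename(reads_unmapped_bams[0], ".bam"), ".cleaned")."""
    first = reads_unmapped_bams[0]
    return wdl_basename(wdl_basename(first, ".bam"), ".cleaned")
```

Only the first BAM in the array determines the default, so runs with several input files may want to set `out_basename` explicitly.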