WORKFLOW: genbank_single

File Path: pipes/WDL/workflows/genbank_single.wdl
WDL Version: 1.0
Type: workflow

Imports

| Namespace | Path |
| --- | --- |
| ncbi | ../tasks/tasks_ncbi.wdl |
| ncbi_tools | ../tasks/tasks_ncbi_tools.wdl |
| utils | ../tasks/tasks_utils.wdl |

Prepare assemblies for Genbank submission. This includes annotation by simple coordinate transfer from Genbank annotations and a multiple alignment. See https://viral-pipelines.readthedocs.io/en/latest/ncbi_submission.html for details.
Author: Broad Viral Genomics

Inputs

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| assembly_fasta | File | Genome to prepare for Genbank submission. All segments/chromosomes included in one file. Must contain exactly the same number of sequences as ref_accessions_colon_delim. | - |
| aligned_bam | File? | Normally required: aligned BAM file to inspect for reporting sequencing platform, read depth, etc. in GenBank structured comments. | - |
| ref_accessions_colon_delim | String | Reference genome Genbank accessions, each segment/chromosome in the exact same count and order as the segments/chromosomes described in assembly_fasta. List of accessions should be colon delimited. | - |
| biosample_accession | String | - | - |
| tax_id | Int | - | - |
| organism_name | String | - | - |
| email_address | String | Required for fetching data from NCBI APIs. | - |
| authors_sbt | File | - | - |
| custom_ref_fsas | Array[File]? | - | - |
| custom_ref_tbls | Array[File]? | - | - |
| biosample_attributes_json | String? | If provided, used in preference to biosample_attributes_tsv. | - |
| biosample_attributes_tsv | File? | A post-submission attributes file from NCBI BioSample, available at https://submit.ncbi.nlm.nih.gov/subs/ by clicking 'Download attributes file with BioSample accessions'. Read only if biosample_attributes_json is not provided; if neither is given, attributes are fetched from NCBI via biosample_accession. | - |
| taxdump_tgz | File | - | - |
| vadr_by_taxid_tsv | File | - | - |
| filter_to_ids | File? | - | - |
| isolate_prefix_override | String? | - | - |
| minlen | Int? | - | - |
| assembly_method | String | - | - |
| assembly_method_version | String | - | - |
| comment | String? | - | - |

19 optional inputs with default values

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| assembly_id | String | Unique identifier for this assembly. Defaults to the basename of assembly_fasta. table2asn requires this value to be <=50 characters; see: https://www.ncbi.nlm.nih.gov/genbank/table2asn/#fsa | basename(basename(basename(assembly_fasta,".fasta"),".fsa"),".fa") |
| docker | String | - | "quay.io/broadinstitute/ncbi-tools:2.11.1" |
| docker | String | - | "quay.io/broadinstitute/viral-classify:2.5.1.0" |
| docker | String | - | "quay.io/broadinstitute/viral-phylo:2.5.1.0" |
| biosample_col_for_fasta_headers | String | - | "sample_name" |
| src_to_attr_map | Map[String,String] | - | {} |
| sanitize_seq_ids | Boolean | - | true |
| docker | String | - | "python:slim" |
| machine_mem_gb | Int | - | 30 |
| docker | String | - | "quay.io/broadinstitute/viral-phylo:2.5.1.0" |
| docker | String | - | "mirror.gcr.io/staphb/vadr:1.6.4" |
| cpus | Int | - | 4 |
| is_genome_assembly | Boolean | - | true |
| sanitize_ids | Boolean | - | true |
| docker | String | - | "quay.io/broadinstitute/viral-core:2.5.1" |
| mol_type | String | - | "cRNA" |
| genetic_code | Int | - | 1 |
| machine_mem_gb | Int | - | 8 |
| docker | String | - | "quay.io/broadinstitute/viral-phylo:2.5.1.0" |

Outputs

| Name | Type | Expression |
| --- | --- | --- |
| genbank_mechanism | String | genbank_special_taxa.genbank_submission_mechanism |
| genbank_comment_file | File? | structured_comments_from_aligned_bam.structured_comment_file |
| genbank_source_table | File | biosample_to_genbank.genbank_source_modifier_table |
| genbank_isolate_name | String | biosample_to_genbank.isolate_name |
| annotation_tbl | File | feature_tbl |
| vadr_pass | Boolean? | vadr.pass |
| vadr_alerts | Array[String] | select_first([vadr.alerts, []]) |
| genbank_submission_sqn | File? | table2asn.genbank_submission_sqn |
| genbank_preview_file | File? | table2asn.genbank_preview_file |
| table2asn_val_file | File? | table2asn.genbank_validation_file |
| table2asn_errors | Array[String] | select_first([table2asn.table2asn_errors, []]) |
| table2asn_pass | Boolean? | table2asn.table2asn_passing |
| genbank_submit_files | Array[File] | submit_files |
| genbank_file_manifest | String | '{"submission_type": "~{genbank_special_taxa.genbank_submission_mechanism}", "validation_passing": ~{select_first([vadr.pass, true]) && select_first([table2asn.table2asn_passing, true])}, "files": ~{basename_list_json}}' |

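The last two outputs rely on WDL's handling of optional values: a step that sits inside an `if` block yields optional outputs, and `select_first` / `select_all` are used to fall back to a default or to drop undefined entries. The sketch below isolates those two expressions. It is an illustration only: the workflow name `outputs_demo` and its inputs are hypothetical, and Strings stand in for the File values so that it can be run without real files.

```wdl
version 1.0

# Hypothetical demo; only the two expressions are modelled on genbank_single.
workflow outputs_demo {
    input {
        # Stand-ins for vadr.pass and table2asn.table2asn_passing,
        # which are undefined when the corresponding call is skipped.
        Boolean? vadr_pass
        Boolean? table2asn_pass

        # Stand-ins for table2asn.genbank_submission_sqn and special_submit_files.
        String? sqn
        Array[String]? special_files
    }

    # A step that did not run counts as passing: select_first falls back to true.
    Boolean validation_passing = select_first([vadr_pass, true]) && select_first([table2asn_pass, true])

    # Inner select_all drops an undefined special_files array,
    # flatten merges the remaining arrays,
    # outer select_all drops an undefined sqn entry.
    Array[String] submit_files = select_all(flatten(select_all([[sqn], special_files])))

    output {
        Boolean passing = validation_passing
        Array[String] files = submit_files
    }
}
```

With no inputs supplied, `passing` should be true and `files` empty. Supplying `vadr_pass = false` makes `passing` false regardless of `table2asn_pass`. A consequence worth keeping in mind is that `validation_passing: true` in the manifest does not by itself show that both validators ran.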
Calls
This workflow calls the following tasks or subworkflows:

biosample_json_to_tsv → json_dict_to_tsv (runs if defined(biosample_attributes_json))
Input Mappings (2)

| Input | Value |
| --- | --- |
| json_data | select_first([biosample_attributes_json]) |
| out_basename | "biosample_attributes-~{biosample_accession}" |

fetch_biosamples (runs if !defined(biosample_attributes_tsv) && !defined(biosample_attributes_json))
Input Mappings (2)

| Input | Value |
| --- | --- |
| biosample_ids | [biosample_accession] |
| out_basename | "biosample_attributes-~{biosample_accession}" |

assembly_fsa → sanitize_fasta_headers
Input Mappings (2)

| Input | Value |
| --- | --- |
| in_fasta | assembly_fasta |
| out_filename | assembly_id + ".fsa" |

string_split
Input Mappings (2)

| Input | Value |
| --- | --- |
| joined_string | ref_accessions_colon_delim |
| delimiter | ":" |

download_annotations (scattered over segment_acc in string_split.tokens)
Input Mappings (3)

| Input | Value |
| --- | --- |
| accessions | [segment_acc] |
| emailAddress | email_address |
| combined_out_prefix | segment_acc |

biosample_to_genbank
Input Mappings (8)

| Input | Value |
| --- | --- |
| biosample_attributes | biosample_attributes |
| num_segments | length(string_split.tokens) |
| taxid | tax_id |
| organism_name_override | organism_name |
| sequence_id_override | assembly_id |
| filter_to_accession | biosample_accession |
| out_basename | assembly_id |
| source_overrides_json | genbank_special_taxa.genbank_source_overrides_json |

annot → align_and_annot_transfer_single (runs if !genbank_special_taxa.vadr_supported)
Input Mappings (4)

| Input | Value |
| --- | --- |
| genome_fasta | assembly_fsa.sanitized_fasta |
| reference_fastas | select_first([custom_ref_fsas, flatten(download_annotations.genomes_fasta)]) |
| reference_feature_tables | select_first([custom_ref_tbls, flatten(download_annotations.features_tbl)]) |
| out_basename | assembly_id |

vadr (runs if genbank_special_taxa.vadr_supported)
Input Mappings (7)

| Input | Value |
| --- | --- |
| genome_fasta | assembly_fsa.sanitized_fasta |
| maxlen | genbank_special_taxa.max_genome_length |
| vadr_opts | genbank_special_taxa.vadr_cli_options |
| vadr_model_tar | genbank_special_taxa.vadr_model_tar |
| vadr_model_tar_subdir | genbank_special_taxa.vadr_model_tar_subdir |
| mem_size | genbank_special_taxa.vadr_min_ram_gb |
| out_basename | assembly_id |

structured_comments_from_aligned_bam (runs if defined(aligned_bam))
Input Mappings (2)

| Input | Value |
| --- | --- |
| out_basename | assembly_id |
| aligned_bam | select_first([aligned_bam]) |

table2asn (runs if genbank_special_taxa.table2asn_allowed)
Input Mappings (7)

| Input | Value |
| --- | --- |
| assembly_fasta | assembly_fsa.sanitized_fasta |
| annotations_tbl | feature_tbl |
| source_modifier_table | biosample_to_genbank.genbank_source_modifier_table |
| structured_comment_file | structured_comments_from_aligned_bam.structured_comment_file |
| organism | organism_name |
| authors_sbt | authors_sbt |
| out_basename | assembly_id |

Images
Container images used by tasks in this workflow:
- python:slim. Used by 3 tasks: string_split, biosample_to_genbank, biosample_json_to_tsv
- Parameterized (configured via input: docker). Used by 3 tasks: download_annotations, annot, table2asn
- Parameterized (configured via input: docker). Used by 1 task: structured_comments_from_aligned_bam
- Parameterized (configured via input: docker). Four further entries, listed without task names.

flowchart TD
Start([genbank_single])
subgraph C1 ["↔️ if defined(biosample_attributes_json)"]
direction TB
N1["biosample_json_to_tsv (json_dict_to_tsv)"]
end
subgraph C2 ["↔️ if !defined(biosample_attributes_tsv) && !defined(biosample_attributes_json)"]
direction TB
N2["fetch_biosamples"]
end
N3["genbank_special_taxa"]
N4["assembly_fsa (sanitize_fasta_headers)"]
N5["string_split"]
subgraph S1 ["🔃 scatter segment_acc in string_split.tokens"]
direction TB
N6["download_annotations"]
end
N7["biosample_to_genbank"]
subgraph C3 ["↔️ if !genbank_special_taxa.vadr_supported"]
direction TB
N8["annot (align_and_annot_transfer_single)"]
end
subgraph C4 ["↔️ if genbank_special_taxa.vadr_supported"]
direction TB
N9["vadr"]
end
subgraph C5 ["↔️ if defined(aligned_bam)"]
direction TB
N10["structured_comments_from_aligned_bam"]
end
subgraph C6 ["↔️ if genbank_special_taxa.table2asn_allowed"]
direction TB
N11["table2asn"]
end
N5 --> N6
N1 --> N7
N5 --> N7
N2 --> N7
N3 --> N7
N4 --> N8
N3 --> N8
N6 --> N8
N4 --> N9
N3 --> N9
N4 --> N11
N7 --> N11
N8 --> N11
N10 --> N11
N3 --> N11
N9 --> N11
Start --> N1
Start --> N2
Start --> N3
Start --> N4
Start --> N5
Start --> N10
N11 --> End([End])
classDef taskNode fill:#a371f7,stroke:#8b5cf6,stroke-width:2px,color:#fff
classDef workflowNode fill:#58a6ff,stroke:#1f6feb,stroke-width:2px,color:#fff
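
The two annotation branches in the flowchart (C3 and C4) are complementary `if` blocks: exactly one of them runs, and the workflow then merges their optional results with `select_first`. The sketch below shows that pattern in isolation. The workflow name `branch_merge_demo`, its input, and the placeholder file names are hypothetical, and String declarations stand in for the task calls.

```wdl
version 1.0

# Hypothetical demo of the "two complementary if blocks, then select_first" pattern.
workflow branch_merge_demo {
    input {
        # Stand-in for genbank_special_taxa.vadr_supported.
        Boolean vadr_supported = false
    }

    if (vadr_supported) {
        # Stand-in for the feature table produced by the vadr call.
        String from_vadr = "from_vadr.tbl"
    }
    if (!vadr_supported) {
        # Stand-in for the feature table produced by the annot call.
        String from_transfer = "from_transfer.tbl"
    }

    # Outside the if blocks both values are optional (String?);
    # select_first returns the one that is defined.
    String feature_tbl = select_first([from_vadr, from_transfer])

    output {
        String chosen = feature_tbl
    }
}
```

Because the two conditions are exact negations of each other, one value is always defined and `select_first` cannot fail here. The same reasoning is what lets the real workflow declare `feature_tbl` as a non-optional `File`.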
version 1.0

import "../tasks/tasks_ncbi.wdl" as ncbi
import "../tasks/tasks_ncbi_tools.wdl" as ncbi_tools
import "../tasks/tasks_utils.wdl" as utils

workflow genbank_single {

    meta {
        description: "Prepare assemblies for Genbank submission. This includes annotation by simple coordinate transfer from Genbank annotations and a multiple alignment. See https://viral-pipelines.readthedocs.io/en/latest/ncbi_submission.html for details."
        author: "Broad Viral Genomics"
        email: "viral-ngs@broadinstitute.org"
        allowNestedInputs: true
    }

    input {
        File assembly_fasta
        String assembly_id = basename(basename(basename(assembly_fasta, ".fasta"), ".fsa"), ".fa")
        File? aligned_bam
        String ref_accessions_colon_delim
        String biosample_accession
        Int tax_id
        String organism_name
        String email_address # required for fetching data from NCBI APIs
        File authors_sbt
        Array[File]? custom_ref_fsas
        Array[File]? custom_ref_tbls
        String? biosample_attributes_json # if this is used, we will use this first
        File? biosample_attributes_tsv # if no json, we will read this tsv
        # if both are unspecified, we will fetch from NCBI via biosample_accession
    }

    parameter_meta {
        assembly_fasta: {
            description: "Genome to prepare for Genbank submission. All segments/chromosomes included in one file. Must contain exactly the same number of sequences as ref_accessions_colon_delim.",
            patterns: ["*.fasta"]
        }
        assembly_id: {
            description: "Unique identifier for this assembly. Defaults to the basename of assembly_fasta. table2asn requires this value to be <=50 characters; see: https://www.ncbi.nlm.nih.gov/genbank/table2asn/#fsa",
            patterns: ["^[A-Za-z0-9\-_\.:\*#]{1,50}$"]
        }
        ref_accessions_colon_delim: {
            description: "Reference genome Genbank accessions, each segment/chromosome in the exact same count and order as the segments/chromosomes described in assembly_fasta. List of accessions should be colon delimited.",
            patterns: ["*.fasta"]
        }
        aligned_bam: {
            description: "Normally required: aligned BAM file to inspect for reporting sequencing platform, read depth, etc. in GenBank structured comments.",
            patterns: ["*.bam","*.sam"]
        }
        biosample_attributes_tsv: {
            description: "A post-submission attributes file from NCBI BioSample, which is available at https://submit.ncbi.nlm.nih.gov/subs/ and clicking on 'Download attributes file with BioSample accessions'.",
            patterns: ["*.txt", "*.tsv"]
        }
    }

    # use biosample metadata from JSON or TSV if provided; otherwise fetch it from NCBI
    if(defined(biosample_attributes_json)) {
        call utils.json_dict_to_tsv as biosample_json_to_tsv {
            input:
                json_data = select_first([biosample_attributes_json]),
                out_basename = "biosample_attributes-~{biosample_accession}"
        }
    }
    if(!defined(biosample_attributes_tsv) && !defined(biosample_attributes_json)) {
        call ncbi_tools.fetch_biosamples {
            input:
                biosample_ids = [biosample_accession],
                out_basename = "biosample_attributes-~{biosample_accession}"
        }
    }
    File biosample_attributes = select_first([biosample_json_to_tsv.tsv, biosample_attributes_tsv, fetch_biosamples.biosample_attributes_tsv])

    # Is this a special virus that NCBI handles differently?
    call ncbi.genbank_special_taxa {
        input:
            taxid = tax_id
    }

    # Rename fasta and sanitize ids of special characters
    call utils.sanitize_fasta_headers as assembly_fsa {
        input:
            in_fasta = assembly_fasta,
            out_filename = assembly_id + ".fsa"
    }

    # Annotate genes
    ## fetch reference genome sequences and annotations
    call utils.string_split {
        input:
            joined_string = ref_accessions_colon_delim,
            delimiter = ":"
    }
    scatter(segment_acc in string_split.tokens) {
        ## scatter these calls in order to preserve original order
        call ncbi.download_annotations {
            input:
                accessions = [segment_acc],
                emailAddress = email_address,
                combined_out_prefix = segment_acc
        }
    }

    # create genbank source modifier table from biosample metadata
    call ncbi.biosample_to_genbank {
        input:
            biosample_attributes = biosample_attributes,
            num_segments = length(string_split.tokens),
            taxid = tax_id,
            organism_name_override = organism_name,
            sequence_id_override = assembly_id,
            filter_to_accession = biosample_accession,
            out_basename = assembly_id,
            source_overrides_json = genbank_special_taxa.genbank_source_overrides_json
    }

    ## annotate genes, either by VADR or by naive coordinate liftover
    if(!genbank_special_taxa.vadr_supported) {
        call ncbi.align_and_annot_transfer_single as annot {
            input:
                genome_fasta = assembly_fsa.sanitized_fasta,
                reference_fastas = select_first([custom_ref_fsas, flatten(download_annotations.genomes_fasta)]),
                reference_feature_tables = select_first([custom_ref_tbls, flatten(download_annotations.features_tbl)]),
                out_basename = assembly_id
        }
    }
    if(genbank_special_taxa.vadr_supported) {
        call ncbi.vadr {
            input:
                genome_fasta = assembly_fsa.sanitized_fasta,
                maxlen = genbank_special_taxa.max_genome_length,
                vadr_opts = genbank_special_taxa.vadr_cli_options,
                vadr_model_tar = genbank_special_taxa.vadr_model_tar,
                vadr_model_tar_subdir = genbank_special_taxa.vadr_model_tar_subdir,
                mem_size = genbank_special_taxa.vadr_min_ram_gb,
                out_basename = assembly_id
        }
    }
    File feature_tbl = select_first([vadr.feature_tbl, annot.feature_tbl])

    if(defined(aligned_bam)) {
        call ncbi.structured_comments_from_aligned_bam {
            input:
                out_basename = assembly_id,
                aligned_bam = select_first([aligned_bam])
        }
    }

    if(genbank_special_taxa.table2asn_allowed) {
        call ncbi.table2asn {
            input:
                assembly_fasta = assembly_fsa.sanitized_fasta,
                annotations_tbl = feature_tbl,
                source_modifier_table = biosample_to_genbank.genbank_source_modifier_table,
                structured_comment_file = structured_comments_from_aligned_bam.structured_comment_file,
                organism = organism_name,
                authors_sbt = authors_sbt,
                out_basename = assembly_id
        }
    }
    if(!genbank_special_taxa.table2asn_allowed) {
        Array[File] special_submit_files = select_all([assembly_fsa.sanitized_fasta,
                                                       structured_comments_from_aligned_bam.structured_comment_file,
                                                       biosample_to_genbank.genbank_source_modifier_table])
        String special_basename_list = '["~{assembly_id}.fsa", "~{assembly_id}.cmt", "~{assembly_id}.src"]'
    }
    String basename_list_json = select_first([special_basename_list, '["~{assembly_id}.sqn"]'])
    scatter(submit_file in select_all(flatten(select_all([[table2asn.genbank_submission_sqn], special_submit_files])))) {
        File submit_files = submit_file
    }

    output {
        String genbank_mechanism = genbank_special_taxa.genbank_submission_mechanism
        File? genbank_comment_file = structured_comments_from_aligned_bam.structured_comment_file
        File genbank_source_table = biosample_to_genbank.genbank_source_modifier_table
        String genbank_isolate_name = biosample_to_genbank.isolate_name
        File annotation_tbl = feature_tbl
        Boolean? vadr_pass = vadr.pass
        Array[String] vadr_alerts = select_first([vadr.alerts, []])
        File? genbank_submission_sqn = table2asn.genbank_submission_sqn
        File? genbank_preview_file = table2asn.genbank_preview_file
        File? table2asn_val_file = table2asn.genbank_validation_file
        Array[String] table2asn_errors = select_first([table2asn.table2asn_errors, []])
        Boolean? table2asn_pass = table2asn.table2asn_passing
        Array[File] genbank_submit_files = submit_files
        String genbank_file_manifest = '{"submission_type": "~{genbank_special_taxa.genbank_submission_mechanism}", "validation_passing": ~{select_first([vadr.pass, true]) && select_first([table2asn.table2asn_passing, true])}, "files": ~{basename_list_json}}'
    }
}
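
The default for `assembly_id` in the listing above nests three `basename` calls so that whichever of the suffixes `.fasta`, `.fsa` or `.fa` is present gets stripped. A minimal sketch of that expression on its own follows. The workflow name `assembly_id_default_demo` and the sample path are hypothetical, and a String stands in for the File input so that no real file is needed.

```wdl
version 1.0

# Hypothetical demo; only the nested basename expression is taken from genbank_single.
workflow assembly_id_default_demo {
    input {
        # A String stands in for the File input of the real workflow.
        String assembly_fasta = "/data/run1/sample1.fasta"
    }

    # The innermost call drops the directory and a trailing ".fasta";
    # the outer calls strip ".fsa" or ".fa" if present and otherwise leave the name unchanged.
    String assembly_id = basename(basename(basename(assembly_fasta, ".fasta"), ".fsa"), ".fa")

    output {
        String id = assembly_id
    }
}
```

For the hypothetical path above, `id` is `sample1`, and the same holds for `sample1.fsa` or `sample1.fa`. A file with any other extension (for example `sample1.fna`) would keep that extension in the default, so in that case, or when the file name exceeds the 50-character limit noted in `parameter_meta`, it is safer to pass `assembly_id` explicitly.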