SvAnna:
Efficient and accurate pathogenicity prediction for coding and regulatory structural variants in long-read genome sequencing.
SvAnna performs phenotype-driven prioritization of structural variants in VCF files, focusing specifically on long-read WGS analysis of germline variants.
Quickstart
This document is intended for impatient users who want to quickly set up SvAnna and prioritize variants.
Prerequisites
SvAnna is written in Java 11 and needs Java 11+ to be present in the runtime environment. Please verify that you are using Java 11+ by running:
$ java -version
If java is present on your $PATH, then the command above will print a message similar to this one:
openjdk version "11" 2018-09-25
OpenJDK Runtime Environment 18.9 (build 11+28)
OpenJDK 64-Bit Server VM 18.9 (build 11+28, mixed mode)
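The version check can also be scripted. The following is a minimal sketch, not part of SvAnna: the function names are hypothetical, and it assumes the banner contains a quoted version string such as the one shown above.

```python
import re
import subprocess
from typing import Optional


def java_major_version(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, if any."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy version strings such as "1.8.0_292" denote Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_output() -> str:
    """Run `java -version`; the JVM prints the version banner to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr


def is_java_sufficient(version_output: str, minimum: int = 11) -> bool:
    """True if the reported major version is at least `minimum`."""
    major = java_major_version(version_output)
    return major is not None and major >= minimum
```

Calling `is_java_sufficient(installed_java_output())` then reports whether the local runtime satisfies the Java 11+ requirement.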
Setup
SvAnna is installed by running the following two steps.
1. Download SvAnna distribution ZIP
Download and extract SvAnna distribution ZIP archive from here.
Expand the Assets menu and download the svanna-cli-${project.version}-distribution.zip. Choose the latest stable version, or a release candidate (RC).
After unzipping the distribution archive, run the following command to display the help message:
$ java -jar svanna-cli-${project.version}.jar --help
Note
If things went OK, the command above will print the following help message:
Structural variant prioritization
Usage: svanna-cli.jar [-hV] [COMMAND]
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
prioritize Prioritize the variants.
See the full documentation at `https://svanna.readthedocs.io/en/master`
2. Download SvAnna database files
SvAnna database files are available for download in the Downloads section.
After the download, unzip the archive(s) content into a folder of your choice and note down the path:
$ unzip -d svanna-data *.svanna.zip
Prioritize structural variants in VCF file
Let’s annotate a toy VCF file containing eight SVs reported in the SvAnna manuscript. First, let’s download the VCF file from here:
$ wget https://raw.githubusercontent.com/TheJacksonLaboratory/SvAnna/master/svanna-cli/src/examples/example.vcf
The variants were sourced from published clinical case reports, and each variant causes a Mendelian disease.
For the purpose of this test run, let’s assume that the VCF file contains SVs identified in a short/long read sequencing run of a patient presenting with the following clinical symptoms:
HP:0011890 - Prolonged bleeding following procedure
HP:0000978 - Bruising susceptibility
HP:0012147 - Reduced quantity of Von Willebrand factor
Now, let’s prioritize the variants:
$ java -jar svanna-cli-${project.version}.jar prioritize -d svanna-data --output-format html,csv,vcf --vcf example.vcf --phenotype-term HP:0011890 --phenotype-term HP:0000978 --phenotype-term HP:0012147
The variant Othman-2010-20696945-VWF-index-FigS7 disrupts a promoter of the von Willebrand factor (VWF) gene (Othman et al., 2010).
The variant receives the highest \(PSV\) score of 47.26, and it is ranked first.
SvAnna stores prioritization results in HTML, CSV, and VCF output formats in the current working directory.
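When the HPO terms come from a list rather than being typed by hand, the command line can be assembled programmatically. This is a minimal sketch that uses only the options shown in the command above; the helper name `build_prioritize_command` and the JAR path passed to it are hypothetical.

```python
from typing import List, Sequence


def build_prioritize_command(
    jar: str,
    data_dir: str,
    vcf: str,
    hpo_terms: Sequence[str],
    output_formats: Sequence[str] = ("html",),
) -> List[str]:
    """Assemble the argument list for the `prioritize` command."""
    if not hpo_terms:
        raise ValueError("at least one HPO term is needed alongside --vcf")
    command = ["java", "-jar", jar, "prioritize", "-d", data_dir, "--vcf", vcf]
    for term in hpo_terms:
        # The option is repeated once per term, as in the command above.
        command += ["--phenotype-term", term]
    # Output formats are passed as a single comma-separated value.
    command += ["--output-format", ",".join(output_formats)]
    return command
```

The returned list can be handed to `subprocess.run(command, check=True)`; passing a list avoids shell quoting issues.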
Setting up SvAnna
SvAnna is a desktop Java application that requires several external files to run. This document explains how to download the external files and how to prepare SvAnna for running in the local system.
Note
SvAnna is written with Java version 11 and will run and compile under Java 11+.
Installation
To install SvAnna, you need the SvAnna distribution ZIP archive, which contains the executable JAR file, and the SvAnna database files.
Prebuilt SvAnna executable
To download the executable SvAnna JAR file, go to the Releases section on the SvAnna GitHub page and download the latest SvAnna ZIP archive.
SvAnna database files
SvAnna database files are available for download in the Downloads section.
After the download, unzip the archive and put SvAnna database files into a folder of your choice:
$ unzip -d svanna-data *.svanna.zip
Note
From now on, we will use svanna-data instead of spelling out the full path to SvAnna database files.
Build SvAnna from source
As an alternative to using the prebuilt SvAnna JAR file, the JAR file can also be built from the Java sources.
SvAnna was written with Java version 11. Git and Java Development Kit version 11 or newer are required for the build.
Run the following commands to download SvAnna source code from GitHub repository and to build SvAnna JAR file:
$ git clone https://github.com/TheJacksonLaboratory/SvAnna
$ cd SvAnna
$ ./mvnw package
After the build, the JAR file is located at svanna-cli/target/svanna-cli-${project.version}.jar:
$ java -jar svanna-cli/target/svanna-cli-${project.version}.jar --help
Note
From now on, we will use svanna-cli.jar instead of spelling out the full path to the JAR file within your environment.
Run SvAnna
SvAnna is a command-line Java tool that runs with Java version 11 or higher.
In the examples below, we assume that svanna-cli.jar points to the executable JAR file and svanna-data points to the data directory we created in the Setting up SvAnna section.
Prioritization of structural variants
SvAnna provides the prioritize command for performing phenotype-driven prioritization of structural variants (SVs) stored in VCF format. The prioritized variants are stored in one or more Output formats.
To prioritize variants in the example.vcf file (an example VCF file with 8 variants stored in SvAnna repository), run:
$ java -jar svanna-cli.jar prioritize -d svanna-data --vcf example.vcf --phenotype-term HP:0011890 --phenotype-term HP:0000978 --phenotype-term HP:0012147 --out-dir results --prefix example
After the run, the results are stored at results/example.html.
Mandatory arguments
All CLI arguments for the prioritize command are supplied as options (no positional parameters).
There is one mandatory option:
-d | --data-directory - path to SvAnna data directory.
Analysis input
The input data can be specified in two ways: either as a path to a VCF file along with one or more HPO terms, or as a phenopacket:
-p | --phenopacket - path to a phenopacket file. We support v1 and v2 schemas and the file can be in JSON, YAML, or protobuf binary format.
-t | --phenotype-term - HPO term describing clinical condition of the proband, may be specified multiple times (e.g. --phenotype-term HP:1234567 --phenotype-term HP:9876543).
--vcf - path to the input VCF file.
Note
If a path to a VCF file is provided both in the phenopacket and via the --vcf option, the --vcf option takes precedence.
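The precedence rule and a basic sanity check of the HPO term identifiers can be sketched as follows. This is an illustration only, not SvAnna code: both function names are hypothetical, and the check relies on HPO identifiers having the form `HP:` followed by seven digits.

```python
import re
from typing import List, Optional, Sequence

# HPO identifiers consist of the "HP:" prefix and seven digits.
HPO_ID = re.compile(r"HP:\d{7}")


def invalid_hpo_terms(terms: Sequence[str]) -> List[str]:
    """Return the terms that do not look like HPO identifiers."""
    return [term for term in terms if HPO_ID.fullmatch(term) is None]


def resolve_vcf_path(
    cli_vcf: Optional[str], phenopacket_vcf: Optional[str]
) -> Optional[str]:
    """Pick the VCF path; a path given via --vcf wins over the phenopacket."""
    return cli_vcf if cli_vcf is not None else phenopacket_vcf
```

For instance, `resolve_vcf_path("a.vcf", "b.vcf")` yields `"a.vcf"`, mirroring the note above.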
Optional parameters
SvAnna lets you fine-tune the prioritization using a number of optional parameters. For clarity, the options are grouped as follows:
Run options
--frequency-threshold - threshold for labeling an SV as common. If the query SV v overlaps with a variant pv from the population variant databases and pv has a frequency above the threshold, then v is considered to be common. The value is provided as a percentage (default: 1).
--overlap-threshold - threshold to determine if an SV matches a variant from the population variant databases. The value is provided as a percentage (default: 80).
--min-read-support - minimum number of reads supporting the presence of the alt allele required to include a variant in the analysis (default: 3).
--n-threads - number of threads used to prioritize the SVs (default: 2).
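To illustrate how the two thresholds interact, here is a minimal sketch. It assumes that a "match" means reciprocal overlap of half-open intervals on the same contig; the source does not state the exact overlap definition SvAnna uses, so treat the formula and both function names as assumptions.

```python
from typing import Iterable, Tuple


def reciprocal_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> float:
    """Shared bases divided by the longer interval, as a fraction in [0, 1]."""
    shared = min(a_end, b_end) - max(a_start, b_start)
    if shared <= 0:
        return 0.0
    return shared / max(a_end - a_start, b_end - b_start)


def is_common(
    query: Tuple[int, int],
    population: Iterable[Tuple[int, int, float]],
    overlap_threshold: float = 80.0,
    frequency_threshold: float = 1.0,
) -> bool:
    """True if a matching population variant has frequency above the threshold.

    `population` holds (start, end, frequency_percent) tuples; both thresholds
    are percentages, like the CLI options described above.
    """
    q_start, q_end = query
    for p_start, p_end, frequency in population:
        overlap = 100.0 * reciprocal_overlap(q_start, q_end, p_start, p_end)
        if overlap >= overlap_threshold and frequency > frequency_threshold:
            return True
    return False
```

With the defaults, a query spanning 100-200 is labeled common by a population variant spanning 110-200 with a 5% frequency (90% reciprocal overlap), but not by one with a 0.5% frequency.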
Output options
--no-breakends - do not report breakends/translocations in the HTML report (default: false).
--output-format - comma separated list of output formats to use for writing the results (default: html).
Note
See Output formats section for more details.
--out-dir - path to a folder where to write the output files (default: current working directory).
--prefix - prefix for output files (default: based on the input VCF name).
--report-top-variants - include top n variants in the HTML report (default: 100).
Note
Beware, the HTML report becomes rather large when including large number of variants.
--uncompressed-output - the tabular and VCF output files are compressed by default. Use this flag if you want to disable compressing the output files (default: false).
SvAnna configuration
--term-similarity-measure - phenotype term similarity measure, use one of {RESNIK_SYMMETRIC, RESNIK_ASYMETRIC} (default: RESNIK_SYMMETRIC).
--ic-mica-mode - the mode for getting information content of the most informative common ancestors for terms \(t_1\) and \(t_2\). Use one of {DATABASE, IN_MEMORY} (default: DATABASE).
--promoter-length - number of bases prepended to a transcript and evaluated as a promoter region (default: 2000).
--promoter-fitness-gain - set to 0. to score the promoter variants as strictly as coding variants or to 1. to completely disregard the promoter variants (default: 0.6).
-v - set logging output granularity. The option can be set multiple times (e.g. -vv) to increase logging output.
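The effect of --promoter-length can be pictured with a small sketch. It assumes 0-based half-open coordinates on the positive strand and that "prepended" means upstream of the transcription start site, which for a minus-strand transcript lies past the transcript end. The function names are hypothetical and the strand handling is an assumption, not SvAnna's documented behavior.

```python
from typing import Tuple


def promoter_region(
    tx_start: int, tx_end: int, strand: str, promoter_length: int = 2000
) -> Tuple[int, int]:
    """Return the half-open (start, end) interval treated as the promoter."""
    if strand == "+":
        # Upstream of the TSS means lower coordinates; clamp at contig start.
        return max(0, tx_start - promoter_length), tx_start
    if strand == "-":
        # On the minus strand the TSS is at tx_end, so upstream is higher.
        return tx_end, tx_end + promoter_length
    raise ValueError(f"unknown strand: {strand!r}")


def overlaps_promoter(
    sv_start: int, sv_end: int, tx_start: int, tx_end: int,
    strand: str, promoter_length: int = 2000,
) -> bool:
    """True if the SV interval shares at least one base with the promoter."""
    p_start, p_end = promoter_region(tx_start, tx_end, strand, promoter_length)
    return sv_start < p_end and p_start < sv_end
```

A larger --promoter-length widens this window, so more upstream variants would be evaluated as promoter variants.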
See the next section to learn more about the SvAnna Output formats, and the Examples section to see how SvAnna prioritizes various SV classes.
Output formats
SvAnna supports storing results in 4 output formats: HTML, VCF, CSV, and TSV. Use the --output-format option to select one or more of the desired output formats (e.g. --output-format html,vcf).
HTML output format
SvAnna creates an HTML file with the analysis summary and with variants sorted by the \(PSV\) score in descending order.
By default, the top 100 variants are included in the report. The number of reported variants can be adjusted with the --report-top-variants option.
The report consists of several parts:
Analysis summary - Details of HPO terms of the proband, paths of the input files, and the analysis parameters.
Variant counts - Breakdown of the number of variants by variant type and category.
Prioritized SVs - Visualizations of the prioritized variants.
Note
Only the variants that passed all the filters are visualized in the Prioritized SVs section.
The --no-breakends option excludes breakends/translocations from the report.
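The selection rules described in this section can be summarized in a short sketch. It is only an illustration of the documented behavior: the dictionary keys and the function name are hypothetical, and the order in which SvAnna applies the filtering and the top-n cut is an assumption.

```python
from typing import Dict, List, Union

Variant = Dict[str, Union[str, float]]


def report_candidates(
    variants: List[Variant], top_n: int = 100, no_breakends: bool = False
) -> List[Variant]:
    """Select the variants that would be shown among the prioritized SVs."""
    kept = [
        v for v in variants
        if not v["failed_filters"]                          # passed all filters
        and not (no_breakends and v["vtype"] == "BND")      # --no-breakends
    ]
    # Highest PSV score first, then keep the first top_n entries.
    kept.sort(key=lambda v: v["psv"], reverse=True)
    return kept[:top_n]
```

Here `top_n` plays the role of --report-top-variants and `no_breakends` the role of --no-breakends.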
VCF output format
When vcf is included in the --output-format option, a VCF file with all input variants is created.
The prioritization adds a novel INFO field to each variant:
PSV - an INFO field containing the \(PSV\) score for the variant.
Note
--report-top-variants option has no effect for the VCF output format.
Add --uncompressed-output flag if you want to get uncompressed VCF file.
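Downstream scripts can read the score back from the INFO column. The sketch below assumes standard tab-separated VCF records in which INFO is the eighth column and the score is stored as `PSV=<number>`; the function name is hypothetical, and the record used in the usage note is made up for illustration.

```python
from typing import Optional


def psv_from_vcf_line(line: str) -> Optional[float]:
    """Return the PSV value of a VCF record, or None if it is absent."""
    if line.startswith("#"):
        return None  # header and meta-information lines carry no score
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return None  # not a complete VCF record
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        if key == "PSV" and value:
            return float(value)
    return None
```

For example, a record whose INFO column reads `SVTYPE=DEL;END=6124718;PSV=47.26` yields `47.26`.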
CSV/TSV output format
To write the prioritization results into a CSV (or TSV) file, use csv (tsv) in the --output-format option.
The results are written into a tabular file with the following columns:
contig - name of the contig/chromosome (e.g. 1, 2, X).
start - 0-based start coordinate (excluded) of the variant on positive strand.
end - 0-based end coordinate (included) of the variant on positive strand.
id - variant ID as it was present in the input VCF file.
vtype - variant type, one of {DEL, DUP, INV, INS, BND, CNV}.
failed_filters - the names of filters that the variant failed to pass. The names are separated by a semicolon (;). The possible values are:
* filter - the variant failed previous VCF filters: at least one filter flag other than PASS is present in the variant VCF line.
* coverage - the variant is supported by fewer reads than specified by the --min-read-support option.
psv - the \(PSV\) score value.
| contig | start | end | id | vtype | failed_filters | psv |
|---|---|---|---|---|---|---|
| 11 | 31130456 | 31671718 | abcd | DEL | | 109.75766900764305 |
| 18 | 46962113 | 46969912 | efgh | DUP | filter;coverage | 3.2 |
| … | … | … | … | … | … | … |
Note
The --report-top-variants option has no effect for the CSV and TSV output formats.
Add the --uncompressed-output flag if you want to get an uncompressed CSV/TSV file.
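The tabular output is convenient for scripting. The following sketch assumes an uncompressed CSV with a header row that uses the column names listed above, and numeric psv values for the rows it keeps; the function names are hypothetical.

```python
import csv
from typing import Dict, Iterable, List


def passing_variants_by_psv(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Keep rows with no failed filters and sort them by PSV, highest first."""
    rows = [
        row for row in csv.DictReader(lines)
        if not row["failed_filters"].strip()
    ]
    return sorted(rows, key=lambda row: float(row["psv"]), reverse=True)


def variant_length(row: Dict[str, str]) -> int:
    """Number of bases spanned by the variant.

    The start coordinate is excluded and the end coordinate is included,
    so the span covers end - start bases.
    """
    return int(row["end"]) - int(row["start"])
```

An open file handle can be passed directly as `lines`, e.g. `with open("results/example.csv", newline="") as fh: passing_variants_by_psv(fh)`, where the file name is only an example.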
Examples
This section shows how SvAnna prioritizes various structural variant classes. The resulting HTML reports contain graphics that are reported in the supplement of SvAnna paper.
The examples work with variants stored in the example.vcf file. The VCF file is stored in the SvAnna GitHub repository.
Use the run_examples.sh script to generate HTML reports for all cases described below. Note that you must enter the paths to the SvAnna JAR file, the data directory, and example.vcf into the script before running.
Single exon deletion
A deletion of 6.93 kb (chr17:31,150,798-31,157,725del) affecting NF1 that was assigned a PSV score of 124.98.
The deletion affects exon 2 of several NF1 transcripts.
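The sizes quoted in this section follow from the listed coordinates as the difference between end and start. A minimal sketch of that arithmetic, assuming the `contig:start-end` notation used for the deletions and duplications here (the function name is hypothetical):

```python
import re


def region_length(region: str) -> int:
    """Length in bases implied by a 'contig:start-end' string."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)(?:del|dup)?", region)
    if match is None:
        raise ValueError(f"unsupported region format: {region!r}")
    # Thousands separators are cosmetic and removed before conversion.
    start = int(match.group(2).replace(",", ""))
    end = int(match.group(3).replace(",", ""))
    return end - start
```

For the NF1 deletion, `region_length("chr17:31,150,798-31,157,725del")` gives 6927 bases, i.e. the 6.93 kb reported above.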
Pathogenic variants in NF1 are associated with neurofibromatosis type 1 (OMIM:162200).
The phenotypic features curated for the proband UAB-1 were:
HP:0007565 - Multiple cafe-au-lait spots
HP:0009732 - Plexiform neurofibroma
HP:0009735 - Spinal neurofibromas
HP:0009736 - Tibial pseudarthrosis
Data were curated from a published case report in Decoding NF1 Intragenic Copy-Number Variations.
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0007565 --phenotype-term HP:0009732 --phenotype-term HP:0009735 --phenotype-term HP:0009736
Deletion of multiple exons
A deletion of 10.26 kb (chr17:43,100,079-43,110,335del) affecting BRCA1 that was assigned a PSV score of 272.91.
The deletion affects three BRCA1 exons. Pathogenic variants in BRCA1 are associated with Breast-ovarian cancer, familial, 1 (OMIM:604370).
The phenotypic feature curated for this case was:
HP:0003002 - Breast carcinoma
Data were curated from a published case report The first case report of a large deletion of the BRCA1 gene in Croatia.
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0003002
Deletion of multiple genes
Deletion of 481.73 kb (chr2:109,923,337-110,405,062del) affecting MALL, NPHP1, and MTLN that was assigned a PSV score of 16.41.
Pathogenic variants in NPHP1 are associated with Joubert syndrome 4 (OMIM:609583).
The phenotypic features curated for this case were:
HP:0003774 - Stage 5 chronic kidney disease
HP:0001320 - Cerebellar vermis hypoplasia
HP:0002078 - Truncal ataxia
HP:0000618 - Blindness
HP:0000508 - Ptosis
HP:0002419 - Molar tooth sign on MRI
HP:0011933 - Elongated superior cerebellar peduncle
HP:0002070 - Limb ataxia
HP:0000543 - Optic disc pallor
HP:0000589 - Coloboma
Data were curated from a published case report Whole-exome sequencing and digital PCR identified a novel compound heterozygous mutation in the NPHP1 gene in a case of Joubert syndrome and related disorders.
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0003774 --phenotype-term HP:0001320 --phenotype-term HP:0002078 --phenotype-term HP:0000618 --phenotype-term HP:0000508 --phenotype-term HP:0002419 --phenotype-term HP:0011933 --phenotype-term HP:0002070 --phenotype-term HP:0000543 --phenotype-term HP:0000589
Duplication of coding sequence
Duplication of 36 bp (chr13:72835296-72835332dup) affecting PIBF1 that was assigned a PSV score of 3.29.
Pathogenic variants in PIBF1 are associated with Joubert syndrome 33 (OMIM:617767).
The phenotypic features curated for this case were:
HP:0032417 - Periglomerular fibrosis
HP:0000076 - Vesicoureteral reflux
HP:0002079 - Hypoplasia of the corpus callosum
HP:0001541 - Ascites
HP:0000540 - Hypermetropia
HP:0011968 - Feeding difficulties
HP:0001250 - Seizure
HP:0000490 - Deeply set eye
HP:0001263 - Global developmental delay
HP:0001284 - Areflexia
HP:0002240 - Hepatomegaly
HP:0001290 - Generalized hypotonia
HP:0031200 - Hyaline casts
HP:0011800 - Midface retrusion
HP:0000090 - Nephronophthisis
HP:0000092 - Renal tubular atrophy
HP:0001919 - Acute kidney injury
HP:0012650 - Perisylvian polymicrogyria
HP:0002419 - Molar tooth sign on MRI
HP:0002119 - Ventriculomegaly
HP:0000105 - Enlarged kidney
Data were curated from a published case report A biallelic 36-bp insertion in PIBF1 is associated with Joubert syndrome
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0032417 --phenotype-term HP:0000076 --phenotype-term HP:0002079 --phenotype-term HP:0001541 --phenotype-term HP:0000540 --phenotype-term HP:0011968 --phenotype-term HP:0001250 --phenotype-term HP:0000490 --phenotype-term HP:0001263 --phenotype-term HP:0001284 --phenotype-term HP:0002240 --phenotype-term HP:0001290 --phenotype-term HP:0031200 --phenotype-term HP:0011800 --phenotype-term HP:0000090 --phenotype-term HP:0000092 --phenotype-term HP:0001919 --phenotype-term HP:0012650 --phenotype-term HP:0002419 --phenotype-term HP:0002119 --phenotype-term HP:0000105
Multigene inversion
Inversion of ~12.23 kb (inv(chr3)(9725702; 9737931)) that disrupts the coding sequence of BRPF1 was assigned PSV score of 8.01.
Pathogenic variants in BRPF1 are associated with Intellectual developmental disorder with dysmorphic facies and ptosis (OMIM:617333).
The phenotypic features curated for this case were:
HP:0000316 - Hypertelorism
HP:0000494 - Downslanted palpebral fissures
HP:0000431 - Wide nasal bridge
HP:0000286 - Epicanthus
HP:0000311 - Round face
HP:0012368 - Flat face
HP:0000486 - Strabismus
HP:0000508 - Ptosis
HP:0002949 - Fused cervical vertebrae
HP:0002194 - Delayed gross motor development
HP:0000750 - Delayed speech and language development
HP:0002342 - Intellectual disability, moderate
HP:0011150 - Myoclonic absence seizure
HP:0002069 - Bilateral tonic-clonic seizure
HP:0001252 - Hypotonia
Data were curated from a published case report Pathogenic 12-kb copy-neutral inversion in syndromic intellectual disability identified by high-fidelity long-read sequencing
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0000286 --phenotype-term HP:0002069 --phenotype-term HP:0000494 --phenotype-term HP:0002342 --phenotype-term HP:0000486 --phenotype-term HP:0000750 --phenotype-term HP:0000431 --phenotype-term HP:0001252 --phenotype-term HP:0002194 --phenotype-term HP:0012368 --phenotype-term HP:0011150 --phenotype-term HP:0002949 --phenotype-term HP:0000508 --phenotype-term HP:0000316 --phenotype-term HP:0000311
Deletion affecting transcription start site
Deletion of ~1.57 kb (chrX:64,205,190-64,206,761del) affecting transcription start site of AMER1 was assigned PSV score of 9.05.
Pathogenic variants in AMER1 are associated with Osteopathia striata with cranial sclerosis (OMIM:300373).
The phenotypic features curated for this case were:
HP:0001561 - Polyhydramnios
HP:0002684 - Thickened calvaria
HP:0000256 - Macrocephaly
HP:0000316 - Hypertelorism
HP:0031367 - Metaphyseal striations
HP:0002744 - Bilateral cleft lip and palate
HP:0002781 - Upper airway obstruction
HP:0001004 - Lymphedema
HP:0000750 - Delayed speech and language development
Data were curated from a published case report Deletion of Exon 1 in AMER1 in Osteopathia Striata with Cranial Sclerosis.
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0001561 --phenotype-term HP:0000750 --phenotype-term HP:0002684 --phenotype-term HP:0002781 --phenotype-term HP:0000316 --phenotype-term HP:0031367 --phenotype-term HP:0002744 --phenotype-term HP:0000256 --phenotype-term HP:0001004
Deletion affecting promoter region
A deletion of 13 bp (chr12:6,124,705-6,124,718del) located in the core promoter region of VWF was assigned PSV score of 47.26.
In the original publication, the deletion was shown to lead to aberrant binding of Ets transcription factors to the site of the deletion (30 bp upstream of ENST00000261405.10) and thereby reduce VWF expression.
Pathogenic variants in VWF are associated with von Willebrand disease (OMIM:193400).
The phenotypic features curated for this case were:
HP:0011890 - Prolonged bleeding following procedure
HP:0000978 - Bruising susceptibility
HP:0012147 - Reduced quantity of Von Willebrand factor
Data were curated from a published case report Functional characterization of a 13-bp deletion (c.-1522_-1510del13) in the promoter of the von Willebrand factor gene in type 1 von Willebrand disease.
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0011890 --phenotype-term HP:0000978 --phenotype-term HP:0012147
Translocation disrupting a gene sequence
A translocation (t(chr3:11,007,014; chr4:139,383,334)) affecting SLC6A1 was assigned PSV score of 4.51.
Pathogenic variants in SLC6A1 are associated with Myoclonic-atonic epilepsy (OMIM:616421).
The phenotypic features curated for this case were:
HP:0000252 - Microcephaly
HP:0000446 - Narrow nasal bridge
HP:0000272 - Malar flattening
HP:0000219 - Thin upper lip vermilion
HP:0000179 - Thick lower lip vermilion
HP:0002650 - Scoliosis
HP:0002987 - Elbow flexion contracture
HP:0006380 - Knee flexion contracture
HP:0001250 - Seizure
HP:0001263 - Global developmental delay
HP:0001276 - Hypertonia
Data were curated from a published case report Phenotypic consequences of gene disruption by a balanced de novo translocation involving SLC6A1 and NAA15
Command
$ java -jar svanna-cli.jar prioritize -d path/to/svanna-data --vcf example.vcf --phenotype-term HP:0000252 --phenotype-term HP:0000446 --phenotype-term HP:0000272 --phenotype-term HP:0000219 --phenotype-term HP:0000179 --phenotype-term HP:0002650 --phenotype-term HP:0002987 --phenotype-term HP:0006380 --phenotype-term HP:0001250 --phenotype-term HP:0001263 --phenotype-term HP:0001276