GenomeVIP¶
Welcome to the Genome Variant Investigation Platform (“GenomeVIP”) on Read the Docs.
GenomeVIP is a web platform for performing variant discovery and annotation on the Amazon Web Services (AWS) cloud or on local high-performance computing clusters.
Here you will find a user guide (work in progress) for GenomeVIP as well as some notes for building GenomeVIP from source.
Have feedback or comments? Please file an issue here.
Contents¶
Quick Start¶
Using Amazon-hosted GenomeVIP server¶
SHORTCUTS / TIPS
The AWS Management Console lists Shortcuts and Recently Viewed Services, Quick Starts, and AWS Services. It can be reached in two ways:
- Once logged in to your AWS account, select the menu item Services > Console Home, if that option appears; or
- click on the orange box icon in the upper left corner of the page.
The EC2 Dashboard can be reached in two ways:
- Once logged in to your AWS account, select the menu item Services > EC2, if that option appears; or
- click on the orange box icon in the upper left corner of the page, and then in section AWS Services below, in section “Compute”, select EC2.
The list of running instances can be reached from your EC2 Dashboard in two ways:
- in the left-hand panel, select the menu item Instances > Instances; or
- in the Resources section at the top of the page, click the link “Running Instances”.
The S3 Console can be reached in two ways:
- Once logged in to your AWS account, select the menu item Services > S3, if that option appears; or
- click on the orange box icon in the upper left corner of the page, and then in section AWS Services below, in section “Storage & Content Delivery”, select S3.
PROCEDURE
Selecting server image
- Log in to your AWS account and navigate to the EC2 Dashboard.
- In the left-hand panel, in the Images menu, click on “AMIs”.
- Locate the filter/search field and enter “genomevip”. Then in the adjacent menu to the left, select “Public images”. The table below will update automatically.
- Under “AMI Name”, locate the GenomeVIP server images, and select one (the most recent version is recommended) by checking the corresponding checkbox in the left column. Security tip: verify that the image has Owner 785242596344. When ready to proceed to the image configuration step, click on Launch near the top of the page.
Configuring the server
Select an instance type (minimum recommendation: 1 or more vCPUs with 4 GiB memory, e.g. m1.medium).
Click on item “3. Configure Instance” near the top of the page, and then set the following:
OPTION | VALUE | COMMENT |
Number of instances | 1 | |
Network | EC2 Classic | Virtual private cloud (VPC) should also work |
Availability Zone | No preference | IMPORTANT: This must generally match the availability zone of any EBS data volumes you plan to use |
IAM role | None | Choice may vary depending on how your AWS account has been configured for your use |
Click on item “6. Configure Security Group” near the top of the page. (Items 4 and 5 are not necessary.) The suggested security group information is likely adequate. On the left-hand side farther down the page, click the button Add Rule, and then in the dropdown menu that appears, select HTTPS and leave the port range set at “443”, which GenomeVIP expects. With these settings, your server instance is usable by anyone; if more security is desired, modify the source IP address setting.
Click on item “7. Review” at the top of the page, and if the displayed information is in order, click on Launch in the bottom right corner of the page.
Selecting key pair. The dialog pop-up box allows you to use an existing key pair or create a new one. Selecting “Proceed without a key pair” is sufficient when instantiating a GenomeVIP server simply for running analyses. (Developers will want to use a key pair instead.) Check the acknowledgment box, and then click on Launch Instances.
Launch Status. Review the information displayed, and then to view instances on your EC2 Dashboard, click on View Instances in the bottom right corner of the page.
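The security group step above notes that the HTTPS rule leaves the server usable by anyone unless the source IP address setting is changed. AWS expresses rule sources in CIDR notation, where a block such as 0.0.0.0/0 matches every address. The following is a minimal sketch for checking a candidate source value before entering it in the console; the function names are hypothetical and are not part of GenomeVIP or AWS.

```python
import ipaddress


def is_open_to_anyone(cidr: str) -> bool:
    """Return True if the CIDR block matches every address (prefix length 0)."""
    network = ipaddress.ip_network(cidr, strict=False)
    return network.prefixlen == 0


def single_host_cidr(ip: str) -> str:
    """Return the CIDR block that matches exactly one address."""
    address = ipaddress.ip_address(ip)
    # max_prefixlen is 32 for IPv4 and 128 for IPv6
    return f"{address}/{address.max_prefixlen}"


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only example address
    print(is_open_to_anyone("0.0.0.0/0"))
    print(single_host_cidr("203.0.113.7"))
```

Restricting the rule to a single-host block limits access to one workstation; a wider block may be more practical for a workgroup.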
Accessing the user interface
Navigate your browser to the list of running instances, locate the running instance of interest, and check the corresponding checkbox in the left column. The panel at the bottom of the web page now displays information about the instance.
In that lower panel, the “Description” tab should already be active (if not, click on it to make it active). Note the entry Public IP (hereafter referred to as <publicIP>), which you will need to access the GenomeVIP interface.
Point your web browser to https://<publicIP>/~genomevip to reach the GenomeVIP home page.
Note: you may receive a warning notification (Your connection is not secure, <publicIP> uses an invalid security certificate, Error code: SEC_ERROR_UNKNOWN_ISSUER, Can’t verify the identity of the website, etc.) because we have used a self-signed certificate. You will need to add an exception in order to proceed further. More information can be obtained from, e.g. Mozilla support.
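As a small illustration of the address pattern above, the sketch below builds the GenomeVIP URL from the Public IP value copied from the EC2 instance description. The function name is hypothetical; only the https://<publicIP>/~genomevip pattern comes from this guide.

```python
import ipaddress


def genomevip_url(public_ip: str) -> str:
    """Build the GenomeVIP home page URL for a server's public IP address."""
    # ip_address() raises ValueError for anything that is not a valid address
    address = ipaddress.ip_address(public_ip.strip())
    # IPv6 literals must be bracketed inside a URL
    host = f"[{address}]" if address.version == 6 else str(address)
    return f"https://{host}/~genomevip"


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only example address
    print(genomevip_url("203.0.113.7"))
```

Validating the address first catches copy-paste errors (such as a trailing label from the console) before the browser reports an unrelated connection failure.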
Configuring computations: see General usage
Shutting down the server
All EC2 instances in “running” mode contribute to the charges billed to your account. After you submit your analysis, you may wish to change this mode (accessible from the “list of instances” page, under menu item Actions > Instance State):
- Stop: put the instance into sleep/hibernate mode
- Start: put the instance into running mode at a new address (see “Accessing the user interface” above)
- Terminate: end the instance entirely
Using locally hosted GenomeVIP server¶
This section applies to installations on physical machines as well as on virtualization software products (e.g. VirtualBox).
PROCEDURE
Navigate your web browser to the appropriate site or address (e.g. https://192.168.57.1/~genomevip).
The actual address may vary depending on how the underlying web server running GenomeVIP was configured and, accordingly, where GenomeVIP was installed. Check with your administrator, if needed.
Configuring computations: see General usage.
Shutting down the server
This step is likely needed only in virtual machine deployments; refer to your virtualization software documentation.
General usage¶
The flowchart along the left-hand side can be used as a guide to the typical order for configuring a computation. Although the pages can be visited in any order, the dynamic content updates according to the developing configuration. The display text and alerts will provide assistance.
Select Accounts
- For Amazon EC2/Cloud (i.e., AWS), generate a new SessionID (Option 3) or enter a previous SessionID (Option 1).
- Note: the SessionID provides user-account-type functionality, allowing multiple users to use the same server instance. SessionIDs persist until the server is terminated or rebooted and enable you to submit additional computations to a running runtime instance.
- For local clusters, provide your login credentials to your local cluster.
Select Genomes
- Load names and locations of bam files, reference genomes, and index files (e.g. by uploading a file(s) from your computer, by pointing to a remote file, or by obtaining a remote directory listing):
- For AWS:
- EBS volumes: select the volumes you wish to use and the corresponding lists of files (and/or upload files). Files should be given as their full path on that particular volume. Click Apply lists when done.
- Path to file: enter a valid S3 path to a list file (e.g. s3://bucket/path/to/listfile.txt) and click Retrieve file
- File upload: Browse... to a file on your computer and then click Upload File
- 1000 Genomes files: select one of the pre-formed lists based on AWS’s 1000 Genomes mirror and then click Load list
- For local clusters:
- Select samples and a reference.
- Double-click on the bam name to transfer it to the “Selected bams” box. A search box with live update is available.
- In the “Selected bams” box, arrange the bams in order according to the desired study:
- somatic: matched pairs (in the order of tumor followed by normal)
- trio: triples (in the order of father, mother, child)
- Note: missing index files will be generated automatically at run time. If necessary, bam files will be sorted as well. You may opt to save a copy of such generated files to the storage location where the results of the computations are sent.
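The ordering rules above (tumor followed by normal for somatic studies; father, mother, child for trios) can be checked before arranging files in the interface. The following is a minimal sketch, not GenomeVIP code; the function name, the role labels used as dictionary keys, and the example file names are hypothetical.

```python
from typing import Dict, List, Sequence

# Order of roles within each group, as described in this guide
GROUP_ROLES = {
    "somatic": ("tumor", "normal"),
    "trio": ("father", "mother", "child"),
}


def group_bams(selected: Sequence[str], study: str) -> List[Dict[str, str]]:
    """Split an ordered list of bam names into labeled pairs or triples."""
    roles = GROUP_ROLES[study]
    size = len(roles)
    if len(selected) % size != 0:
        raise ValueError(
            f"{study} study needs groups of {size}; got {len(selected)} files"
        )
    # Walk the list in steps of the group size and label each member
    return [
        dict(zip(roles, selected[i:i + size]))
        for i in range(0, len(selected), size)
    ]


if __name__ == "__main__":
    for group in group_bams(["t1.bam", "n1.bam", "t2.bam", "n2.bam"], "somatic"):
        print(group)
```

Printing the labeled groups makes a swapped tumor/normal pair or a missing family member easy to spot before submission.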
Execution Profile
This series of tabs allows users to configure computations at a high level across multiple tools or at a low level across individual tools. The collection of selected tools and corresponding parameters comprise the execution profile. There are three classes of tabs: 1) Quick Setup, 2) Individual tools, and 3) Post-discovery tools.
Quick Setup:
The four main tasks available here are optional and independent of one another. At any time you may also optionally visit the other tabs for further configuring such as de/activating any tools appropriately and fine-tuning parameters.
Start configuring a new execution profile: Select a pre-defined Run mode (germline, somatic, de novo/family trio) and/or Parameter set (these can be tuned further in each tool tab). Then in Step 3, click Apply Profile to propagate settings to the other tabs.
Upload execution profile from uploaded file: Click Browse..., select a file from your computer using the dialog box that appears, and then click Upload File. Then in Step 3, click Apply Profile to propagate settings to the other tabs.
TIP: Execution profiles may be re-used across different computational configurations to ensure consistency and reproducibility, an approach that may be helpful in analyzing batches of sample sets in piecewise fashion.
Select genomic regions: Select one option. For the user-defined list option, you can type or copy-paste directly into the textbox or supply a local file (click Browse..., select a local file from the dialog box that appears, click Upload File). The format of the list must be either (a) <chr> <start> <stop> triples (one per line), or (b) a comma-separated list such as 1-4,X,6:1000,5:1000-2000,22. Prepend the chromosome numbers with chr if indicated by the reference genome. Click Apply Profile to propagate settings to the other tabs.
Reset: Reinitialize all computational options to their default (possibly empty) values. CAUTION: This operation also clears the visible account information (SessionIDs are preserved but must be re-entered).
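To make the comma-separated region format described under “Select genomic regions” concrete, the sketch below parses a list such as 1-4,X,6:1000,5:1000-2000,22 into (chromosome, start, stop) tuples. This is an illustration only, not GenomeVIP’s parser. It assumes that an item like 1-4 denotes a range of numbered chromosomes and that an item like 6:1000 gives only a start coordinate; the guide does not spell out either interpretation, so the stop is left unset in that case. The function name and the chr_prefix parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Region = Tuple[str, Optional[int], Optional[int]]


def parse_region_list(text: str, chr_prefix: bool = False) -> List[Region]:
    """Parse a comma-separated region list into (chrom, start, stop) tuples."""
    regions: List[Region] = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        start: Optional[int] = None
        stop: Optional[int] = None
        if ":" in item:
            # chrom:start or chrom:start-stop
            chrom, _, span = item.partition(":")
            start_text, dash, stop_text = span.partition("-")
            start = int(start_text)
            stop = int(stop_text) if dash else None
            chroms = [chrom]
        elif "-" in item:
            # Assumed to be a range of numbered chromosomes, e.g. 1-4
            first, _, last = item.partition("-")
            chroms = [str(n) for n in range(int(first), int(last) + 1)]
        else:
            # A whole chromosome, e.g. X or 22
            chroms = [item]
        for chrom in chroms:
            name = f"chr{chrom}" if chr_prefix else chrom
            regions.append((name, start, stop))
    return regions


if __name__ == "__main__":
    for region in parse_region_list("1-4,X,6:1000,5:1000-2000,22"):
        print(region)
```

Setting chr_prefix=True mirrors the instruction to prepend chr when the reference genome uses that naming.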
Individual Tools:
A selection of tools from among those most often relied upon. GenomeVIP associates these tools with three common study types in the following way:
Germline | VarScan, GATK, BreakDancer, Pindel, Genome STRiP |
Somatic | VarScan, MuTect, Strelka, BreakDancer, Pindel |
De novo/Trio | VarScan, BreakDancer, Pindel |
Documentation on these tools can be obtained by following the links to the tools’ home pages in Further Information.
Post-discovery Tools:
Options to filter and annotate raw variants are provided and are applied in the order displayed.
Filtering
Identify/Remove dbSNP variants: Database options include the dbSNP database (provided on the GenomeVIP runtime image) or a user-supplied VCF file, accessible via public FTP/HTTP/HTTPS or user’s/public AWS S3 location.
Identify/Remove false positives: This approach is based on the bam-readcount tool and a series of heuristics for identifying variant calls of lower quality and is implemented via the VarScan tool. The parameter values shown are considered by some to be generally appropriate.
Note: This option is independent of the false-positives annotation provided by the panel-of-normals option available with MuTect somatic variant discovery.
Annotation
- The Variant Effect Predictor (VEP) software with human genome reference has been installed on the GenomeVIP runtime image. A user-provided VCF file, accessible via public FTP/HTTP/HTTPS or user’s/public AWS S3 location, may alternatively be supplied.
Submit
Computing/storage resources and additional information
For AWS:
Select a compute resource: This can be a new cluster, or, if you have previously instantiated a cluster under the current GenomeVIP SessionID, a running cluster. Re-using an existing resource may have certain cost efficiencies.
Select a “bucket” for storing results: Buckets are uniquely named directories or folders in AWS’s S3 resource. Select an existing bucket, or create a new one by clicking Create a new bucket (the list will update automatically).
Note: buckets can also be viewed/created in your S3 Console (see Shortcuts/Tips under Quick Start).
Additional information:
- Supply the full paths of any files required by the configuration.
- Optionally provide a comment that will appear in the generated execution profile.
For local clusters:
- Select a compute resource: No user selection is provided here; the resource is actually specified through the fields under Accounts.
- Additional information:
- Supply the full paths of any files required by the configuration.
- Optionally provide a comment that will appear in the generated execution profile.
- Provide the name of a working directory (it will be created, if necessary) into which GenomeVIP will copy the generated execution profile and master job script. This directory is assumed to be relative to your home directory unless overridden by specifying a full path.
- Select submit action: Choose whether to execute (default is ‘yes’) the job script in the working directory. (Here, “power users” may wish just to transmit the script for inspection or modification by hand, after which time it can be run as a standard shell script.)
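The working-directory rule for local clusters (a name is taken relative to your home directory unless a full path is given) can be sketched as follows. This illustrates the stated rule only; it is not GenomeVIP’s implementation, and the function name and example paths are hypothetical.

```python
import posixpath


def resolve_working_dir(name: str, home: str) -> str:
    """Resolve a working directory name the way the guide describes."""
    if posixpath.isabs(name):
        # A full path overrides the home-relative default
        return posixpath.normpath(name)
    return posixpath.normpath(posixpath.join(home, name))


if __name__ == "__main__":
    print(resolve_working_dir("gvip/run1", "/home/alice"))
    print(resolve_working_dir("/scratch/run1", "/home/alice"))
```

posixpath is used rather than os.path so that the result follows cluster-side (POSIX) path rules regardless of the machine running the sketch.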
Pre-submission checks (available at any time during the configuration process):
- Click Preview to display the current execution profile, or retrieve it as a file by clicking Download.
- Click Validate to have GenomeVIP perform basic checks and flag certain misconfigurations.
- Click Submit to perform the submit action specified above.
- Finally, clicking Reset sets all options to their default (possibly empty) values. (This is the same behavior described above for Execution Profile > Quick Setup > Reset).
Results
- For AWS:
- Navigate your web browser to your Amazon S3 Console (see Shortcuts/Tips under Quick Start)
- In the list of buckets on the left-hand side, click on the bucket you specified when submitting the job. After the page updates, click on the folder corresponding to the jobID assigned to the computation.
- The “results” folder contains downloadable files containing variant calls according to sample sets. Inter
- The “status” folder displays sentinel files indicating which tasks/computations did not go to completion as expected. These filenames can be used to create additional jobs to provide the missing results.
- For local clusters:
- Log in to your cluster account.
- Change to the working directory specified at the time of job submission.
- The “results” directory contains a summary of variant calls obtained.
Options
This tab provides access to some additional features for working with AWS:
- AMI specification: The GenomeVIP runtime images are expected to be revised to include updates to the underlying operating system as well as minor tool bugfixes and feature enhancements. Here, users may enter an alternative AMI ID in the specified format for instantiations rather than the default ID programmed into the GenomeVIP server files.
- Settings: By default, GenomeVIP employs secure transfers (HTTPS) to/from S3 as implemented by the S3 Tools package and requests S3 server-side encryption, both of which options may be disabled. The default settings are recommended even when working with public data.
- Cluster management: Running EC2 clusters associated with the current GenomeVIP SessionID are listed, each with the option to be terminated from the GenomeVIP interface instead of from the EC2 Dashboard.
Customized Usage¶
GenomeVIP’s server and runtime environments can be customized, updated, or extended. Some examples of how users might customize its usage are listed below.
- Execution profiles from a previous computation may be modified prior to uploading to the server during job configuration. In this way, specific numerical values or datafile paths can be altered quickly and directly, without locating the corresponding settings in the interface. This approach may be helpful to users who wish to import tool parameter settings from other pipelines into GenomeVIP.
- Users can alternatively specify the location of custom annotation or filtering VCFs that are accessible to GenomeVIP via public FTP/HTTP/HTTPS or the user’s/public Amazon S3 cloud storage location. This option is very convenient because neither server nor runtime image modification is required.
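Because custom annotation or filtering VCFs must be reachable via FTP, HTTP, HTTPS, or an S3 location, it can help to check a location string before entering it. The sketch below is illustrative only: the function name and returned keys are hypothetical, and the check covers the URL form only, not whether the file exists or is publicly readable.

```python
from urllib.parse import urlparse

# Access methods named in this guide
SUPPORTED_SCHEMES = {"ftp", "http", "https", "s3"}


def describe_vcf_location(location: str) -> dict:
    """Split a VCF location into scheme, host (or S3 bucket), and path."""
    parsed = urlparse(location)
    scheme = parsed.scheme.lower()
    if scheme not in SUPPORTED_SCHEMES or not parsed.netloc:
        raise ValueError(f"unsupported or incomplete location: {location!r}")
    return {
        "scheme": scheme,
        # For s3:// URLs the network-location part is the bucket name
        "host_or_bucket": parsed.netloc,
        "path": parsed.path.lstrip("/"),
    }


if __name__ == "__main__":
    # The bucket and path below are placeholders, as in the example earlier in this guide
    print(describe_vcf_location("s3://bucket/path/to/listfile.txt"))
```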
Building GenomeVIP¶
GenomeVIP is divided conceptually into server and runtime environments, the manifestations of which depend on several factors, including the expected user base (i.e., casual vs. power users) and method for access (centralized server, personal virtual machine, etc.). Our general recommendation is to install the genomics software systematically into a user-accessible location before building the server, as the server environment contains configuration files (in INI format) that will be modified to point to the locations of these genomics tools.
Runtime Environment (General Instructions)
If the runtime build involves installing an operating system (on hardware) or instantiating a base operating system (on AWS, etc.), you should generally install all package updates.
Select a location for installing software, such as /usr/local, a home directory, common workgroup disk, etc.
Download and install the named software onto your target machine. (Links to software home pages are provided in Further Information.) For example, our Amazon runtime images have key-value entries like
[samtools]
path=/usr/local/bin/samtools/1.2/bin
version=1.2
exe=samtools
CAUTION: Some software carries minimum version requirements for Java. For example, GATK-3.5 does not support JRE 7u51, but we find it does support 7u80.
Modify the paths in GenomeVIP’s configsys/tools.info.* files to agree with your installed software.
On AWS: Create an image (e.g. AWS > EC2 Dashboard > Actions > Image > Create Image) from this instance and note the machine image ID (AMI). This AMI ID can be pre-programmed into the server environment as the default runtime image (see file versions.php). Additionally, if operating system up runtime image.
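After editing the tool configuration files, a quick way to review the entries is to read them with an INI parser. The sketch below parses the [samtools] entry shown earlier in this section using Python’s standard configparser module. Joining path and exe to obtain the executable location is an assumption made for illustration; the guide does not state how GenomeVIP combines these keys. The function name is hypothetical.

```python
import configparser
import posixpath

# The example entry from this guide
EXAMPLE_TOOLS_INFO = """\
[samtools]
path=/usr/local/bin/samtools/1.2/bin
version=1.2
exe=samtools
"""


def tool_executables(ini_text: str) -> dict:
    """Map each tool section to an executable path (assumed to be path/exe)."""
    # interpolation=None keeps any literal '%' characters in paths intact
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        section: posixpath.join(parser[section]["path"], parser[section]["exe"])
        for section in parser.sections()
    }


if __name__ == "__main__":
    for tool, exe in tool_executables(EXAMPLE_TOOLS_INFO).items():
        print(tool, exe)
```

On the target machine, each reported path can then be checked by hand (for example with ls) to confirm that it agrees with the installed software.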
Server Environment (General Instructions)
Install an operating system (on hardware) or instantiate a base operating system (on AWS, etc.)
TIP: the GenomeVIP server has been installed and run successfully under Ubuntu on Amazon EC2, VirtualBox, and OpenStack platforms.
Install all package updates; then install a web server (e.g. Apache), PHP, and mod-php package families
Configure the web server:
- Use HTTPS only
- Disable insecure SSL protocols
- Add/Enable quality SSL ciphers
- Allow the directory serving GenomeVIP to run PHP scripts
Install GenomeVIP:
Download the application to the serving directory
git clone https://github.com/ding-lab/GenomeVIP.git
Post-installation
- Create an image (e.g. AWS > EC2 Dashboard > Actions > Image > Create Image) to preserve the installation for future use.
Extending GenomeVIP¶
GenomeVIP’s server and runtime environments can be customized, updated, or extended. Some examples of how developers might extend its capabilities are given below.
- Updates to tools or to the operating system not requiring user interface changes in the server environment can be readily carried out. On AWS, a new runtime image would be generated whose AMI ID may be furnished to the server as an alternative image via Options.
- Tool updates (or new tools) requiring user interface modification: a full description is too large to fit in the margin, but suffice it to say the process involves steps such as installing the tool in the runtime environment, updating path information in the server environment (see files configsys/tools.info.*), and modifying the various HTML, PHP, and text files underlying the user interface content and functionality. Insight into this process may be gleaned from the GenomeVIP repository’s commit 08e22db entitled “added gatk module”. Finally, on AWS, new images of both environments would then be generated.
- Custom/New execution profiles may be installed into the server environment and added to the list of options available to users (see the profile files *.prof inside directories configsys/profiles/ and configsys/run_modes/). On AWS, a new server image would need to be generated.
How to Cite GenomeVIP¶
In published works:
Mashl RJ, Scott AD, Huang KL, Wyczalkowski MA, Yoon CJ, Niu B, DeNardo E, Yellapantula VD, Handsaker RE, Chen K, Koboldt DC, Ye K, Fenyö D, Raphael BJ, Wendl MC, Ding L. GenomeVIP: A Cloud Platform for Genomic Variant Discovery and Interpretation.
In electronic documents:
Further Information¶
Detection Tools¶
Supporting Tools¶
bam-readcount | https://github.com/genome/bam-readcount/ |
Java SE Runtime Environment | http://www.oracle.com/technetwork/java/javase/downloads/ |
S3 Tools | http://s3tools.org/s3cmd/ |
SAMtools | http://www.htslib.org/ |
SnpEff | http://snpeff.sourceforge.net/ |
StarCluster | http://star.mit.edu/cluster/ |
Variant Effect Predictor | http://ensembl.org/info/docs/tools/vep/ |
VCF tools | https://vcftools.github.io/ |
Amazon AWS Documentation¶
Documentation home | https://aws.amazon.com/documentation/ |
EC2 home | https://aws.amazon.com/documentation/ec2/ |
S3 home | https://aws.amazon.com/documentation/s3/ |
EBS home | http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html |