RefComp
/------------------------------------------------------------------------------
| |
| Program: Refcomp |
| Version: 4.x |
| Copyright (C) 1996-1997 |
| by Deborah A. Nickerson, Natali Kolker, Scott Taylor, and Mark Rieder |
| University of Washington |
| |
| All rights reserved. |
| |
| This software may not be redistributed, distributed |
| in modified form, or used for any commercial purpose, including |
| commercially funded sequencing, without written permission from |
| the authors and the University of Washington. |
| |
| This software is provided ``AS IS'' and any express or implied |
| warranties, including, but not limited to, the implied warranties of |
| merchantability and fitness for a particular purpose are disclaimed. |
| In particular, this disclaimer applies to any diagnostic purpose. In no |
| event shall the authors or the University of Washington be liable for |
| any direct, indirect, incidental, special, exemplary, or consequential |
| damages (including, but not limited to, procurement of substitute goods |
| or services; loss of use, data, or profits; or business interruption) |
| however caused and on any theory of liability, whether in contract, |
| strict liability, or tort (including negligence or otherwise) arising |
| in any way out of the use of this software, even if advised of the |
| possibility of such damage. |
| |
------------------------------------------------------------------------------/
DESCRIPTION
-----------
Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA
sequence variation in the human genome. The identification and typing of
these variations play a central role in analyzing the relationships between
genome structure and function, and in understanding the allelic variation
within and among populations. The typing of SNPs also plays a role in
identifying mutated oncogenes and genetic and infectious diseases, in
matching tissues prior to transplantation, and in forensic and paternity
testing.
Many techniques are used to identify sequence variants among different
individuals using DNA amplified by the polymerase chain reaction (PCR). These
include denaturing gel electrophoresis, chemical or enzymatic cleavage,
heteroduplex analysis, the analysis of single-stranded DNA conformations, and
direct sequencing of the PCR product. Of these methods, direct DNA sequencing
is the most accurate and the most automated approach for scanning DNA fragments
for variation. Furthermore, it is the only method that provides complete
information about the location and nature of the sequence variation using a
single set of reagents and assay conditions. Despite the advantages of
sequencing PCR products to identify DNA variations, there is one drawback:
it is difficult to accurately identify heterozygous sites within a sequence.
By comparing sequence traces containing variant sites for homozygotes and
heterozygotes we have noted two consistent changes: 1) a significant drop in
fluorescence peak height at a variant site when sequence traces obtained from
homozygous individuals are compared to traces from heterozygous individuals,
and 2) the presence of a second fluorescence peak in sequence traces from
heterozygous individuals (1, 2). We have developed a program known as
PolyPhred to scan for these two features when sequence traces are being
compared (2).
Refcomp (6) was designed to analyze sequencing traces which contain data from
strictly homozygous samples (e.g. cloned DNA, mitochondrial DNA, etc.). This data
represents a special case which can be analyzed for mismatches with a known
reference sequence. Refcomp will determine the high quality positions within an
assembled DNA contig and produce a report listing sites which differ from a defined
reference sequence.
HOW REFCOMP WORKS
-----------------
Refcomp is designed as a member of an integrated suite of sequence analysis
applications which includes Phred (3), Phrap (4), and Consed (5),
and is not a stand-alone program. Phred provides the base-calls, base-call
quality information and the peak size information. Phrap is used to assemble
the input sequences into one or more contigs.
Refcomp will parse contig alignment data files (.ace files), produced by the
program Phrap, and identify sites in the consensus sequence which differ from
a defined reference sequence. Refcomp will generate data tags for the
mismatched sites and the Consed program is used to visually review the Refcomp
output. The Refcomp output is also written to a file in a format that can be
easily parsed into a database program.
COMMAND LINE
------------
1) -a (or -ace) [name of ace file]
The user must supply the name of the ace file containing the
assembly information for the assembly to be scanned. Required.
2) -d (or -dir) [directory name]
The user can supply the name of the project directory, i.e. the
directory that contains edit_dir (see EXAMPLES).
Optional. Default "../"
3) -q (or -quality) [number between 0 and 50]
The Phred quality value used to determine search limits and
to limit the base mismatches reported in the output.
Optional. Default 20.
4) -n (or -nav) [name of navigation file]
Refcomp generates a navigation file, which assists in locating marked
sites in the Consed program. The -n flag allows the user to specify the
name of this file.
Optional. Default "refcomp.nav"
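The options above can be assembled programmatically. The following is a
minimal Python sketch, not part of Refcomp; the function name
build_refcomp_args is hypothetical, and it assumes only the four flags and
the quality range documented above, with "refcomp" available on the PATH.

```python
from typing import List, Optional


def build_refcomp_args(
    ace_file: str,
    directory: Optional[str] = None,
    quality: Optional[int] = None,
    nav_file: Optional[str] = None,
) -> List[str]:
    """Return an argument list for refcomp using the documented flags.

    Options left as None are omitted so that Refcomp applies its own
    defaults ("../", 20 and "refcomp.nav").
    """
    if not ace_file:
        raise ValueError("an ace file name is required (-a)")
    args = ["refcomp", "-a", ace_file]
    if directory is not None:
        args += ["-d", directory]
    if quality is not None:
        # The documented range for -q is 0 to 50.
        if not 0 <= quality <= 50:
            raise ValueError("quality must be between 0 and 50")
        args += ["-q", str(quality)]
    if nav_file is not None:
        args += ["-n", nav_file]
    return args
```

For example, build_refcomp_args("gene.fasta.screen.ace.1", quality=30)
returns a list that can be passed to subprocess.run.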
OUTPUT
------
Refcomp writes output both to the standard output port and the standard error
port. The data written to the standard output contains information about the
analysis, such as the location of each mismatched site and the reads that
cover it. The standard error stream contains messages about
the progress of Refcomp. This feature allows the standard output to be redirected
to a file, while the progress messages are printed on the screen.
In the standard output, data segments are separated by beginning and ending tokens
to make parsing of the data easy.
* NOTE: The organization of the standard output has changed with Refcomp version 4.0.
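Because segments are delimited by tokens, a small parser can pull one
segment out of a saved output file. The Python sketch below is illustrative
only. It ASSUMES that every segment is bracketed by lines starting with
BEGIN_<NAME> and END_<NAME>; only the BEGIN_CONTIG token is named in this
document, so check the other token names against real output. The function
name extract_segment is hypothetical.

```python
from typing import List


def extract_segment(lines: List[str], name: str) -> List[str]:
    """Return the lines between BEGIN_<name> and END_<name>, tokens excluded.

    ASSUMPTION: tokens are the first whitespace-separated field on their
    line. An empty list is returned if the segment is not found.
    """
    begin, end = f"BEGIN_{name}", f"END_{name}"
    body: List[str] = []
    inside = False
    for line in lines:
        fields = line.split()
        first = fields[0] if fields else ""
        if first == begin:
            inside = True
            continue
        if first == end and inside:
            break
        if inside:
            body.append(line.rstrip("\n"))
    return body
```

With a saved report, extract_segment(open("gene.output").readlines(),
"SITES") would return the rows of the SITES segment if the assumed token
names match.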
DATA SEGMENTS IN THE STANDARD OUTPUT
------------------------------------
1) The first line contains the name of the program and the version. This line
is followed by a THUMBPRINT and DATE/TIME stamp that uniquely identifies
a particular run.
2) COMMAND_LINE
This segment contains the parameters used for the run.
If optional parameters were not supplied on the command line, then
the defaults are indicated.
3) CONTIG
This segment contains the information for a contig. The name of the
contig follows the BEGIN_CONTIG token.
4) SITES
This segment contains information for each mismatched site, written into
columns. Columns 1 and 2 contain the position of the mismatched site
in the consensus sequence and reference sequence, respectively.
Column 3 contains the consensus sequence phrap quality at the site.
Columns 4 and 5 indicate whether there is a read covering that site in the
forward (F) and reverse (R) direction. Column 6 indicates whether the site
is confirmed (C) by having reads in both directions. Columns 7 and 8 contain
the base at the site in the consensus sequence and the reference sequence,
respectively. In the case of a gap, a dash (-) is shown in one of the
columns and one or more bases covering the gap appear in the other column.
5) TOTAL_SITES
Following this token is the total number of mismatched sites.
6) READS
For each mismatched site, this segment contains information obtained from
the reads that cover the site. Columns 1, 2 and 3 contain the position of
the mismatched site in the consensus sequence, reference sequence and the
read sequence, respectively. Columns 4 and 5 contain the phred and phrap
qualities at the site, respectively. Column 6 contains the name of a
chromatogram file that covers the site.
7) COVERAGE
This segment reports the regions of the contig in which the average
quality is greater than or equal to the quality parameter. Columns 1
and 2 contain the start and end positions on the consensus sequence.
Column 3 contains the length of the region.
8) TOTAL_COVERAGE
This token is followed by the sum of the region lengths listed in the
COVERAGE segment.
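The column layouts described above map naturally onto a small record type.
The Python sketch below is illustrative only. It ASSUMES that columns are
separated by whitespace, that a SITES row has exactly eight fields, and that
columns 4 to 6 hold some other placeholder when the F, R or C flag is
absent; none of this is specified here. The names Site, parse_site_row and
coverage_total are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    consensus_pos: int    # column 1
    reference_pos: int    # column 2
    phrap_quality: int    # column 3
    forward: bool         # column 4 is "F"
    reverse: bool         # column 5 is "R"
    confirmed: bool       # column 6 is "C"
    consensus_base: str   # column 7 ("-" for a gap)
    reference_base: str   # column 8 ("-" for a gap)


def parse_site_row(row: str) -> Site:
    """Parse one whitespace-separated row of the SITES segment."""
    f = row.split()
    if len(f) != 8:
        raise ValueError(f"expected 8 columns, got {len(f)}: {row!r}")
    return Site(int(f[0]), int(f[1]), int(f[2]),
                f[3] == "F", f[4] == "R", f[5] == "C", f[6], f[7])


def coverage_total(rows: List[str]) -> int:
    """Sum column 3 (region length) over the rows of a COVERAGE segment.

    The result should equal the value that follows TOTAL_COVERAGE.
    """
    return sum(int(r.split()[2]) for r in rows if r.strip())
```

Comparing coverage_total against the reported TOTAL_COVERAGE value is a
cheap sanity check that a report was parsed completely.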
SCREEN OUTPUT
-------------
Refcomp reports its progress to the screen as it runs. Messages reporting any errors
that occur will also appear on the screen.
USING YOUR OWN DATA
-------------------
These instructions will allow you to take chromatograms, analyze them with Phred and Phrap
and run Refcomp on the .ace file.
1) Install Refcomp (see the README file).
2) Create the following directory structure in your working directory
./chromat_dir
./edit_dir
./phd_dir
3) Move your chromats into 'chromat_dir'
4) A .phd file of your reference sequence is needed for Refcomp to identify homozygous
differences from the consensus sequence assembled by Phrap. The simplest way to make
this file is to run the program "mktrace", which is part of the Consed installation
package. "mktrace" requires a FASTA input file and will output a "pseudo" .phd file.
Once the .phd file is generated it should be copied into the 'phd_dir' prior to
assembly of your chromatograms using Phrap. See the notes below for more information
on creating a "reference sequence".
5) Change directory into 'edit_dir'
6) Type 'phredPhrap'. This step will run phred and phrap.
7) Run refcomp.
8) Run consed to view the assembly.
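Step 2 above can be scripted. The following Python sketch creates the three
directories named in that step; the function name make_project_dirs is
hypothetical, and the sketch does not run Phred, Phrap, Refcomp or Consed.

```python
from pathlib import Path
from typing import List

# Directory names taken from step 2 above.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir")


def make_project_dirs(project: Path) -> List[Path]:
    """Create chromat_dir, edit_dir and phd_dir under the project directory.

    Existing directories are left untouched.
    """
    created = []
    for name in SUBDIRS:
        path = project / name
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling make_project_dirs(Path("/home/gene")) would produce the layout
expected by the remaining steps.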
EXAMPLES
--------
/home/gene ---|---edit_dir----|--- gene.fasta.screen.ace.1
|---chromat_dir
|---phd_dir
1) You can run Refcomp from any directory by typing :
refcomp -d /home/gene -a gene.fasta.screen.ace.1 > gene.output
or
refcomp -d /home/gene -a *.ace.1 > gene.output
or you can run Refcomp from the gene's edit_dir directory:
cd /home/gene/edit_dir/
refcomp -a *.ace.1 > gene.output
2) If you would like to run Refcomp with a different quality threshold
you may use the command:
refcomp -a *.ace.1 -q 30 > gene.output
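The same invocations can be driven from a script. In the shell commands
above, "*.ace.1" is expanded by the shell and ">" redirects the standard
output; a script has to do both itself. The Python sketch below is one way
to do so under those assumptions. The function names find_ace_files and
run_to_file are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def find_ace_files(edit_dir: Path, pattern: str = "*.ace.1") -> List[Path]:
    """Expand the wildcard that the shell would expand in the examples."""
    return sorted(edit_dir.glob(pattern))


def run_to_file(cmd: Sequence[str], out_path: Path, cwd: Path) -> int:
    """Run cmd in cwd, writing its standard output to out_path.

    Standard error is not redirected, so progress messages still appear
    on the screen. Returns the exit status of the command.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(list(cmd), stdout=out, cwd=cwd)
    return result.returncode


# Example use, assuming refcomp is installed and on the PATH:
#   edit_dir = Path("/home/gene/edit_dir")
#   ace = find_ace_files(edit_dir)[0]
#   run_to_file(["refcomp", "-a", ace.name, "-q", "30"],
#               edit_dir / "gene.output", edit_dir)
```

Passing the command as a list avoids shell quoting problems with unusual
file names.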
NOTES ON ANALYSIS
-----------------
The program Refcomp relies on the comparison of the consensus sequence
generated by Phrap and a "reference sequence" which assembles in the same
contig. To create a reference sequence, you need a file containing your
reference sequence of interest. This sequence can then be formatted to
look like a .phd file (as described above) by manually creating this file
or using a program to convert your reference sequence to standard .phd format.
The reference sequence file must then be placed in the phd_dir so that Phrap can
include the sequence in the assembly.
* Note: To be recognized as a reference sequence, the name of the reference sequence must
contain the character string "REF" in upper or lower case. Examples of other
reference sequences for the mitochondrial genome can be found at:
http://droog.mbt.washington.edu
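The naming rule in the note can be checked before running an assembly. The
Python sketch below is a guess at an equivalent test, not Refcomp's own
code: it ASSUMES that a case-insensitive substring match is what is meant,
so mixed-case spellings such as "Ref" are also accepted. The function name
is_reference_name is hypothetical.

```python
def is_reference_name(read_name: str) -> bool:
    """Return True if the name contains "REF" in any letter case.

    ASSUMPTION: Refcomp's own matching may be stricter (for example,
    only "REF" or "ref"); this check is deliberately permissive.
    """
    return "ref" in read_name.lower()
```

For instance, is_reference_name("mito_REF.phd") is True, while
is_reference_name("sample01.phd") is False.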
Refcomp tags all mismatched sites with the specific "polymorphism" tag found
in Consed. This enables the users to easily and rapidly navigate between sites
and allows for easy verification by the data analyst.
Refcomp has a default setting of quality 20 for tagging mismatched sites.
This quality setting has been used in previous studies for finding homoplasmic
polymorphisms in the mitochondrial genome with single pass sequence coverage
(6).
REFERENCES
----------
(1) Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A.,
1994, "Comparative analysis of human DNA variations by fluorescence-
based sequencing of PCR products", Genomics, 25, 615-622.
(2) Nickerson, D.A., Tobe, V.O., Taylor, S.L., 1997 "PolyPhred: automating
the detection and genotyping of single nucleotide substitutions using
fluorescence-based resequencing" Nucleic Acids Res.
25(14): 2745-2751.
(3) Ewing, B. and Green, P., 1992, Phred, unpublished.
http://www.genome.washington.edu/
(4) Green, P., 1994, Phrap, unpublished.
http://www.genome.washington.edu/
(5) Gordon, D., 1995, Consed. unpublished.
http://www.genome.washington.edu
(6) Rieder, M.J., Taylor, S.L., Tobe, V.O., and Nickerson, D.A.,
1998 "Automating the identification of DNA variations using
quality-based fluorescence re-sequencing: analysis of the human
mitochondrial genome", Nucleic Acids Res. 26(4):967-973.