RefComp



/------------------------------------------------------------------------------
|                                                                             |
|  Program: Refcomp                                                           |
|  Version: 4.x                                                               |
|  Copyright (C) 1996-1997                                                    |
|  by Deborah A. Nickerson, Natali Kolker, Scott Taylor, and Mark Rieder      |
|  University of Washington                                                   |
|                                                                             |
|  All rights reserved.                                                       |
|                                                                             |
|  This software may not be redistributed, distributed                        |
|  in modified form, or used for any commercial purpose, including            |
|  commercially funded sequencing, without written permission from            |
|  the authors and the University of Washington.                              |
|                                                                             |
|  This software is provided ``AS IS'' and any express or implied             |
|  warranties, including, but not limited to, the implied warranties of       |
|  merchantability and fitness for a particular purpose are disclaimed.       |
|  In particular, this disclaimer applies to any diagnostic purpose. In no    |
|  event shall the authors or the University of Washington be liable for      |
|  any direct, indirect, incidental, special, exemplary, or consequential     |
|  damages (including, but not limited to, procurement of substitute goods    |
|  or services; loss of use, data, or profits; or business interruption)      |
|  however caused and on any theory of liability, whether in contract,        |
|  strict liability, or tort (including negligence or otherwise) arising      |
|  in any way out of the use of this software, even if advised of the         |
|  possibility of such damage.                                                |
|                                                                             |
------------------------------------------------------------------------------/


DESCRIPTION
-----------

  Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA
  sequence variation in the human genome.  The identification and typing of
  these variations play a central role in analyzing the relationships between
  genome structure and function, and in understanding the allelic variation
  within and among populations.  The typing of SNPs also plays a role in
  identifying mutated oncogenes and genetic and infectious diseases, in
  matching tissues prior to transplantation, and in forensic and paternity
  testing.

  Many techniques are used to identify sequence variants among different
  individuals using DNA amplified by the polymerase chain reaction (PCR).  These
  include denaturing gel electrophoresis, chemical or enzymatic cleavage, 
  heteroduplex analysis, the analysis of single-stranded DNA conformations, and
  direct sequencing of the PCR product.  Of these methods, direct DNA sequencing
  is the most accurate and the most automated approach for scanning DNA fragments
  for variation.  Furthermore, it is the only method that provides complete
  information about the location and nature of the sequence variation using a
  single set of reagents and assay conditions.  Despite the advantages of 
  sequencing PCR products to identify DNA variations, there is one drawback:
  it is difficult to accurately identify heterozygous sites within a sequence.
  By comparing sequence traces containing variant sites for homozygotes and
  heterozygotes we have noted two consistent changes: 1) a significant drop in
  fluorescence peak height at a variant site when sequence traces obtained from
  homozygous individuals are compared to traces from heterozygous individuals,
  and 2) the presence of a second fluorescence peak in sequence traces from
  heterozygous individuals (1, 2).  We have developed a program known as 
  PolyPhred to scan for these two features when sequence traces are being 
  compared (2). 

  Refcomp (6) was designed to analyze sequencing traces that contain data from
  strictly homozygous samples (e.g., cloned DNA or mitochondrial DNA).  Such data
  represent a special case that can be analyzed for mismatches with a known
  reference sequence.  Refcomp determines the high-quality positions within an
  assembled DNA contig and produces a report listing the sites that differ from a
  defined reference sequence.


HOW REFCOMP WORKS
-----------------

  Refcomp is not a stand-alone program.  It is designed as a member of an
  integrated suite of sequence analysis applications that includes Phred (3),
  Phrap (4), and Consed (5).  Phred provides the base calls, the base-call
  quality information, and the peak size information.  Phrap is used to assemble
  the input sequences into one or more contigs.

  Refcomp will parse contig alignment data files (.ace files), produced by the
  program Phrap, and identify sites in the consensus sequence which differ from
  a defined reference sequence.  Refcomp will generate data tags for the
  mismatched sites and the Consed program is used to visually review the Refcomp
  output.  The Refcomp output is also written to a file in a format that can be
  easily parsed into a database program.  


COMMAND LINE
------------

  1)  -a (or -ace) [name of ace file]

      The user must supply the name of the ace file containing the
      assembly information for the assembly to be scanned.  Required.

  2)  -d (or -dir) [directory name]

      The user can supply the name of the project directory, that is, the
      directory that contains 'edit_dir' (see EXAMPLES).
      Optional.  Default "../"

  3)  -q (or -quality) [number between 0 and 50]

      The Phred quality value used to determine search limits and
      to limit the base mismatches reported in the output.
      Optional.  Default 20.

  4)  -n (or -nav) [name of navigation file]

      Refcomp generates a navigation file, which assists in locating marked
      sites in the Consed program.  The -n flag allows the user to specify the
      name of this file.
      Optional.  Default "refcomp.nav"
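
  The options above can be combined on one command line.  The following is a
  minimal sketch of a wrapper that checks the -q value before building the
  command.  The helper name check_quality and the file name gene.nav are
  hypothetical, and the sketch assumes the quality value is a whole number.
  It prints the command rather than running it, so it works even where
  refcomp is not installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Refcomp): succeeds only when its argument
# is a whole number in the range 0..50 accepted by -q.
check_quality() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
    esac
    [ "$1" -le 50 ]
}

q=30
if check_quality "$q"; then
    # Print the command instead of running it.
    echo "refcomp -a gene.fasta.screen.ace.1 -q $q -n gene.nav"
else
    echo "quality must be a whole number between 0 and 50" >&2
fi
```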


OUTPUT
------

  Refcomp writes output both to the standard output port and the standard error
  port.  The data written to the standard output contains information about the 
  analysis, such as the location of a putative polymorphism and the putative 
  genotypes for each sample.  The standard error stream contains messages about 
  the progress of Refcomp.  This feature allows the standard output to be redirected 
  to a file, while the progress messages are printed on the screen.
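
  The separation of the two streams can be sketched as follows.  Because
  refcomp may not be installed, the sketch uses a stand-in shell function,
  run_refcomp, whose two output lines are invented placeholders and not real
  Refcomp messages.  The file names gene.output and gene.log are also only
  illustrative.

```shell
#!/bin/sh
# Stand-in for refcomp, used only so this sketch runs without the program:
# it writes one line to standard output and one line to standard error.
run_refcomp() {
    echo "report line (placeholder)"
    echo "progress message (placeholder)" >&2
}

# Work in a scratch directory so no real files are overwritten.
work=$(mktemp -d)
cd "$work" || exit 1

# Standard output goes to the report file.  Standard error is captured in a
# log file here; without the "2>" part it would appear on the screen.
run_refcomp > gene.output 2> gene.log

cat gene.output
```

  With the real program, the same redirection would be written after the
  refcomp command line, as in the EXAMPLES section.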

  In the standard output, data segments are separated by beginning and ending tokens
  to make parsing of the data easy.

  * NOTE: The organization of the standard output has changed with Refcomp version 4.0.


DATA SEGMENTS IN THE STANDARD OUTPUT
------------------------------------

  1)  The first line contains the name of the program and the version.  This line
      is followed by a THUMBPRINT and DATE/TIME stamp that uniquely identifies
      a particular run.

  2)  COMMAND_LINE
      This segment contains the parameters used for the run.
      If optional parameters were not supplied on the command line, then
      the defaults are indicated.

  3)  CONTIG
      This segment contains the information for a contig.  The name of the 
      contig follows the BEGIN_CONTIG token.

  4)  SITES
      This segment contains information for each mismatched site, written in
      columns.  Columns 1 and 2 contain the position of the mismatched site
      in the consensus sequence and reference sequence, respectively.
      Column 3 contains the consensus sequence Phrap quality at the site.
      Columns 4 and 5 indicate whether there is a read covering that site in the
      forward (F) and reverse (R) direction.  Column 6 indicates whether the site
      is confirmed (C) by having reads in both directions.  Columns 7 and 8 contain
      the base at the site in the consensus sequence and the reference sequence,
      respectively.  In the case of a gap, a dash (-) is shown in one of the
      columns and one or more bases covering the gap appear in the other column.
        
  5)  TOTAL_SITES
      Following this token is the total number of mismatched sites.

  6)  READS
      For each mismatched site, this segment contains information obtained from
      the reads that cover the site.  Columns 1, 2 and 3 contain the position of
      the mismatched site in the consensus sequence, reference sequence and the
      read sequence, respectively.  Columns 4 and 5 contain the Phred and Phrap
      qualities at the site, respectively.  Column 6 contains the name of a
      chromatogram file that covers the site.

  7)  COVERAGE
      This segment reports the regions of the contig in which the average 
      quality is greater than or equal to the quality parameter.  Columns 1
      and 2 contain the start and end positions on the consensus sequence.
      Column 3 contains the length of the region.

  8)  TOTAL_COVERAGE
      This token is followed by the sum of the region lengths listed in the
      COVERAGE segment.
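
  The column layout described above lends itself to simple filtering with awk.
  The sketch below is illustrative only.  It assumes that segments are
  delimited by tokens named BEGIN_SITES/END_SITES and BEGIN_COVERAGE/
  END_COVERAGE on lines of their own (by analogy with BEGIN_CONTIG; check the
  actual token names in your output), that columns are separated by white
  space, and that "-" marks an absent flag.  The sample data are invented
  placeholders, not real Refcomp output.

```shell
#!/bin/sh
# Invented sample in the assumed layout (not real Refcomp output).
sample='BEGIN_SITES
120 118 45 F R C A G
310 308 32 F - - T C
END_SITES
BEGIN_COVERAGE
25 400 376
450 600 151
END_COVERAGE'

# Print consensus position, consensus base and reference base for every
# site whose column 6 is "C" (confirmed by reads in both directions).
confirmed_sites() {
    awk '$1 == "BEGIN_SITES" { inside = 1; next }
         $1 == "END_SITES"   { inside = 0 }
         inside && $6 == "C" { print $1, $7, $8 }'
}

# Sum column 3 (region length) over the COVERAGE segment; the result should
# agree with the value that follows the TOTAL_COVERAGE token.
coverage_total() {
    awk '$1 == "BEGIN_COVERAGE" { inside = 1; next }
         $1 == "END_COVERAGE"   { inside = 0 }
         inside                 { total += $3 }
         END                    { print total + 0 }'
}

printf '%s\n' "$sample" | confirmed_sites
printf '%s\n' "$sample" | coverage_total
```

  For real data, the output file (for example gene.output) would be piped
  into the functions in place of the sample text.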


SCREEN OUTPUT
-------------

  Refcomp reports its progress to the screen as it runs.  Messages reporting any errors
  that occur will also appear on the screen.


USING YOUR OWN DATA
-------------------

  These instructions will allow you to take chromatograms, analyze them with Phred and Phrap 
  and run Refcomp on the .ace file.  

  1)  Install Refcomp (see the README file).

  2)  Create the following directory structure in your working directory

        ./chromat_dir
        ./edit_dir
        ./phd_dir

  3)  Move your chromats into 'chromat_dir'

  4)  Refcomp needs a .phd file of your reference sequence in order to identify
      homozygous differences between that reference and the consensus sequence
      assembled by Phrap.  The simplest way to make this file is to run the
      program "mktrace", which is part of the Consed installation package.
      "mktrace" requires a FASTA input file and outputs a "pseudo" .phd file.
      Once the .phd file is generated, copy it into 'phd_dir' before assembling
      your chromatograms with Phrap.  See the notes below for more information
      on creating a "reference sequence".

  5)  Change directory into 'edit_dir'

  6)  Type 'phredPhrap'.  This step will run phred and phrap.

  7)  Run refcomp.

  8)  Run consed to view the assembly.
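
  The steps above can be sketched as a short shell script.  This is an
  outline under assumptions, not a tested pipeline: it works in a scratch
  directory, the source paths in the commented lines are hypothetical, and the
  assembly and analysis steps run only if phredPhrap and refcomp are found on
  the PATH.

```shell
#!/bin/sh
# Work in a scratch directory so the sketch does not touch real data.
work=$(mktemp -d)
cd "$work" || exit 1

# Step 2: create the three directories used by the workflow.
mkdir -p chromat_dir edit_dir phd_dir

# Steps 3 and 4 (hypothetical source paths, shown as comments only):
#   mv /path/to/chromats/* chromat_dir/
#   cp /path/to/reference.phd.1 phd_dir/

# Step 5: change into edit_dir.
cd edit_dir || exit 1

# Steps 6 and 7: run the tools only where they are installed.
if command -v phredPhrap >/dev/null 2>&1 && command -v refcomp >/dev/null 2>&1; then
    phredPhrap
    refcomp -a *.ace.1 > gene.output
else
    echo "phredPhrap or refcomp not found; skipping assembly and analysis" >&2
fi
```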



EXAMPLES
--------

  /home/gene ---|---edit_dir----|--- gene.fasta.screen.ace.1 
                |---chromat_dir
                |---phd_dir 


  1)  You can run Refcomp from any directory by typing:

        refcomp -d /home/gene -a gene.fasta.screen.ace.1 > gene.output

      or

        refcomp -d /home/gene -a *.ace.1 > gene.output

      or you can run Refcomp from the gene's edit_dir directory:

        cd /home/gene/edit_dir/
        refcomp -a *.ace.1 > gene.output
     
  2)  If you would like to run Refcomp with a different quality threshold
      you may use the command:

        refcomp -a *.ace.1 -q 30 > gene.output


NOTES ON ANALYSIS
-----------------

  Refcomp relies on the comparison of the consensus sequence generated by
  Phrap with a "reference sequence" that assembles into the same contig.  To
  create a reference sequence, you need a file containing your reference
  sequence of interest.  This sequence can then be formatted to look like a
  .phd file (as described above), either by creating the file manually or by
  using a program to convert your reference sequence to standard .phd format.
  The reference sequence file must then be placed in the phd_dir so that Phrap
  can include the sequence in the assembly.

  * Note:  To be recognized as a reference sequence, the name of the reference sequence must
  contain the character string "REF" in upper or lower case.  Examples of other 
  reference sequences for the mitochondrial genome can be found at:

    http://droog.mbt.washington.edu
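
  The naming rule can be checked before assembly with a small shell test.
  This is a minimal sketch: the helper name is_reference_name and the file
  names in the loop are hypothetical.  The pattern also accepts mixed case
  such as "Ref", which the rule above does not explicitly confirm.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when the name contains "REF" in any mix of
# upper and lower case.
is_reference_name() {
    case "$1" in
        *[Rr][Ee][Ff]*) return 0 ;;
        *)              return 1 ;;
    esac
}

# Hypothetical names, used only to show both outcomes.
for name in mito_REF.phd.1 sample01.phd.1 ref_gene.phd.1; do
    if is_reference_name "$name"; then
        echo "$name: reference"
    else
        echo "$name: read"
    fi
done
```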

  Refcomp tags all mismatched sites with the specific "polymorphism" tag found
  in Consed.  This enables users to navigate rapidly between sites and allows
  for easy verification by the data analyst.

  Refcomp has a default setting of quality 20 for tagging mismatched sites.
  This quality setting has been used in a previous study for finding homoplasmic
  polymorphisms in the mitochondrial genome with single-pass sequence coverage
  (6).


REFERENCES
----------

(1)  Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A.,
     1994, "Comparative analysis of human DNA variations by fluorescence-
     based sequencing of PCR products", Genomics, 25, 615-622.

(2)  Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "PolyPhred: automating
     the detection and genotyping of single nucleotide substitutions using
     fluorescence-based resequencing", Nucleic Acids Res., 25(14), 2745-2751.

(3)  Ewing, B. and Green, P., 1992, Phred, unpublished.
     http://www.genome.washington.edu/

(4)  Green, P., 1994, Phrap, unpublished.
     http://www.genome.washington.edu/

(5)  Gordon, D., 1995, Consed, unpublished.
     http://www.genome.washington.edu

(6)  Rieder, M.J., Taylor, S.L., Tobe, V.O., and Nickerson, D.A., 1998,
     "Automating the identification of DNA variations using quality-based
     fluorescence re-sequencing: analysis of the human mitochondrial genome",
     Nucleic Acids Res., 26(4), 967-973.