Documentation for PolyPhred Version 4.0

Program: PolyPhred
Version: 4.0
Copyright (C) 2002-2006
by Deborah A. Nickerson, Scott Taylor, Natali Kolker and Jim Sloan
University of Washington

All rights reserved.

This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.

This software is provided ``AS IS'' and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


Contents

  1. Description of Features
  2. Setup and Operating Instructions
  3. More Information


Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations play a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.

Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.

Detection of heterozygous sequences is based on finding: (1) a significant drop in fluorescence peak height at a variant site when sequence traces obtained from homozygous individuals are compared to traces from heterozygous individuals (theoretical drop is expected to be 50%), and (2) the presence of a second fluorescence peak in sequence traces from heterozygous individuals (see references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2).


How PolyPhred Works

PolyPhred identifies substitution SNPs as potential heterozygotes by comparing traces in a sequence assembly. PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).

Phred provides the base-calls, base-call quality information and the peak size information. The information is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It begins by reading the ACE file, and uses this file to locate all of the other needed files. When running PolyPhred, the user has the option of either specifying the ACE file explicitly using the -ace flag in the command line (see The Flags), or allowing PolyPhred to locate the ACE file.

PolyPhred identifies SNP sites among the traces and assigns a rank indicating how well the trace at a site matches the expected pattern for a SNP (see How PolyPhred ranks SNP sites). After PolyPhred identifies the putative heterozygous sites, it updates the ACE and PHD files by adding tags that indicate the positions of the sites. The Consed program can then be used to examine the tagged sites. PolyPhred also generates a detailed report listing positions, genotypes and ranks of polymorphic sites in a format that can be easily parsed into a database program.


What is new in PolyPhred Version 4.0

  • There are several new command-line flags for controlling the operation of PolyPhred. All flags, including the -ace flag, are now optional.

  • New blocks have been added to the output report to provide more information.

  • Three types of manually-applied tags can be captured from ACE and PHD files and reported in the output.

  • If a reference sequence is used in the assembly process with Phred and Phrap, PolyPhred can report the positions of polymorphic sites relative to the reference sequence.

  • PolyPhred can be customized to suit user preferences by creating a .polyphredrc file. This file can be used to set defaults to preferred values, and to identify manual tags for manual tag tracking.

  • PolyPhred can generate a navigation file that facilitates visualization of polymorphic sites within the Consed application.

  • PolyPhred can search sample sequences for evidence of insertion/deletion (indel) polymorphisms.

  • PolyPhred now runs about three times faster than previous versions, and uses about one-third the memory.


    The Flags

    Parameters and options that govern the operation of PolyPhred can be changed using command-line flags. All of the flags are optional. If no flags are used, PolyPhred uses the ACE file with the highest number at the end of the file name, and all other parameters are set to the defaults.

    Differences with PolyPhred version 3.5 are indicated in red. Note that the -background (-b) and -ratio flags, which were used in previous versions of PolyPhred, are no longer functional in version 4.0. See the section How PolyPhred ranks SNP sites for a discussion on how to alter the way PolyPhred ranks sites.

    Many of the flags have an abbreviated form, which is shown in parentheses. For those flags that take an argument, the argument is shown in square brackets ([ ]). Optional arguments are indicated in green.

    -ace (-a) [ace file]
    Specifies the ACE file. If this flag is omitted, PolyPhred uses the ACE file with the highest final number in the file name.

  • This flag is now optional.
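
The default ACE file selection described above can be sketched as follows. This is only an illustration of the rule "highest final number in the file name", not PolyPhred's actual code. The function name `pick_latest_ace` is hypothetical, and the assumption that ACE file names end in ".ace.N" is based on the ".ace.1" extension mentioned in this document.

```python
import re
from pathlib import Path


def pick_latest_ace(edit_dir):
    """Return the ACE file whose name ends in the highest number, or None.

    Assumption: ACE files are named "<something>.ace.<number>".
    """
    best = None
    best_num = -1
    for path in Path(edit_dir).iterdir():
        match = re.search(r"\.ace\.(\d+)$", path.name)
        # Compare numerically, so that ".ace.10" beats ".ace.2".
        if match and int(match.group(1)) > best_num:
            best_num = int(match.group(1))
            best = path
    return best
```

A helper like this can be useful for confirming which ACE file will be analyzed before running PolyPhred without the -ace flag.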

    -block [list of block names]
    Causes PolyPhred to include or exclude from the output report the specified blocks (i.e. POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, MANUALGENOTYPE, VERIFIED, SAMPLE and COVERAGE). To include a block, immediately precede the block name with a plus (+). To exclude a block, immediately precede the block name with a minus (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:

      -block -SAMPLE -COVERAGE
    
    Note that if the indel option is set to off, the INDEL block will not appear in the output.
  • Built-in default: all blocks except the INDEL block are included in the output report.
  • This is a new flag.

    -clear
    Causes PolyPhred to clear all polyPhred tags from the ACE and PHD files.

  • This is a new flag.

    -dir (-d) [work directory]
    Specifies the directory in which the data is located. This flag allows the user to run PolyPhred from a directory other than the one containing the data to be analyzed. If this flag is not used, PolyPhred should be run from the edit_dir directory of the data set to be analyzed.

  • Built-in default: ../

    -flanking (-f) [number]
    Specifies the number of bases flanking a polymorphic site to report in the POLY block of the output.

  • Accepted values: 0 - 50
  • Built-in default: 10
  • This is a new flag.

    -group (-g) [regular expression]
    Specifies the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression.

  • Built-in default:  .+

    -help (-h)
    Causes PolyPhred to print the help text. The flags are listed along with their allowed and default values.

    -indel (-i) [on / off]
    Allows the user to switch on or off the search for indel polymorphisms. If no argument is given, this feature is turned on. The indel search function is currently under development (see Detection of Insertion/Deletion Polymorphisms).

  • Built-in default: off
  • This is a new flag.

    -nav (-n) [file name / on / off]
    Causes PolyPhred to write a navigation file containing the polymorphic sites. If the file name is given but does not have a final ".nav" extension, PolyPhred adds one. The file is written to the edit_dir directory of the working directory. If the argument is "on", or no argument is given, a navigation file is written using the default navigation file name.

  • Built-in default: off
  • Built-in default file name: polyphred.nav

    To use the navigation file in Consed, click on 'Navigate', located at the top of the 'Consed Main Window'. Then click on 'Custom Navigation'. The window that appears should contain the name of the navigation file. Click on the file name to bring up the navigation window.

  • This is a new flag.

    -output (-o) [output file name / on / off]
    Causes PolyPhred to write the results to an output report. If no output file name is given, the default output file name is used. If no path is specified, the file is written to the edit_dir directory of the working directory. If the argument is "on", or no argument is given, an output report is written using the default output file name. If this feature is turned off, the results are written to standard output, and can be redirected to a file.

  • Built-in default: off
  • Built-in default file name: polyphred.out
  • This is a new flag.

    -quality (-q) [number]
    Specifies the lower quality limit. This value affects the extent of the excluded regions at the ends of the sample sequences (regions shaded in yellow when viewed in Consed). Reducing this value results in shorter excluded regions, and increasing the value results in longer excluded regions. The quality limit can also affect the ranking of SNPs (see How PolyPhred ranks SNP sites).

  • Accepted values: 0 - 50
  • Built-in default: 30

    -rank (-r) [number]
    Specifies the lower limit for the rank assigned to SNP sites. PolyPhred marks only those sites that are assigned a rank from 1 to the specified limit, inclusive. Adjusting the rank limit will affect the stringency of the SNP search (see How PolyPhred ranks SNP sites).

  • Accepted values: 1 - 6
  • Built-in default: 3
  • -r is now used for -rank instead of -ratio.

    -ref [reference sequence identifier / on / off]
    Causes PolyPhred to include the position of polymorphic sites relative to a reference sequence. For each site, the reference sequence position will be printed immediately after the consensus sequence position in the POLY, GENOTYPE, COLUMNGENOTYPE, MANUALGENOTYPE and INDEL blocks of the output. The reference sequence identifier argument is optional. If an identifier is specified, PolyPhred searches for a PHD file name containing this string of characters. When such a file is found, the sequence within this file is set as the reference sequence. If the argument is "on", or no argument is given, this feature is turned on and the default reference sequence identifier is used.

  • Built-in default: off
  • Built-in default identifier: .REF
  • This is a new flag.

    -scale (-s) [number]
    Specifies the scale factor that PolyPhred will use to determine how well the trace of a putative SNP matches the ideal. Using this flag will set the scales for both the secondary/primary peak area ratio and the heterozygous/homozygous peak height ratio. These scales can be set independently using the -s1 and -s2 flags (see below). Adjusting the scale values will affect the ranking of SNPs (see How PolyPhred ranks SNP sites).

  • Accepted values: 0.01 - 100
  • Built-in default: 1.0
  • This is a new flag.

    -scale1 (-s1) [number]
    Specifies the scale factor that PolyPhred will use to scale the secondary/primary peak area ratio (see the -s flag above).

  • Accepted values: 0.01 - 100
  • Built-in default: 1.0
  • This is a new flag.

    -scale2 (-s2) [number]
    Specifies the scale factor that PolyPhred will use to scale the heterozygous/homozygous peak height ratio (see the -s flag above).

  • Accepted values: 0.01 - 100
  • Built-in default: 1.0
  • This is a new flag.

    -tag (-t) [tag type]
    Specifies the type of tag that PolyPhred will use to mark SNP sites. The three tag types are genotype, polymorphism, and rank. The tag types can be abbreviated as g, p and r, respectively.

  • Built-in default: genotype

    -update [on / off]
    Allows the user to control updating the ACE and PHD files. If updating is switched off, only the output report will be written. If no argument is given, updating is turned on.

  • Built-in default: on
  • This is a new flag.

    -verbosity (-v) [number]
    Specifies the level of status reporting that PolyPhred will print to the screen as it is running. The allowed values range from 0 (least reporting) to 2 (most reporting).

  • Built-in default: 0
  • This is a new flag.

    -version
    Prints the PolyPhred version number and build number.

  • This is a new flag.

    -window (-w) [number]
    Specifies the window width. This value, together with the quality value, is used to determine how much of the ends of each sample sequence will be excluded from analysis.

  • Accepted values: 5 - 50
  • Built-in default: 40
  • This is a new flag.


    How PolyPhred ranks SNP sites

    The traces of a heterozygous site characteristically appear as two overlapping peaks. Ideally, the areas under the peaks are nearly the same, and the heights of the peaks are reduced by about a half of what the height of a homozygous peak would be at the same position. When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a rank that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. If a site is deemed not to be heterozygous, it is assigned a rank indicating how well it fits the expected trace for a homozygous site. The ranks range from 1, indicating a very good fit, to 6, indicating a very poor fit. (In fact, the vast majority of sites with ranks less than 3 are not polymorphic.)

    For each position in the consensus sequence, PolyPhred first examines all sites that are aligned at the position (i.e. the sites that line up in a column when the sequences are viewed in Consed). Each site is assigned a genotype and a rank. Next, PolyPhred counts the number of heterozygous sites with a rank equal to or better than a threshold called the rank limit. If there is at least one such site, PolyPhred marks the column as a polymorphic position and assigns an overall rank to the position. This rank is generally the rank of the best heterozygous site in the column. Consequently, the column rank is never below the rank limit.
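
The column-marking rule in the preceding paragraph can be sketched in a few lines. This is a simplified illustration, not PolyPhred's implementation: the function name `rank_column` and the representation of a site as a (genotype, rank) pair with a two-letter genotype string are assumptions, and the text says the column rank is only "generally" the best heterozygous rank, so later adjustments are not modeled.

```python
def rank_column(sites, rank_limit=3):
    """Decide whether a column of aligned sites is marked polymorphic.

    sites: list of (genotype, rank) pairs, e.g. ("AG", 2) for a heterozygote
           or ("AA", 1) for a homozygote (hypothetical representation).
    Returns the overall column rank, or None if the column is not marked.
    """
    # Keep only heterozygous sites whose rank is equal to or better than
    # the rank limit (rank 1 is best, rank 6 is worst).
    het_ranks = [
        rank
        for genotype, rank in sites
        if len(set(genotype)) > 1 and rank <= rank_limit
    ]
    if not het_ranks:
        return None
    # The column takes the rank of its best heterozygous site.
    return min(het_ranks)
```

With the default rank limit of 3, a column holding one rank-2 heterozygote is marked with rank 2, while a column whose only heterozygote has rank 5 is not marked.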

    In Consed, columns that are marked polymorphic appear blue, with pink markers indicating the heterozygous sites. At the top of each column, the base in the consensus sequence is marked with a color indicating the overall rank:

    Rank  Color
    1     red
    2     orange
    3     green
    4     dark blue
    5     magenta
    6     purple

    PolyPhred primarily uses three factors to determine the rank of a heterozygous site. One is the ratio of the areas under the two peaks (called the area ratio). The second is the ratio of the actual height of one of the peaks to the height of a hypothetical homozygous peak (called the normalization ratio). The peak that is used corresponds to the consensus base at the position. The third factor is the average quality, assigned by Phred, of the sites flanking the heterozygous site (the two sites immediately adjacent to the heterozygous site are excluded from the average, as Phred typically reduces their quality due to the heterozygous site itself). After assigning an initial rank based on these three factors, PolyPhred examines other aspects of the trace, such as the presence of a third peak, and adjusts the rank accordingly.

    There are several flags that the user can use to affect how PolyPhred ranks sites, and thereby increase or decrease the number of positions that are marked polymorphic. These flags and their effect on calling of polymorphic positions are discussed below.

    The -rank flag sets the rank limit. Setting the rank limit lower (i.e. toward 6) will result in more positions marked as polymorphic. However, the lower the rank limit, the more of the calls will likely be false positives. The number of false positives can be reduced by raising the rank limit (i.e. toward 1). However, this can result in true polymorphic positions being missed.

    The parameters that are set with the -quality and -window flags affect the length of the regions at the ends of the sample sequences that are excluded from the search for SNPs. These regions appear yellow in Consed. PolyPhred determines the boundary of these regions by calculating the average base quality (as determined by Phred) within a sliding window. The window slides in from the ends of the sequence and stops when the average quality reaches or exceeds a threshold, or quality limit, which the user can set with the -quality flag. The border of the excluded region is then set at the first base within the window with a quality at least 75% of the quality limit. The size of the sliding window can be set by the -window flag. Increasing the quality limit will result in more of the ends being excluded from the search and, in general, reduce the number of positions that are marked polymorphic. Altering the window size results in smaller and less predictable changes in the position of the border. In general, decreasing the window size tends to move the border further inward.
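
The sliding-window procedure can be illustrated with the sketch below. It follows the description in the paragraph above but is not PolyPhred's code. The function names are hypothetical, and two details are assumptions: the window advances one base at a time, and "the first base within the window" means the outermost base of the stopped window that reaches 75% of the quality limit.

```python
def left_border(quals, quality_limit=30, window=40):
    """Return the index of the first base kept at the left end, or None.

    quals: list of Phred base qualities for one sample sequence.
    """
    for start in range(len(quals) - window + 1):
        win = quals[start:start + window]
        # Stop at the first window whose average reaches the quality limit.
        if sum(win) / window >= quality_limit:
            # The border is the first base in that window with a quality
            # of at least 75% of the quality limit.
            for offset, q in enumerate(win):
                if q >= 0.75 * quality_limit:
                    return start + offset
    return None  # no window qualifies: the whole sequence is excluded


def search_region(quals, quality_limit=30, window=40):
    """Return (left, right) inclusive indices of the searched region, or None."""
    left = left_border(quals, quality_limit, window)
    if left is None:
        return None
    # The right end is handled by sliding in from the other side.
    from_end = left_border(quals[::-1], quality_limit, window)
    return left, len(quals) - 1 - from_end
```

For a read with ten low-quality bases (quality 5) at each end and thirty good bases (quality 40) in the middle, calling `search_region(quals, quality_limit=30, window=5)` returns `(10, 39)`, so only the good bases are searched.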

    Changing the quality limit can also affect how individual sites are ranked. Increasing the quality limit can result in reducing the ranks of some sites (i.e. toward 6), and possibly a reduction in the number of positions marked polymorphic.

    The stringency with which PolyPhred compares putative heterozygous sites with the ideal trace can be altered by using the scale flags -scale, -scale1 and -scale2. The -scale1 flag works on the area ratio, and the -scale2 flag works on the normalization ratio. The -scale flag affects both ratios simultaneously.

    PolyPhred uses the scale values to adjust the two ratios before comparing them with the expected ratios. Setting either scale value to a number less than 1 will cause PolyPhred to require a tighter match between the actual and ideal traces, and therefore result in fewer SNPs being called. Conversely, setting a scale value to a number greater than 1 will result in more SNPs being called.


    The Output Report

    The new flag -o allows the user to specify the file name of the output report as a command line option rather than by redirecting the standard output. If the user uses the same file name for all output reports, the -o flag can be put in a .polyphredrc file (see Customizing PolyPhred). If no output file name is given, the default output file name is used. If no path is specified, the file is written to the edit_dir directory of the working directory. If the -o flag is not used, the report will be written to standard output as usual, and can be redirected to a file.

    To facilitate parsing of the output report, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.

    The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.

    Next is the COMMAND_LINE block. This block lists the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. This includes the working directory and the ACE file that was used, and those parameters that affect the analysis.

    The rest of the report contains results for one or more contigs. The contigs must be valid (i.e. contain more than one valid sample sequence) to appear in the report. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks actually appear in the output report by using the -block flag.

    If the -ref flag is used, the position relative to a reference sequence is written in the second field immediately after the consensus sequence position in the POLY, GENOTYPE, COLUMNGENOTYPE, MANUALGENOTYPE, VERIFIED and INDEL blocks.

    The POLY block
    The positions where SNP sites were identified are listed in this block. Each line reports the consensus sequence position, 5' sequence flanking the polymorphic site, the two major alleles found at the position, 3' sequence flanking the polymorphic site, and the over-all rank assigned to the position.

    The GENOTYPE block
    For each position where a SNP site was identified, the sample sequences that cover the position are listed. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the rank.

    The COLUMNGENOTYPE block
    For each manually-tagged position on the consensus sequence, the sample sequences that cover the position are listed. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the rank.
    This is a new block.

    The INDEL block
    If the -indel flag is used, the identified insertion/deletion sites are listed in the INDEL block. Each line reports the consensus sequence position, the position relative to the sample sequence in which the indel was found, the name of the sample sequence, and the size of the indel. A positive size indicates an insertion, and a negative size indicates a deletion.
    This is a new block.

    The MANUALGENOTYPE block
    Sample sequence sites that have been tagged manually are listed in this block. PolyPhred obtains the user-defined tags from the .polyphredrc file (see Customizing PolyPhred). Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, and the identity of the tag.
    This is a new block.

    The VERIFIED block
    Positions on the consensus sequence that have been manually marked as verified sites are listed in this block. PolyPhred obtains the user-defined tags from the .polyphredrc file (see Customizing PolyPhred). Each line reports the consensus sequence position and the tag identity.
    This is a new block.

    The SAMPLE block
    The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the left and right limits of the region searched by PolyPhred, and the average site quality, as determined by Phred, within the search region. The limits of the search region are calculated using the -quality and -window parameters and are indicated by the yellow regions at the ends of the sample sequences in Consed.

    The COVERAGE block
    This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
    This is a new block.
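
The BEGIN_BLOCKNAME / END_BLOCKNAME layout described in this section lends itself to a small stack-based reader. The sketch below is one possible way to split a report into its blocks. It is not part of PolyPhred, the function name `parse_report` is hypothetical, and it assumes that every BEGIN_ and END_ token sits alone on its own line. The fields within each line are left unparsed because this document does not specify the delimiter.

```python
def parse_report(text):
    """Split a PolyPhred output report into blocks.

    Returns a list of (path, lines) pairs in the order the blocks close.
    path is a tuple of enclosing block names, for example
    ("MESSAGE", "CONTIG", "POLY"); lines holds the non-empty lines that
    sit directly inside that block.
    """
    blocks = []
    stack = []    # names of the blocks that are currently open
    content = []  # one list of collected lines per open block
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("BEGIN_"):
            stack.append(line[len("BEGIN_"):])
            content.append([])
        elif line.startswith("END_") and stack and line[len("END_"):] == stack[-1]:
            blocks.append((tuple(stack), content.pop()))
            stack.pop()
        elif line and stack:
            content[-1].append(line)
    return blocks
```

With this layout, the entry whose path ends in "CONTIG" carries the contig name as its first line, and an entry such as ("MESSAGE", "CONTIG", "GENOTYPE") carries the genotype lines for that contig.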


    Tracking Manually-applied Tags

    Consed allows the user to create custom tags that can be manually applied to the consensus sequence and sample sequences. Manual tags can be used to mark special sites or regions and provide information about them, or to override calls made by Phred or PolyPhred. Consed stores these tags in the ACE and PHD files.

    PolyPhred is able to capture three types of manual tags and provide information about them in the output report. To activate the tag-capturing feature, it is necessary to create a .polyphredrc file and indicate in the file the tag types to be captured (see Customizing PolyPhred).

    The manualtag type
    This tag type is used to mark sites in sample sequences. Typically this is done to modify or override the genotype call made by Phred or PolyPhred. The captured tags are listed in the MANUALGENOTYPE block.

    The verifiedtag type
    This tag type is applied to the consensus sequence to indicate positions verified as polymorphic by an analyst. The captured tags are listed in the VERIFIED block.

    The columntag type
    This tag type is applied to positions on the consensus sequence and is used to force PolyPhred to genotype the column of sites at those positions. This is done in addition to the normal search and genotyping function that PolyPhred performs. Sites that are genotyped under one of these manual tags are listed in the COLUMNGENOTYPE block.


    Detection of Insertion/Deletion Polymorphisms

    Searching for insertion/deletion (indel) polymorphisms is a new optional feature in this version of PolyPhred. When PolyPhred locates a putative indel site within a sample sequence, it excludes the sequence downstream of the indel site from the search for SNP sites.

    The indel search algorithm is still under development, and there is room for improvement in future versions. Currently, PolyPhred identifies only those sample sequences that appear to be heterozygous for an indel. Sequences that are homozygous for an indel relative to the consensus sequence are not marked. Also, PolyPhred tends to under-report the presence of indels. For these reasons, the indel search option is set to off by default, and can be activated by using '-i on' in the command line. However, users who wish to change the default setting to 'on' can do so in the .polyphredrc file.

    When PolyPhred identifies an indel site, it inserts an 'indelSite' tag in the ACE file. Sample sequences containing the indel are marked with a 'heterozygoteIndel' tag, while those that do not are marked with a 'homozygoteIndel' tag. Indels are also identified as either diallelic or multiallelic in the tag comment.

    Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, it is necessary to modify the .consedrc file. Add the following lines to the .consedrc file:

      consed.customConsensusTag1: indelSite
      consed.tagColorCustomConsensusTag1: DarkCyan
      consed.customTag1: indel
      consed.tagColorCustomTag1: DarkOrange
    

    If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.
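
Finding the next available number can be automated with a short helper. This is a sketch only: the function name `next_custom_tag_number` is hypothetical, it assumes the .consedrc entries follow the "consed.customTagN: name" form shown above, and it interprets "next available" as the smallest positive number not yet in use.

```python
import re


def next_custom_tag_number(consedrc_text, key="customTag"):
    """Return the smallest positive N for which consed.<key>N is unused.

    key is "customTag" or "customConsensusTag".
    """
    pattern = re.compile(
        r"^\s*consed\." + re.escape(key) + r"(\d+)\s*:", re.MULTILINE
    )
    used = {int(n) for n in pattern.findall(consedrc_text)}
    n = 1
    while n in used:
        n += 1
    return n
```

The helper is called once per tag family, passing the contents of the .consedrc file and either "customTag" or "customConsensusTag" as the key.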


    Installing PolyPhred

    1. Make sure the following programs are installed:
        phred      version 0.961028 or later
        phrap      version 0.960731 or later
        phd2fasta  version 0.971024 or later
        phredPhrap
        consed     version 8.0 or later
      

    2. Download the PolyPhred package for the appropriate platform. Put the file in a directory where it is to be unpacked.

    3. Run "gunzip polyphred.tar.gz". This should produce a file called "polyphred.tar".

    4. Run "tar xvf polyphred.tar". This should produce the following files and directories:
       polyphred              the PolyPhred program
       sudophred              tool for making PHD and POLY files
       polyphred.html         this document
      

    5. Move or copy the "polyphred" and "sudophred" files to the directory from which they will be run, such as
      /usr/local/genome/bin/

    6. Edit the phredPhrap file as follows:

      1. Uncomment (remove the # from) the line
        # $polyPhredExe = "/usr/local/genome/bin/polyphred";
        Make sure the path within the quotes matches the directory in the previous step.

      2. Change the 0 to 1 in the line
        $bUsingPolyPhred = 0;

    7. Read the section Customizing PolyPhred, as well as the section Detection of Insertion/Deletion Polymorphisms for instructions on customizing Consed.


    Running PolyPhred

    Phred, Phrap, Consed and PolyPhred all require a fixed directory structure for analyzing sequence data. If a gene to be analyzed is called "mygene", for example, the directory structure should look like this:

      mygene/
    
    containing the subdirectories:
      chromat_dir/
      edit_dir/
      phd_dir/
      poly_dir/
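
The directory layout above can be created with a few lines of Python. This is a convenience sketch, not part of the PolyPhred distribution, and the function name `make_project_dirs` is hypothetical.

```python
from pathlib import Path

# The four subdirectories named in this document.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")


def make_project_dirs(parent, name):
    """Create <parent>/<name>/ with the four required subdirectories."""
    root = Path(parent) / name
    for sub in SUBDIRS:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root
```

For example, `make_project_dirs(".", "mygene")` creates mygene/ in the current directory with all four subdirectories inside it.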
    

    1. For each set of data to be analyzed, create this directory structure.

    2. Move or copy the chromat (trace) files to the chromat_dir directory.

    3. From the edit_dir directory, run "phredPhrap mygene" where "mygene" is the name of the data to be analyzed. This program runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir, including one with the extension .ace.1 (the ACE file), and several files in the phd_dir and poly_dir directories.

    4. View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.

    5. Run "polyphred". Include any desired flags on the command line.

    6. Use Consed to view the polymorphic sites.
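
When PolyPhred is driven from a script, the command line in step 5 can be assembled programmatically. The sketch below only builds the argument list; the function name `build_polyphred_command` is hypothetical, and the caller is responsible for passing only flags documented in The Flags section.

```python
def build_polyphred_command(exe="polyphred", **options):
    """Build an argument list such as ["polyphred", "-rank", "2", "-nav"].

    Each keyword becomes a flag. A value of None produces a flag without
    an argument, and a list or tuple produces several arguments
    (as needed by -block).
    """
    argv = [exe]
    for flag, value in options.items():
        argv.append("-" + flag)
        if isinstance(value, (list, tuple)):
            argv.extend(str(v) for v in value)
        elif value is not None:
            argv.append(str(value))
    return argv
```

The resulting list can be passed to `subprocess.run(argv, cwd=...)` with `cwd` set to the edit_dir directory of the data set, which matches the instruction to run PolyPhred from edit_dir when -dir is not used.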


    Using the sudophred Tool

    The sudophred program can be used to generate fake chromat, PHD and POLY files from sequence data for which there are no chromat files. This is necessary when one wants to use PolyPhred to search for mismatches in such sequence data. The program can also be used to generate a reference sequence to be included in the PolyPhred analysis.

    One or more sequences can be included in a single file. The sequences must be in the FASTA format. If the reference sequence option is not used, sudophred writes a chromat file, a PHD file and a POLY file for each sequence in the FASTA file. The names of the files are based on the string that follows the '>' character at the top of the sequence. The PHD files have the ".phd.1" extension appended to the end of the string, and the POLY files have the ".poly" extension appended to the end of the string. If the reference sequence option is used, sudophred writes only one chromat file and PHD file, using the first sequence in the FASTA file.

    If one runs sudophred from one of the subdirectories in the directory structure described above (for example, from the edit_dir directory), sudophred writes the chromat files into the chromat_dir directory, the PHD files into the phd_dir directory, and the POLY files into the poly_dir directory. If sudophred is run in some other directory, the files are written into this same directory.

    To generate chromat, PHD and POLY files, sudophred must be run with the name of the FASTA file as the first argument. The file name can be followed by one or both of two flags, -q and -r.

    The -q flag must be followed by an integer number in the range 0 to 59. This value specifies the quality that sudophred will give to each base in the PHD file. If the -q flag is not supplied, sudophred will use the default value 59. When the sequence is viewed in Consed, the quality value will determine how gray the sequence appears.

    To generate a reference sequence, use the -r flag. The flag can be followed by a string that specifies the reference sequence identifier. If this string is not supplied, sudophred uses the default string ".REF". The reference identifier appears in the name of the chromat and PHD files, just prior to the ".phd.1" extension.

    To display a help message, run "sudophred -h" or "sudophred -help".
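
    The command-line rules for the -q and -r flags can be sketched as a small Python helper that assembles an argument list. The helper name build_sudophred_command and its parameters are hypothetical. Only the flags documented above are used.

```python
def build_sudophred_command(fasta_file, quality=None, reference=False,
                            ref_id=None):
    """Assemble a sudophred argument list from the documented options."""
    command = ["sudophred", fasta_file]  # the FASTA file name comes first
    if quality is not None:
        if not 0 <= quality <= 59:
            raise ValueError("-q expects an integer from 0 to 59")
        command += ["-q", str(quality)]
    if reference or ref_id is not None:
        command.append("-r")
        if ref_id is not None:
            # Optional identifier; the documented default is ".REF".
            command.append(ref_id)
    return command
```

    The sketch places -q before -r so that a bare -r is never followed by another flag. This is a cautious choice, since the documentation does not say how sudophred treats a flag that directly follows -r.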


    Using a Reference Sequence

    It is sometimes desirable to assemble sample sequences onto a known (reference) sequence, and to have PolyPhred report polymorphic positions relative to that sequence. To do this, one needs to create a fake PHD file for the reference sequence, and then run Phred and Phrap in the presence of this file. The process of creating a reference PHD file, assembling it with sample data, and reporting reference sequence positions with PolyPhred is described below:

    1. Set up the directory structure described above, with chromat files in the chromat_dir directory, and the edit_dir, phd_dir and poly_dir directories empty.

    2. Obtain the reference sequence in FASTA format. Put the FASTA file in the edit_dir directory.

    3. In the edit_dir directory, run sudophred (supplied with polyphred) as follows:

      sudophred [filename] -r
      or
      sudophred [filename] -r [reference identifier]

      where [filename] is the name of the FASTA file. The sudophred program writes a PHD file for the reference sequence into the phd_dir directory. If there is more than one sequence in the FASTA file, sudophred uses only the first sequence. The name of the PHD file is based on the string that follows the '>' character at the top of the sequence. The reference identifier is appended to the end of the string, followed by the extension ".phd.1". If the reference identifier is not supplied, then the default string ".REF" is used.

    4. Run "phredPhrap". If the results are examined in Consed, the reference sequence should appear together with the sample sequences.

    5. Run "polyphred -ref". Include the reference sequence identifier if it is different from the default. Polymorphic positions will appear in the second column within the POLY, GENOTYPE, COLUMNGENOTYPE, MANUALGENOTYPE, VERIFIED and INDEL blocks of the output report.
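
    Steps 3 to 5 above can be scripted. The Python sketch below lists the three commands in order and offers a helper that runs them inside edit_dir. The function names are hypothetical, and placing the identifier directly after -ref on the polyphred command line is an assumption based on the description of step 5.

```python
import subprocess


def reference_workflow(fasta_file, ref_id=None):
    """List the commands of steps 3 to 5, in order."""
    extra = [] if ref_id is None else [ref_id]
    return [
        ["sudophred", fasta_file, "-r", *extra],  # step 3: reference PHD file
        ["phredPhrap"],                           # step 4: assembly
        ["polyphred", "-ref", *extra],            # step 5: reference positions
    ]


def run_workflow(commands, edit_dir):
    """Run each command inside edit_dir, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=edit_dir, check=True)
```

    Calling run_workflow(reference_workflow("ref.fasta"), "edit_dir") would run the three programs in sequence, provided they are installed and on the search path.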


    Searching for Mismatches in Non-trace Sequence Data

    It is possible to use PolyPhred to search for mismatches between sequences for which there are no chromat files, or to compare such sequences with chromat data. The steps are similar to those for creating a reference sequence.

    1. Set up the directory structure described above, with chromat files in the chromat_dir directory, and the edit_dir, phd_dir and poly_dir directories empty.

    2. Obtain a file containing one or more sequences in FASTA format. Put the FASTA file in the edit_dir directory.

    3. In the edit_dir directory, run sudophred (supplied with polyphred) as follows:

      sudophred [filename]

      where [filename] is the name of the FASTA file. The sudophred program writes a PHD file and a POLY file for each sequence in the FASTA file. The names of the files are based on the string that follows the '>' character at the top of the sequence. The PHD files have the ".phd.1" extension appended to the end of the string, and the POLY files have the ".poly" extension appended to the end of the string.

    4. Run "phredPhrap".

    5. Run "polyphred".


    Customizing PolyPhred

    PolyPhred can be customized to suit the preferences of the user by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as to specify any manual tags that PolyPhred should capture and write to the output report. This file is optional; if it is not present, PolyPhred will use its built-in default parameter values and will not capture manual tags.

    When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks in the directory named by the POLYPHRED_PATH environment variable. This variable is set by including in the shell rc file the line:

      setenv POLYPHRED_PATH [path]
    
    where [path] is the directory containing the .polyphredrc file.
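
    The search order can be sketched in Python as follows. This illustrates the documented order only and is not PolyPhred's own code. The function name find_polyphredrc and its parameters are hypothetical, and reading the directory from the POLYPHRED_PATH environment variable assumes that the setenv line above has taken effect in the current shell.

```python
import os
from pathlib import Path


def find_polyphredrc(cwd=None, home=None, environ=None):
    """Return the first .polyphredrc found in the documented order, or None."""
    environ = os.environ if environ is None else environ
    candidates = [Path(cwd) if cwd else Path.cwd(),     # 1. current directory
                  Path(home) if home else Path.home()]  # 2. home directory
    if environ.get("POLYPHRED_PATH"):
        candidates.append(Path(environ["POLYPHRED_PATH"]))  # 3. POLYPHRED_PATH
    for directory in candidates:
        rc_file = directory / ".polyphredrc"
        if rc_file.is_file():
            return rc_file
    return None
```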

    The default values of any parameter can be changed in the .polyphredrc file. To change a default value, enter the key word "flag", the command-line flag for the parameter and the new default value. Each entry must be on a separate line. For example, to change the default value for the rank limit to 2 and the quality limit to 25, enter these two lines in the .polyphredrc file:

      flag -r 2
      flag -q 25
    

    If the "flag" key is used to change the default file names for the -nav and -output flags, or the default reference identifier for the -ref flag, these features will also be turned on by default. To make these changes while keeping the features off by default, a different set of key words must be used:

      outputfile [file name]
      navfile [file name]
      refID [identifier]
    
    The first line changes the default file name of the output report, the second line changes the default file name of the navigation file, and the third line changes the default reference sequence identifier.

    To specify the manual tags that are applied to sample sequences, list each tag on a separate line, preceded by the key word "manualtag". Sites marked with these tags will be listed in the MANUALGENOTYPE block of the output report.

    To specify the manual tags that mark verified positions in the consensus sequence, list each tag on a separate line, preceded by the key word "verifiedtag". Positions marked with these tags will be listed in the VERIFIED block.

    To specify the manual tags that mark columns for forced genotyping, list each tag on a separate line, preceded by the key word "columntag". Genotype information for each sample sequence covering the tagged positions will be listed in the COLUMNGENOTYPE block.

    The date that appears at the top of the output report can be changed from the "day/month/year" format, which is the default, to the "month/day/year" format. To do this, put this line in the .polyphredrc file:

      date MDY
    

    Blank lines are permitted. In addition, a line that begins with the character # is treated as a comment.

    An example of a .polyphredrc file:

      date MDY
      flag -q 25
      flag -f 16
    
      outputfile report.txt
      refID .refSeq
    
      # Manual Tags
      verifiedtag polymorphism
      columntag   manualGenotype
      manualtag   heterozygote
      manualtag   homozygote
      manualtag   indel
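
    To make the file format concrete, the following Python sketch reads a .polyphredrc file into a dictionary. It is an illustration, not PolyPhred's own parser. The function name parse_polyphredrc is hypothetical, and several behaviors are assumptions: leading whitespace is ignored, values contain no spaces, and an unrecognized key word is reported as an error.

```python
def parse_polyphredrc(text):
    """Collect the settings of a .polyphredrc file into a dictionary."""
    settings = {"flags": {}, "manualtag": [], "verifiedtag": [],
                "columntag": []}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comment lines are permitted
        key, *values = line.split()
        if not values:
            raise ValueError(f"missing value after key word: {key}")
        if key == "flag":
            # e.g. "flag -q 25" is stored as {"-q": "25"}
            settings["flags"][values[0]] = " ".join(values[1:])
        elif key in ("manualtag", "verifiedtag", "columntag"):
            settings[key].append(values[0])  # these key words may repeat
        elif key in ("outputfile", "navfile", "refID", "date"):
            settings[key] = values[0]
        else:
            raise ValueError(f"unrecognized key word: {key}")
    return settings
```

    Applied to the example file above, the sketch collects the three manualtag entries into one list and records the two flag entries under "-q" and "-f".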
    


    Who to Contact with Questions and Problems

    If you have questions or problems with Phred, Phrap or Consed, or you need to obtain these programs, please see the web site at:
    http://www.phrap.org

    If you have questions or problems with PolyPhred, please

    1. read this documentation carefully;

    2. go to this web site: http://droog.gs.washington.edu

      Follow the "PolyPhred" link for the email address of the person to contact. Please do not email questions to the web master.

    If you discover an error in PolyPhred, please follow step 2 above.


    References

    1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A., 1994
       "Comparative analysis of human DNA variations by fluorescence-based sequencing 
       of PCR products", Genomics 25, 615-622.
    
    2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "PolyPhred: automating the 
       detection and genotyping of single nucleotide substitutions using fluorescence-based 
       resequencing", Nucleic Acids Research, 25: 2745-2751.
    
    3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated 
       sequencer traces using phred.  I. Accuracy assessment", Genome Research 8: 175-185.
    
    4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using 
       phred.  II. Error probabilities", Genome Research 8: 186-194.  
    
    5. Green, P., 1994, Phrap, unpublished.
       http://www.genome.washington.edu/UWGC/analysistools/phrap.htm
    
    6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence 
       finishing", Genome Research 8:195-202.