Documentation for PolyPhred Version 4.2

Last modified: 2005/04/21

Program: PolyPhred
Version: 4.2
Copyright (C) 2002-2004
by Deborah A. Nickerson, Scott Taylor, Natali Kolker and Jim Sloan
University of Washington

All rights reserved.

This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.

This software is provided "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


Contents

  1. Description of Features
  2. Setup and Operating Instructions
  3. More Information


Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations play a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.

Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.

Identification of potential heterozygous sites is based on 1) the presence of two significant overlapping fluorescence peaks at such sites in the sequence trace, and 2) detecting a decrease of about 50% in the peak heights when the sequence trace is compared with that obtained from homozygous individuals (references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2).

PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes the programs Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).


How PolyPhred works

PolyPhred identifies potential heterozygous sites by comparing traces in a sequence assembly. Phred provides the base-calls, base quality information and the peak size information, which is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs, and to derive a consensus sequence for each contig. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.

During the analysis phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence (see How PolyPhred scores SNP sites). It also uses a standard sequence for comparison to identify sites that are homozygous for a minor or alternative allele. The score indicates how well the trace at the site matches the expected pattern for a SNP. After PolyPhred identifies the putative polymorphic sites, it updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using the program Consed. PolyPhred also generates a detailed output that lists the positions, genotypes and scores of the polymorphic sites in a format that can be easily parsed into a database program.


What is new in PolyPhred Version 4.2


The flags

The operation of PolyPhred can be controlled using command-line flags. In the descriptions of the flags below, differences between version 4.0 and 4.2 are indicated in red. Note that the -scale flags are no longer functional in version 4.2, and that -s now stands for the -source flag.

Many of the flags have an abbreviated form, which is shown in parentheses. Most of the flags take an argument, which is shown in square brackets ([ ]). For some flags, the argument is optional. In these cases, the argument is indicated in green, and the default value used when the argument is omitted is shown.

All of the flags are optional. Each description indicates the argument value or action taken if the flag is omitted.

-ace (-a) [ace file]
Use this flag to specify the ACE file to be read by PolyPhred.
•  If omitted: the most recent ACE file is used.

-block [list of block names]
Use this flag to include or exclude blocks from the output file. The valid block names are POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED, MICROSATELLITE, SAMPLE and COVERAGE. To include a block, precede the block name with a plus sign (+). To exclude a block, precede the block name with a minus sign (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:

  -block -SAMPLE -COVERAGE
•  If omitted: all blocks except MICROSATELLITE are included in the output.
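
When PolyPhred is launched from a script, the -block arguments can be assembled programmatically. The following Python sketch is only an illustration: the helper name block_args is hypothetical and is not part of PolyPhred, and the block names are copied from the list above.

```python
# Valid block names, as listed in the -block description above.
VALID_BLOCKS = [
    "POLY", "GENOTYPE", "COLUMNGENOTYPE", "INDEL", "POLYINDEL",
    "COLUMNINDEL", "MANUALGENOTYPE", "VERIFIED", "MICROSATELLITE",
    "SAMPLE", "COVERAGE",
]


def block_args(include=(), exclude=()):
    """Return the -block portion of a PolyPhred command line as a list.

    Hypothetical helper: '+' marks blocks to include, '-' blocks to exclude.
    """
    for name in list(include) + list(exclude):
        if name not in VALID_BLOCKS:
            raise ValueError("unknown block name: " + name)
    args = ["-block"]
    args += ["+" + name for name in include]
    args += ["-" + name for name in exclude]
    return args


print(" ".join(block_args(exclude=["SAMPLE", "COVERAGE"])))
# prints: -block -SAMPLE -COVERAGE
```

The returned list can be appended to the rest of the command line.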

-clear
Use this flag to remove all PolyPhred tags from the ACE and PHD files.
•  If omitted: normal operation

-dir (-d) [work directory]
Use this flag to specify the location of the data. The flag allows PolyPhred to be run from a directory other than the one containing the data to be analyzed (see Running PolyPhred).
•  If omitted: PolyPhred must be run either from the edit_dir directory or from the data directory of the data to be analyzed.

-flanking (-f) [number]
Use this flag to specify the number of bases flanking the polymorphic sites reported in the POLY and POLYINDEL blocks of the PolyPhred output.
•  Accepted numbers: 0 - 50
•  If omitted: 10

-group (-g) [regular expression]
This flag specifies a subset of the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression.
•  If omitted:  .+
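
The effect of a -group pattern can be previewed before running PolyPhred. The sketch below uses Python's re module purely for illustration. The regular expression dialect that PolyPhred uses is not stated here, the sequence names are hypothetical, and the sketch assumes that a pattern may match anywhere in a name.

```python
import re

# Hypothetical sequence names; real names come from the chromat files.
names = ["NA12878-ex1.f.scf", "NA12878-ex1.r.scf", "NA10851-ex1.f.scf"]


def select(names, pattern):
    """Return the names that the pattern matches anywhere (an assumption)."""
    rx = re.compile(pattern)
    return [n for n in names if rx.search(n)]


print(select(names, r".+"))        # the default pattern keeps every name
print(select(names, r"^NA12878"))  # keeps only the two NA12878 reads
```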

-help (-h)
Use this flag to see information on how to use PolyPhred. The flags are listed along with their allowed and default values.
•  If omitted: normal operation

-indel (-i) [on / off]
Use this flag to switch on or off the search for indel polymorphisms. See Detection of insertion/deletion polymorphisms.
•  Default argument: on
•  If omitted: off

-ms [on / off]
Use this flag to switch on or off the marking of simple microsatellite repeats.
•  Default argument: on
•  If omitted: off

-nav (-n) [file name / on / off]
Use this flag to generate a navigation file listing the polymorphic sites. If the file name is given but does not have a final ".nav" extension, PolyPhred adds one. The file is written to the edit_dir directory of the working directory.
•  Default argument: on, using the file name "polyphred.nav"
•  If omitted: off

To use the navigation file in Consed, click on 'Navigate', located at the top of the 'Consed Main Window'. Then click on 'Custom Navigation'. The window that appears should contain the name of the navigation file. Click on the file name to bring up the navigation window.

-output (-o) [file name / on / off]
Use this flag to send the PolyPhred output either to a file or to the standard output (the screen). If the argument is "off", the output is written to the screen. In this case, the output can be redirected to a file using '>'.
•  Default argument: on, using the file name "polyphred.out"
•  If omitted: off

-quality (-q) [value]
Use this flag to set the quality limit. PolyPhred uses the quality limit to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed). Reducing this value results in less trimming of the ends. See Reducing the false-positive rate.
•  Accepted value: 0 - 50
•  If omitted: 30

-rank (-r) [value / on / off]
Use this flag to direct PolyPhred to score sites with the six-point ranking system. To set the rank limit, follow the flag with a number from 1 to 6. PolyPhred marks and reports only sites that are assigned a rank between 1 and the rank limit, inclusive. See Reducing the false-positive rate.
•  Accepted value: 1 - 6
•  Default argument: on, using the value 3
•  If omitted: the 100-point scoring system is used.

-ref [reference sequence identifier / on / off]
Use this flag to specify a reference sequence for reporting of polymorphic site positions. In this case, PolyPhred uses the consensus sequence as the standard, rather than the reference sequence (see -refcomp below). See Using a reference sequence.
•  Default argument: on, using the identifier ".REF"
•  If omitted: off

-refcomp [reference sequence identifier / on / off]
Use this flag to direct PolyPhred to use a reference sequence as the standard rather than the consensus sequence. See Using a reference sequence.
•  Default argument: on, using the identifier ".REF"
•  If omitted: off
•  This is a new flag.

-source (-s) [/delimiter / posn1 posn2 / off]
Use this flag to activate the source genotype resolution function and set the location in the chromat file names of the source identifier, or turn the function off. The source identifier is the series of characters that uniquely identifies the source of the DNA sample. PolyPhred uses the source identifier to match sequences from the same DNA sample. See Reducing the false-positive rate.

The source identifier can be placed in the chromat file names by either of two methods. One method is to flank the identifier characters with a delimiter. Any valid file name character can serve as the delimiter. When running PolyPhred, indicate the delimiter as follows ('c' is the delimiter character):

  polyphred -s /c
For example, if the chromat file names are of the form: abc-source-xyz.scf
run PolyPhred as:
  polyphred -s /-

The second method for locating the source identifier is to place the identifier characters in a constant location in all chromat file names. Indicate the location of the identifier characters as follows:

  polyphred -s posn1 posn2
For example, if all chromat file names are of the form: abcSOURCExyz.scf
where SOURCE is the location of the identifier characters, from positions 4 to 9, then run PolyPhred as follows:
  polyphred -s 4 9

If the function has been activated in the .polyphredrc file, it can be turned off with the 'off' argument.
•  If omitted: off
•  This is a new flag.
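
The two ways of locating the source identifier can be checked against a set of file names with a short Python sketch. The function names are hypothetical. The position rule (counted from 1, inclusive) follows from the 'abcSOURCExyz.scf' example above, and the delimiter rule assumes that the identifier is the text between the first two occurrences of the delimiter.

```python
def source_by_delimiter(name, delim):
    """Text between the first two occurrences of delim (assumed rule)."""
    parts = name.split(delim)
    if len(parts) < 3:
        raise ValueError("delimiter does not flank an identifier: " + name)
    return parts[1]


def source_by_position(name, posn1, posn2):
    """Characters posn1..posn2, counted from 1 and inclusive."""
    return name[posn1 - 1:posn2]


print(source_by_delimiter("abc-source-xyz.scf", "-"))  # source
print(source_by_position("abcSOURCExyz.scf", 4, 9))    # SOURCE
```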

-score [number]
Use this flag to select the 100-point scoring system and set the score limit. PolyPhred marks and reports only sites that are assigned a score between 99 and the score limit, inclusive. See Reducing the false-positive rate.
•  Accepted numbers: 0 - 99
•  If omitted: the 100-point scoring system is used with a limit of 20
•  This is a new flag.

-snp [het / hom / on / off]
Use this flag to switch on or off SNP detection, or to select either the marking of heterozygous (het) or homozygous (hom) polymorphisms only.
•  Default argument: on, marking both heterozygous and homozygous polymorphisms
•  If omitted: on
•  This is a new flag.

-tag (-t) [tag type]
Use this flag to specify the tag type used to mark SNP sites for viewing in Consed. The three tag types are "genotype", "polymorphism", and "rank". The tag types can be abbreviated as g, p and r, respectively. Using the genotype tag results in putative polymorphic sites marked on the consensus sequence with color-coded tags indicating rank, and putative SNPs marked with pink tags on the sample sequences. Using the rank tag results in color-coded tags indicating rank placed on both the consensus sequence and the sample sequences (see How PolyPhred scores SNP sites for the color codes). Using the polymorphism tag results in a blue tag placed on all putative polymorphic sites on the consensus sequence and pink tags indicating putative SNPs on the sample sequences.
•  If omitted: genotype

-update [on / off]
Use this flag to control updating of the ACE and PHD files. If updating is switched off, the ACE and PHD files are not updated, and the PolyPhred results cannot be viewed in Consed.
•  Default argument: on
•  If omitted: on

-verbosity (-v) [0 / 1 / 2]
Use this flag to set the level of status reporting that will be written to the screen as PolyPhred is running. The allowed arguments range from 0 (least reporting) to 2 (most reporting).
•  If omitted: 0

-version
Use this flag to see the PolyPhred version and build number.
•  If omitted: normal operation

-window (-w) [number]
Use this flag to set the window width. PolyPhred uses the window width, together with the quality limit, to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed).
•  Accepted numbers: 5 - 50
•  If omitted: 40

-xml [on / off]
Use this flag to specify the format of the PolyPhred output.
•  Default argument: on
•  If omitted: off
•  This is a new flag.


How PolyPhred scores SNP sites

A SNP site generally appears in the sequence traces as two overlapping peaks with reduced peak heights. Ideally, the areas under these two peaks are nearly the same, and each peak is about half as high as a hypothetical homozygous peak would be at the same position. PolyPhred derives this hypothetical homozygous peak height by comparing each trace with an average of all the traces.

When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a score that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. The scores range from 99 to 0, with 99 indicating a very good fit.

If a site is determined to be homozygous, PolyPhred compares its genotype with that of a standard sequence, which can be either the consensus sequence or a user-specified reference sequence. If the genotypes do not match, the site is marked as a minor or alternative allele.

When all sites at a position (i.e., a column as viewed in Consed) have been assigned a score, PolyPhred calculates an overall score and genotype for the position. This score depends on the highest-scoring site in the sample sequences, with additional points given if potential SNPs appear on both strands. If the overall score is greater than or equal to the score limit (see the -score flag), then PolyPhred marks the position as polymorphic. The number of sites that PolyPhred marks can be controlled by adjusting the score limit (see Reducing the false-positive rate).

If the six-point ranking system is selected, PolyPhred converts the score to a rank according to the table below. The table also shows the tag colors used to display the rank and genotype tags in Consed.

  Score    Rank    Tag Color
  99-90    1       red
  89-60    2       orange
  59-20    3       green
  19-10    4       dark blue
  9-5      5       magenta
  4-0      6       purple
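
The conversion in the table above can be written as a small lookup. This Python sketch is an illustration built only from the table values. The function name score_to_rank is hypothetical and does not correspond to anything in the PolyPhred source.

```python
# Lower score bound of each rank, taken from the table above.
RANK_FLOORS = [(90, 1), (60, 2), (20, 3), (10, 4), (5, 5), (0, 6)]


def score_to_rank(score):
    """Map a score (0-99) to a rank (1-6)."""
    if not 0 <= score <= 99:
        raise ValueError("score must be between 0 and 99")
    for floor, rank in RANK_FLOORS:
        if score >= floor:
            return rank


print([score_to_rank(s) for s in (99, 60, 59, 10, 9, 4)])
# prints: [1, 2, 3, 4, 5, 6]
```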


The output report

To facilitate parsing of the output file, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.

The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.

Next is the COMMAND_LINE block. Listed in this block are the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. This includes the working directory and the ACE file that was used, and those parameters that affect the analysis.

The rest of the report contains results for one or more contigs. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks appear in the output report by using the -block flag.
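
The BEGIN_/END_ token structure described above lends itself to a simple line-based parser. The Python sketch below is a minimal illustration under several assumptions: each token sits on a line of its own, the whole line after BEGIN_CONTIG is taken as the contig name, and the sample report is hypothetical and heavily shortened. The contents of the individual lines inside each block are not interpreted.

```python
def parse_blocks(text):
    """Yield (contig, block, lines) for every innermost block in a report.

    MESSAGE and CONTIG are treated as wrappers; contig is None for blocks
    outside any contig (such as HEADER).
    """
    contig = None
    want_name = False   # True on the line right after BEGIN_CONTIG
    block = None
    lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if want_name:
            contig, want_name = line, False
        elif line in ("BEGIN_MESSAGE", "END_MESSAGE"):
            continue
        elif line == "BEGIN_CONTIG":
            want_name = True
        elif line == "END_CONTIG":
            contig = None
        elif line.startswith("BEGIN_"):
            block, lines = line[len("BEGIN_"):], []
        elif block is not None and line == "END_" + block:
            yield contig, block, lines
            block = None
        elif block is not None and line:
            lines.append(line)


# Hypothetical report; the lines inside each block are placeholders only.
sample = """\
BEGIN_MESSAGE
BEGIN_HEADER
(header lines)
END_HEADER
BEGIN_CONTIG
Contig1
BEGIN_POLY
(one line per site)
END_POLY
END_CONTIG
END_MESSAGE
"""

for contig, block, lines in parse_blocks(sample):
    print(contig, block, len(lines))
```

Each yielded tuple can then be handed to block-specific code, for example to load the POLY lines of one contig into a database table.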

If the -ref flag is used, PolyPhred adds an additional field in the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. The extra field, which comes second after the consensus sequence position, is the position relative to a reference sequence.

The POLY block
In this block, the putative SNP sites identified by PolyPhred are listed, as well as sites marked by columntag type tags (see User-defined manual tags). Each line reports the consensus sequence position, the 5' sequence flanking the polymorphic site, the two most common alleles at the site, the 3' sequence flanking the site, and the overall score assigned to the site.
•  XML tag: block-snp_site    subtag: snp_site

The GENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each putative SNP site in the POLY block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score.
•  XML tag: block-snp_genotype    subtag: snp_genotype

The COLUMNGENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each manual-SNP tag applied to the consensus sequence. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
•  XML tag: block-manual_snp    subtag: snp_genotype

The INDEL block
If the -indel flag is set to 'on', the putative indel sites are listed in this block. Each line reports the consensus sequence position, the position relative to the sample sequence in which the indel was found, the name of the sample sequence, the genotype ('+-' indicates a heterozygote, '--' indicates a homozygous deletion), and the length of the indel.
•  XML tag: block-marked_indel    subtag: marked_indel

The POLYINDEL block
In this block, the manual-indel tag sites applied to the consensus sequence are listed. Each line reports the consensus sequence position, the 5' sequence flanking the indel site, the segment involved in the indel, the 3' sequence flanking the site, and the comment if one is present. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
•  XML tag: block-indel_site    subtag: indel_site
•  This is a new block.

The COLUMNINDEL block
In this block, the genotypes of the individual sample sequences are listed for each manual-indel tag listed in the POLYINDEL block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, and the genotype. The tag used to specify the genotype can be user-defined in the .polyphredrc file (see User-defined manual tags).
•  XML tag: block-manual_indel    subtag: manual_indel
•  This is a new block.

The MANUALGENOTYPE block
In this block, sample sequence sites that have been tagged manually are listed. Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, the identity of the tag, and the comment if one is present. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
•  XML tag: block-manual_genotype    subtag: manual_genotype

The VERIFIED block
In this block, sites manually tagged as verified are listed. Each line reports the consensus sequence position and the tag identity. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
•  XML tag: block-verified_site    subtag: verified_site

The MICROSATELLITE block
If the -ms flag is set to 'on', this block lists the microsatellite sequences that were found. Each line reports the consensus sequence position of the 5' end of the microsatellite and the repeat pattern.
•  XML tag: block-microsatellite    subtag: microsatellite
•  This is a new block.

The SAMPLE block
The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the positions of the left and right boundaries of the search region (between the trimmed ends), and the average site quality, as determined by Phred, within the search region.
•  XML tag: block-sample_quality    subtag: sample_quality

The COVERAGE block
This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
•  XML tag: block-coverage    subtag: coverage


Detection of insertion/deletion polymorphisms

The indel detection feature is activated using the -indel flag. PolyPhred identifies sample sequences with putative heterozygous indels, as well as sequences with a deletion greater than two bases relative to the consensus sequence.

When PolyPhred identifies an indel site, it marks it on the consensus sequence with an 'indelSite' tag. Sample sequences containing the indel are marked with a 'heterozygoteIndel' tag, while those that do not are marked with a 'homozygoteIndel' tag.

PolyPhred is sometimes inaccurate in determining the positions and lengths of indels. Therefore, a manual tagging system is provided for marking the correct positions and lengths of indels. The corrected positions and lengths will be reported in the PolyPhred output (see User-defined manual tags).

Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, it is necessary to modify the .consedrc file. Add the following lines to the .consedrc file:

  consed.customConsensusTag1: indelSite
  consed.tagColorCustomConsensusTag1: DarkCyan
  consed.customTag1: indel
  consed.tagColorCustomTag1: DarkOrange

If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.
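
Finding the next available number can be automated. The Python sketch below scans the text of a .consedrc file for entries of the form shown above and returns the smallest unused number. It is an illustration only: the function name is hypothetical, "next available" is taken to mean the smallest unused number, and the existing entries in the example are invented.

```python
import re


def next_free_index(consedrc_text, prefix):
    """Smallest positive N for which 'consed.<prefix>N:' is not yet used."""
    pattern = re.compile(r"^\s*consed\." + re.escape(prefix) + r"(\d+)\s*:",
                         re.MULTILINE)
    used = {int(m.group(1)) for m in pattern.finditer(consedrc_text)}
    n = 1
    while n in used:
        n += 1
    return n


# Hypothetical existing .consedrc content in which slot 1 is already taken.
existing = """\
consed.customConsensusTag1: myConsensusTag
consed.tagColorCustomConsensusTag1: red
consed.customTag1: myReadTag
consed.customTag2: anotherReadTag
"""

print(next_free_index(existing, "customConsensusTag"))  # 2
print(next_free_index(existing, "customTag"))           # 3
```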


User-defined manual tags

One of the features available in the Consed program is the ability to create custom tags. These tags can be used to mark or highlight specific sites or regions on the consensus sequence or on individual sample sequences. For example, following analysis by PolyPhred, the user can manually mark putative SNP sites as verified, or change an incorrect genotype. To create custom tags, the user needs to define the tags in the .consedrc file (see the Consed documentation under the Help menu).

PolyPhred can be set to recognize four types of custom tags, and take the appropriate action when they are encountered. This provides a way for the user to pass information from Consed to the PolyPhred output file. For example, PolyPhred can be set to recognize a custom "verified" tag and report sites marked with this tag type in the VERIFIED block of the output file. In addition, two of the custom tag types, columntag and columnindeltag, can be used to force PolyPhred to report genotypes for all sample sequences at specified positions.

For PolyPhred to recognize the tags, they must be listed in the .polyphredrc file (see Customizing PolyPhred). Once the .polyphredrc file has been set up, the typical procedure is to 1) assemble the data, 2) run PolyPhred, 3) use Consed to analyze the results, mark sites and make changes, and 4) run PolyPhred again to obtain both the PolyPhred- and user-generated information in the output file.

The four tag types are:

The manualtag type
Tags of this type are used to mark or edit a site in a sample sequence. Typically these tags are used to change the genotype call made by Phred or PolyPhred. Sites marked with these tags are listed in the MANUALGENOTYPE block.

The verifiedtag type
This tag type is applied to the consensus sequence to indicate that a polymorphic site is verified. Sites marked with these tags are listed in the VERIFIED block.

The columntag type
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide SNP genotypes for all of the sample sequences at the tagged sites. Sites marked by these tags are listed in the POLY block. The genotypes in the sample sequences are listed in the COLUMNGENOTYPE block.

The columnindeltag type
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide indel genotypes for all of the sample sequences at the tagged sites. The tags can be used to mark the positions and define the length of indel sites. The tag should "cover" the segment involved in the indel so that PolyPhred can report the indel segment in the output. Sites marked by these tags are listed in the POLYINDEL block, and the genotypes in the sample sequences are listed in the COLUMNINDEL block. The name of the tag that marks the site will be used to indicate the homozygous genotype. The heterozygous genotype can be set in the .polyphredrc file with the 'indelhettag' key-word. If this is not set, PolyPhred will indicate heterozygotes with the label 'heterozygoteIndel'.


Installing PolyPhred

  1. Make sure the following programs are installed:
      phred              version 0.961028 or later
      phrap              version 0.960731 or later
      phd2fasta          version 0.971024 or later
      consed             version 13.0 or later
    

  2. Download the PolyPhred package for the appropriate platform. Put the file in a directory where it is to be unpacked.

  3. Run "tar xzf polyphred.tar.gz". This should produce the following files and directories:
      polyphred          the PolyPhred program
      sudophred          tool for making chromat, PHD and POLY files from FASTA files
      polyphred.html     this document
      phredPhrap         perl script for running phred and phrap together in the correct order.
    

  4. Move or copy the polyphred, sudophred and phredPhrap files to the directory from which they will be run, such as
    /usr/local/genome/bin/

  5. If you already have a copy of phredPhrap and wish to keep it, you must open the phredPhrap file and edit it as follows:

    1. Uncomment (remove the # from) the line
      # $polyPhredExe = "/usr/local/genome/bin/polyphred";
      Make sure the path within the quotes matches the directory in the previous step.

    2. Change the 0 to 1 in the line
      $bUsingPolyPhred = 0;

    3. phredPhrap also contains instructions for running PolyPhred automatically after Phred and Phrap. It is recommended that these lines be inactivated and PolyPhred be run separately. This makes it easier to determine the source when problems occur. To inactivate the lines, remove or comment out the following:

      if ( $bUsingPolyPhred ) {

      print "\n\n--------------------------------------------------------\n";
      print "Now running polyphred for polymorphism detection...\n";
      print "--------------------------------------------------------\n\n\n";

      $szPolyPhredFile = $szBaseName . ".polyphred.out";
      $szPolyPhredFile = $szBaseName . ".fasta.screen.polyphred.out";

      !system( "$polyPhredExe -ace $szAceFileToBeProduced > $szPolyPhredFile" ) ||
      die "some problem running $polyPhredExe $!";

      }

Read the section Customizing PolyPhred, as well as the section Detection of insertion/deletion polymorphisms for instructions on customizing Consed.


Running PolyPhred

PolyPhred reads and modifies data files that are generated by the programs Phred and Phrap, and that can be examined by the program Consed. These programs require the sequence data files to be located in a 'work directory' containing three subdirectories called 'chromat_dir', 'phd_dir' and 'edit_dir'. In addition, PolyPhred needs a fourth subdirectory called 'poly_dir'. It is recommended that a separate working directory be created for each data set. For example, if the data set is called "mydata", a directory called mydata can be created:

  mkdir mydata
Within this directory, create the four subdirectories as follows:
  cd mydata
  mkdir chromat_dir edit_dir phd_dir poly_dir

After these directories have been created, move or copy the chromat files to the chromat_dir directory.
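
When many data sets are processed, the directory layout can be created from a script instead of by hand. The Python sketch below reproduces the mkdir commands above. The function name make_work_dir is hypothetical, and the example runs in a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path

# The four subdirectories named in the text above.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")


def make_work_dir(root):
    """Create root and the four data subdirectories; return their paths."""
    root = Path(root)
    made = []
    for name in SUBDIRS:
        path = root / name
        path.mkdir(parents=True, exist_ok=True)
        made.append(path)
    return made


# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    made = make_work_dir(Path(tmp) / "mydata")
    print([p.name for p in made])
# prints: ['chromat_dir', 'edit_dir', 'phd_dir', 'poly_dir']
```

For real use, pass the path of the working directory, such as "mydata", to make_work_dir.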

If a reference sequence is to be included in the assembly, use the sudophred tool to generate fake chromat, PHD and POLY files.

To assemble the data, cd to the edit_dir directory and run "phredPhrap mydata". The program phredPhrap automatically runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir, including one with the extension .ace.1 (the ACE file), and several files in the phd_dir and poly_dir directories.

View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.

Run "polyphred". Include any desired flags on the command line.

Use Consed to view the polymorphic sites marked by PolyPhred (see Customizing PolyPhred).


Reducing the false-positive rate

There are three ways to affect the frequency of false-positive SNP calls made by PolyPhred. The most direct method is by using the -score flag to set the score limit. Only sites that receive a score at or above this limit are called, so increasing the limit results in fewer calls.

For those using the six-point ranking system, making the rank limit more stringent means setting this value to 2 or 1. This will have the same effect as increasing the score limit to 60 or 90, respectively.

In general, the false-positive SNP call rate tends to increase near the trimmed regions at the ends of a sequence. Therefore, trimming more of the ends will tend to reduce the number of false-positive calls. The length of the trimming is increased by raising the quality limit, which is set with the -quality flag.

The third method for reducing the false-positive rate is to use the source genotype resolution function. This function is effective for data sets in which many of the samples have been sequenced over a region multiple times (sequenced in both directions, for example). When analyzing such data sets, PolyPhred will occasionally make discrepant calls, that is, the genotypes in two sequences from the same sample will disagree at a particular site. In these cases, when the source genotype resolution function is on, PolyPhred will attempt to determine the correct genotype. This function generally improves the accuracy of the genotype determination, and results in fewer false-positive calls.
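
To make the notion of a discrepant call concrete, the Python sketch below groups genotype calls at one site by source identifier and reports the sources whose sequences disagree. It only detects discrepancies. How PolyPhred resolves them is not described here, and the function name, sample identifiers and genotypes are all hypothetical.

```python
from collections import defaultdict


def discrepant_sources(calls):
    """Return {source: sorted genotypes} for sources with conflicting calls.

    calls is an iterable of (source_id, genotype) pairs for one site.
    """
    by_source = defaultdict(set)
    for source, genotype in calls:
        by_source[source].add(genotype)
    return {s: sorted(g) for s, g in by_source.items() if len(g) > 1}


# Hypothetical calls at one site: sample S2 was read twice and disagrees.
calls = [("S1", "AA"), ("S1", "AA"), ("S2", "AG"), ("S2", "AA"), ("S3", "GG")]
print(discrepant_sources(calls))
# prints: {'S2': ['AA', 'AG']}
```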

For all of these methods, reducing the number of false-positive calls will also result in an increase in the number of real SNPs that are missed (false negatives). Generally, as one reduces the false-positive rate, the number of false positives that are eliminated is much greater than the number of missed real SNPs. Also, the first real sites that are missed are the rare SNPs, that is, sites with only one or two heterozygotes present in the data set.


Using the sudophred tool

The sudophred program is a tool that can be used to generate fake chromat, PHD and POLY files from DNA sequences in FASTA format. Fake chromat and PHD files are needed if one wishes to include a reference sequence in the assembly of the data set (see Using a reference sequence). Also, if one wants to compare data from sequence trace (chromat) files with text sequences, the text sequences need to be converted into all three file types.

The sudophred program takes one text file as input. The text file can contain one or more sequences in FASTA format. If one is generating fake data files for a reference sequence, sudophred writes data files for the first sequence only. Otherwise, sudophred will generate data files for each of the sequences in the text file. In either case, the names of the data files are taken from the string that follows the '>' at the beginning of each sequence.

One way to run sudophred is to put the FASTA file in an edit_dir directory. Sudophred will write each file that it generates into the appropriate directory. That is, sudophred writes the chromat file in the chromat_dir directory, the phd file in phd_dir, and the poly file in poly_dir. One can also put the FASTA file in an arbitrary directory and run sudophred from there. In this case, sudophred will write all of the files into that same directory. The files must then be moved to the appropriate data subdirectories. In either case, it is easiest to generate the fake data files before running the phredPhrap program that assembles the data into contigs.

By default, sudophred writes all three files. The chromat files are written in SCF format. In the phd files, the quality values are all 59.

To run sudophred, enter:

  sudophred [filename]

where filename is the name of the text file containing the sequences. The file name must always be the first argument.

To use sudophred to generate files for a reference sequence, use the -r flag. This flag can be followed by a string that PolyPhred will use to identify the reference sequence. For example:

  sudophred [filename] -r .XYZ

If no string is supplied, sudophred will use the default string .REF.

To change the quality score, use the -q flag followed by the score (an integer from 0 to 59). For example:

  sudophred [filename] -q 20

To write the chromat files in ABI format, use the -abi flag.

  sudophred [filename] -abi

To display a help message, run "sudophred -h" or "sudophred -help".

To display version number, run "sudophred -v" or "sudophred -version".
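
For users who drive sudophred from a script, the following Python sketch assembles an argument list from the flags described above. The function sudophred_args is a hypothetical helper, and combining several flags in a single call is an assumption; the documentation only shows each flag on its own.

```python
def sudophred_args(fasta_name, reference=False, ref_id=None, quality=None, abi=False):
    """Build a sudophred argument list from the documented flags."""
    # The file name must always be the first argument.
    args = ["sudophred", fasta_name]
    if reference or ref_id is not None:
        args.append("-r")
        if ref_id is not None:
            args.append(ref_id)  # optional reference identifier, e.g. ".XYZ"
    if quality is not None:
        if not 0 <= quality <= 59:
            raise ValueError("quality score must be an integer from 0 to 59")
        args += ["-q", str(quality)]
    if abi:
        args.append("-abi")  # write chromat files in ABI rather than SCF format
    return args


if __name__ == "__main__":
    # Only prints the command; pass the list to subprocess.run() to execute it
    # on a system where sudophred is installed.
    print(" ".join(sudophred_args("ref.fasta", reference=True)))
    print(" ".join(sudophred_args("seqs.fasta", quality=20)))
```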


Using a reference sequence

For the purpose of locating SNPs and other features on a standard sequence map, it is useful to include the standard, or reference, sequence in the data assembly. One can then run PolyPhred with the -ref flag to obtain the SNP positions relative to that reference sequence. Furthermore, one might want to have PolyPhred compare the sample sequences with the reference sequence rather than with the consensus sequence that is generated by Phrap. This can be done by running PolyPhred with the -refcomp flag.

When either the -ref or the -refcomp flag is used, PolyPhred reports two positions in the output file rather than one. The blocks displaying this alternate format are the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. In each block, the first number is the position of the feature relative to the consensus sequence, and the second is the position relative to the reference sequence.

To include a reference sequence in the assembly, one should first create the necessary data files from the reference sequence. These files can be generated with the sudophred program supplied with PolyPhred (see Using the sudophred tool).

Use sudophred with the -r flag to generate the reference sequence data files. For example:

  sudophred [filename] -r

where filename is the name of the text file containing the reference sequence in FASTA format. The data files will be given names that begin with the string that follows the '>' at the beginning of the sequence, followed by the default reference identifier ".REF". In this case, one would run PolyPhred with the reference option as follows:

  polyphred -ref

To specify a different reference identifier, follow the -r flag with the identifier string. For example, to set the reference identifier as "xYZ", run:

  sudophred [filename] -r xYZ

In this case, the data files will contain the string "xYZ" in the file names, rather than ".REF", and it will be necessary to select the reference option as follows:

  polyphred -ref xYZ
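
The naming rule for the reference data files can be sketched in Python as follows. The helper reference_name is hypothetical and is not part of the PolyPhred distribution. It assumes that only the first whitespace-delimited word after the '>' is used as the name; the documentation does not say how a header containing spaces is handled.

```python
def reference_name(fasta_text, ref_id=".REF"):
    """Return the expected base name of the reference data files.

    Only the first sequence in the FASTA text is considered, since
    sudophred writes reference files for the first sequence only.
    """
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            # Assumption: the name is the first word of the header line.
            return header.split()[0] + ref_id
    raise ValueError("no FASTA header found")


if __name__ == "__main__":
    fasta = ">geneA\nACGTACGT\n>geneB\nTTTTCCCC\n"
    print(reference_name(fasta))          # default identifier
    print(reference_name(fasta, "xYZ"))   # identifier given after -r
```

The identifier passed to this helper should be the same string that is later given to the -ref or -refcomp flag of PolyPhred.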


Customizing PolyPhred

PolyPhred can be customized to suit the preferences of the user by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as to specify any manual tags that PolyPhred should capture and write to the output report. This file is optional. If it is not present, PolyPhred will use its built-in default parameter values and will not capture manual tags.

When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks for a directory in the user's shell rc file. The directory is specified by including in the shell rc file the line:

  setenv POLYPHRED_PATH [path]

where [path] is the directory containing the .polyphredrc file.
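
The search order can be sketched in Python as shown below. The function find_polyphredrc is a hypothetical helper, not PolyPhred code. It assumes that the setenv line makes POLYPHRED_PATH available as an ordinary environment variable.

```python
import os


def find_polyphredrc(cwd=None, home=None, environ=None):
    """Return the path of the first .polyphredrc file found, or None."""
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home
    environ = os.environ if environ is None else environ

    # Search order: current directory, home directory, then POLYPHRED_PATH.
    candidates = [cwd, home]
    extra = environ.get("POLYPHRED_PATH")
    if extra:
        candidates.append(extra)

    for directory in candidates:
        path = os.path.join(directory, ".polyphredrc")
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    print(find_polyphredrc() or "no .polyphredrc found; built-in defaults apply")
```

Checking which file is picked up in this way can help when a run behaves differently in two directories, since a .polyphredrc file in the current directory takes precedence over the other two locations.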

Each line in the .polyphredrc file must be a blank line, a comment line beginning with a '#' character, or a line beginning with one of the following key-words:
flag, outputfile, navfile, refID, acedir, phddir, polydir, date, verifiedtag, manualtag, columntag, columnindeltag and indelhettag.

The 'flag' key-word can be used with any of the command-line flags to change a default value. For example, to change the default score limit to 30 and the quality limit to 25, enter these lines in the .polyphredrc file:

  flag -score 30
  flag -q 25

The following line

  flag -output out.txt

changes two defaults: it sets the name of the output file to 'out.txt' and causes PolyPhred to write the output to a file with that name rather than to the screen. To change the default file name but keep output to the screen as the default activity, use the 'outputfile' key-word, as:

  outputfile out.txt

Then, to use the new default output file name, run "polyphred -o on".

Similarly, both lines below change the default name of the navigation file, but the first line causes PolyPhred to write a navigation file by default, while the second line leaves the default activity off.

  flag -nav [file name]
  navfile [file name]

All three lines below change the default reference sequence identifier. The first two lines turn their functions on by default, while the third line leaves the default activities off.

  flag -ref [identifier]
  flag -refcomp [identifier]
  refID [identifier]

The 'acedir', 'phddir' and 'polydir' key-words allow the user to set the locations of the data files to directories other than the ones required by Phred, Phrap and Consed. The 'acedir' key-word sets the location of the ace file (which is normally in the edit_dir directory). The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. A directory is considered to be within the work directory unless an absolute path (one that starts with a '/') is given. Use a '.' to indicate that a directory is the same as the work directory.
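
The rule for interpreting these directory values can be sketched in Python. The function resolve_data_dir and the example paths are hypothetical; the work directory is passed in explicitly because this sketch makes no assumption about how PolyPhred determines it.

```python
import posixpath


def resolve_data_dir(work_dir, value):
    """Resolve an acedir, phddir or polydir value against the work directory."""
    if value.startswith("/"):
        # An absolute path is used as given.
        return posixpath.normpath(value)
    # Anything else is taken to be inside the work directory;
    # '.' therefore resolves to the work directory itself.
    return posixpath.normpath(posixpath.join(work_dir, value))


if __name__ == "__main__":
    work = "/projects/run1"  # hypothetical work directory
    for value in (".", "phd_dir", "/data/shared/poly"):
        print(f"{value!r} -> {resolve_data_dir(work, value)}")
```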

The 'date' key-word allows the user to set the format of the date that appears at the top of the output file. The key-word must be followed by one of six format codes:

  2-digit year   4-digit year   format
  DMY            DMYY           day/month/year
  MDY            MDYY           month/day/year
  YMD            YYMD           year/month/day

The default is the DMY format.
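
As an illustration of the six codes, the Python sketch below maps each one to a strftime pattern. The '/' separator is taken from the format descriptions above; zero-padding of days and months is an assumption, so the exact text that PolyPhred writes may differ.

```python
from datetime import date

# Mapping from date code to strftime pattern (zero-padding is assumed).
DATE_FORMATS = {
    "DMY": "%d/%m/%y",
    "DMYY": "%d/%m/%Y",
    "MDY": "%m/%d/%y",
    "MDYY": "%m/%d/%Y",
    "YMD": "%y/%m/%d",
    "YYMD": "%Y/%m/%d",
}


def format_report_date(day, code="DMY"):
    """Format a date according to one of the six date codes."""
    return day.strftime(DATE_FORMATS[code])


if __name__ == "__main__":
    example = date(2005, 4, 21)
    for code in DATE_FORMATS:
        print(code, format_report_date(example, code))
```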

Four of the key-words set tag names for the four tag types (see User-defined manual tags). Each tag type can have more than one name (see the example .polyphredrc file below). In addition, the indelhettag key-word allows the user to specify the tag that will be used to indicate heterozygous indels.

Here is an example of a .polyphredrc file:

  date YYMD
  flag -q 25
  flag -f 16

  outputfile report.txt
  refID .refSeq

  # Manual Tags
  verifiedtag    polymorphism
  columntag      manualGenotype
  columnindeltag indel:++
  columnindeltag indel:--
  indelhettag    indel:+-
  manualtag      heterozygote
  manualtag      homozygote
  manualtag      indel
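
The line rules described in this section (blank lines, '#' comments, and key-word lines) can be checked with a short Python sketch before a .polyphredrc file is used. The function parse_polyphredrc is a hypothetical helper and does not reproduce PolyPhred's own parser. It assumes that leading whitespace is ignored and that values are separated by whitespace.

```python
KEYWORDS = {
    "flag", "outputfile", "navfile", "refID", "acedir", "phddir", "polydir",
    "date", "verifiedtag", "manualtag", "columntag", "columnindeltag",
    "indelhettag",
}


def parse_polyphredrc(text):
    """Return a list of (key-word, [values]) pairs in file order."""
    entries = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, *values = line.split()
        if keyword not in KEYWORDS:
            raise ValueError(f"line {number}: unknown key-word {keyword!r}")
        entries.append((keyword, values))
    return entries


# The example file shown above.
EXAMPLE = """\
date YYMD
flag -q 25
flag -f 16

outputfile report.txt
refID .refSeq

# Manual Tags
verifiedtag    polymorphism
columntag      manualGenotype
columnindeltag indel:++
columnindeltag indel:--
indelhettag    indel:+-
manualtag      heterozygote
manualtag      homozygote
manualtag      indel
"""

if __name__ == "__main__":
    for keyword, values in parse_polyphredrc(EXAMPLE):
        print(keyword, values)
```

Because a key-word such as manualtag may appear more than once, the sketch keeps the entries as an ordered list rather than collapsing them into a dictionary.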


Who to contact with questions and problems

If you have questions or problems with Phred, Phrap or Consed, or you need to obtain these programs, please see the web site at:
http://www.phrap.org

If you have questions or problems with PolyPhred, please

  1. read this documentation carefully;

  2. go to this web site: http://droog.gs.washington.edu

    Follow the "PolyPhred" link for the email address of the person to contact. Please do not email questions to the web master.

If you discover an error in PolyPhred, please follow step 2 above.


References

1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A., 1994
   "Comparative analysis of human DNA variations by fluorescence-based sequencing 
   of PCR products", Genomics 25, 615-622.

2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L, 1997, "Polyphred: automating the 
   detection and genotyping of single nucleotide substitutions using fluorescence-based 
   resequencing", Nucleic Acids Research, 25: 2745-2751.

3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated 
   sequencer traces using phred.  I. Accuracy assessment", Genome Research 8: 175-185.

4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using 
   phred.  II. Error probabilities", Genome Research 8: 186-194.  

5. Green, P., 1994, Phrap, unpublished.  http://www.phrap.org

6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence 
   finishing", Genome Research 8: 195-202.