Download and Documentation for multiPopTagSelect Version 1.1

Program: multiPopTagSelect Version: 1.1
Copyright 2005-2006
by Bryan N. Howie, Christopher S. Carlson, Mark J. Rieder, and Deborah A. Nickerson
University of Washington

All rights reserved.

You have the permission to use and develop ("the software"), provided that the following conditions are met:

1. The software is not published, distributed, or otherwise transferred or made available to others.

2. If utilization of the software results in outcomes which will be published, please specify the version of the software you used and cite attributions noted above.

3. You acknowledge that University of Washington ("UW") and the UW developers may develop modifications to the software that may be substantially similar to your modifications of the software and that UW and UW developers shall not be constrained in any way by you in UW's and UW developers' use or management of such modifications. You acknowledge the right of the UW and UW developers to prepare and publish modifications to the software that may be substantially similar or functionally equivalent to your modifications and improvements, and if you obtain patent protection for any modification or improvement to the software, you agree not to allege or enjoin infringement of your patent by the UW and UW developers.

This software is provided AS IS and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


  1. Introduction
  2. Usage and Flags
  3. Output
  4. Download and Installation
  5. References


1. Introduction

The MultiPop-TagSelect algorithm, as implemented in the program, attempts to select a near-minimal set of tagging single-nucleotide polymorphisms (tagSNPs) that account for all observed patterns of linkage disequilibrium (LD) in multiple populations. Specifically, it processes the output of tagSNP selection algorithms that designate bins of nearly equivalent SNPs, such that choosing (and typing) one SNP from each bin is sufficient to capture all associations observed in the data. Most of this documentation concerns one particular tagSNP selection method known as ldSelect, but multiPopTagSelect can also be used with other methods whose output has been converted into ldSelect format. This program can optimize SNP selection across any number of populations, and it can be applied to genomic regions ranging from small genes to entire chromosomes.

The program operates upon a different ldSelect output file for each population of interest. The optimization proceeds in two steps: First, all observed tagSNPs are assigned to mutually exclusive clusters and one "maximally informative" SNP (one that tags bins in the most populations) is chosen from each cluster. Second, the maximally informative tagSNPs are assembled into a list and removed one at a time. If a SNP can be removed from the list without causing any bins to lose representation, it is discarded; otherwise, it is returned to the list. The maximally informative SNPs that cannot be discarded through this process represent the final set of selected SNPs.
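
The second step can be illustrated with a short sketch. It is written in Python rather than the program's Perl and is not taken from the program itself: the function name `prune_redundant`, the data layout, and the order in which SNPs are tried are all assumptions made for illustration.

```python
def prune_redundant(candidates, bins_tagged):
    """Illustrative second step: discard SNPs that no bin depends on.

    candidates  -- list of SNP identifiers chosen in the first step
    bins_tagged -- dict: SNP -> set of (population, bin) pairs it tags
    (Both structures are hypothetical, not the program's own.)
    """
    kept = list(candidates)
    for snp in list(candidates):
        # Tentatively remove the SNP from the list.
        trial = [s for s in kept if s != snp]
        covered = set().union(*(bins_tagged[s] for s in trial))
        # Discard it only if every bin it tags is still represented;
        # otherwise it stays in the list.
        if bins_tagged[snp] <= covered:
            kept = trial
    return kept


bins_tagged = {
    "A": {("pop1", 1), ("pop2", 1)},
    "B": {("pop1", 1)},
    "C": {("pop2", 2)},
}
final_set = prune_redundant(["B", "A", "C"], bins_tagged)
```

Here SNP B is discarded because SNP A already represents its only bin, while A and C are each the sole representative of at least one bin and therefore remain.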

2. Usage and Flags

The multiPopTagSelect program is written in the Perl programming language. It runs only from the command line and accepts the following flags:
[-ld_select]	file containing a list of ldSelect output file names
<-context>	file containing genomic and sequence context info
<-scores>	file containing SNP design scores
<-excluded>	file containing a list of SNPs that must not be selected
<-required>	file containing a list of SNPs that must be selected
<-optimal>	seek a provably optimal solution? (y or n; default = n)
<-verbose>	display program progress? (y or n; default = n)

Flags in square brackets are required; flags in angle brackets are optional.

-ld_select (required)

Specifies a file containing a list of ldSelect output files obtained from multiple populations in the same genomic region (details of ldSelect file format can be found here: ldSelect). Within this file, each filename should appear on a different line and be preceded by a file path if the ldSelect files do not reside in the directory from which the program will run. For example, suppose ldSelect had been used to identify tagSNPs in genotype data from the gene VCAM1 in three different populations, denoted pop1, pop2, and pop3, and that the output files had been stored in a subdirectory called MyData. In this case, the contents of the required input file might look like this:


The names of ldSelect output files do not need to follow the convention used above.
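
Before a run, it may be useful to confirm that every file named in the list can be found from the directory in which the program will run. The following Python sketch is one way to do that; the helper names `read_file_list` and `missing_files` are hypothetical and are not part of multiPopTagSelect.

```python
from pathlib import Path


def read_file_list(list_path):
    """Return the file names in a list file, one per non-empty line."""
    names = []
    for line in Path(list_path).read_text().splitlines():
        name = line.strip()
        if name:
            names.append(Path(name))
    return names


def missing_files(names, run_dir="."):
    """Return the names that do not resolve to a file relative to run_dir."""
    return [n.as_posix() for n in names if not (Path(run_dir) / n).is_file()]
```

An empty result from `missing_files` suggests that the file paths in the list are adequate for the chosen working directory.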

-context (optional)

Specifies a file that contains genomic and sequence context information for SNPs. The file follows a tab-delimited text format that contains the columns listed below:

SNP-identifier   genomic-context   sequence-context

For example:

	644	5'-flanking	unique
	887	5'-UTR		unique
	3428	intron		repeat
	3596	synon		unique
	4125	intron		unique
	8561	nonsynon	unique
	8994	intron		repeat
	9512	intron		repeat
	11340	frame-shift	unique
	15005	3'-UTR		unique
	15875	3'-flanking	repeat
	16211	3'-flanking	repeat

The SNP identifier must be a contiguous string of characters and, to prevent parsing errors in multiPopTagSelect, the third character in an identifier cannot be a hyphen. The program uses context information to preferentially select SNPs according to the following precedence hierarchy (in order of increasing precedence): 3'-flanking, intron, 5'-flanking, UTR (3' or 5'), synonymous coding, nonsynonymous coding, frame-shift. All sites in unique sequence are ranked ahead of those in repeat-containing sequence.
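
The precedence rules above can be expressed as a sort key. The Python sketch below is only an illustration of that ordering, not the program's code: the numeric ranks and the names `GENOMIC_RANK`, `context_key`, and `parse_context` are assumptions, and the genomic-context labels are those used in the example file.

```python
# Labels from the example context file, in order of increasing precedence.
# Both UTR labels share one rank, following the hierarchy in the text.
GENOMIC_RANK = {
    "3'-flanking": 0,
    "intron": 1,
    "5'-flanking": 2,
    "3'-UTR": 3,
    "5'-UTR": 3,
    "synon": 4,
    "nonsynon": 5,
    "frame-shift": 6,
}


def context_key(genomic, sequence):
    """Sort key in which a larger value means a more preferred SNP."""
    # Unique sequence outranks repeat-containing sequence before the
    # genomic context is considered at all.
    in_unique = 1 if sequence == "unique" else 0
    return (in_unique, GENOMIC_RANK[genomic])


def parse_context(text):
    """Parse lines of: SNP-identifier, genomic-context, sequence-context."""
    table = {}
    for line in text.splitlines():
        fields = line.split()  # tolerates repeated tabs between columns
        if len(fields) == 3:
            table[fields[0]] = (fields[1], fields[2])
    return table


example = "644\t5'-flanking\tunique\n3428\tintron\t\trepeat\n11340\tframe-shift\tunique\n"
contexts = parse_context(example)
ranked = sorted(contexts, key=lambda s: context_key(*contexts[s]), reverse=True)
```

With these three sites, the frame-shift SNP in unique sequence ranks first and the intronic SNP in repeat-containing sequence ranks last.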

Context files of this sort can also be used to create ldSelect output files that contain context information in the form of a two-letter code preceding each SNP identifier (see Output). Either the file described above or a set of ldSelect files with context codes is sufficient for multiPopTagSelect. Note, however, that ldSelect files do not include 5' and 3' designations; the context file must be provided if this information is to be used. In the absence of a context file, SNPs in flanking regions and UTRs are treated as if they came from the 3' end of a gene.

-scores (optional)

Specifies a score (on any desirable scale) for each SNP. For example, this score might reflect the estimated probability of successfully typing the SNP in a certain assay. The program preferentially selects SNPs with higher scores, although rankings by genomic and sequence context take precedence over these scores (i.e., SNP design scores determine rankings within any existing context categories). Since the scores are used to assign relative SNP rankings, their absolute magnitudes are not important. The file follows a tab-delimited format that contains the columns listed below:

SNP-identifier     design-score

For example:

	644	0.798
	887	0.965
	3428	0.241
	3596	0.900
	4125	0.650
	8561	0.886
	8994	0.613
	9512	0.712
	11340	0.000
	15005	0.589
	15875	0.544
	16211	0.622
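
To illustrate how design scores act only as a tie-breaker within a context category, the Python sketch below ranks four intronic SNPs from the examples above. It is a simplified illustration, not the program's code: the integer `context_rank` values (1 for unique sequence, 0 for repeat-containing sequence) and the function names are assumptions.

```python
def parse_scores(text):
    """Parse lines of: SNP-identifier, design-score."""
    scores = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            scores[fields[0]] = float(fields[1])
    return scores


def rank_snps(snps, context_rank, scores):
    """Order SNPs best-first: context decides, the score breaks ties."""
    return sorted(snps, key=lambda s: (context_rank[s], scores[s]), reverse=True)


scores = parse_scores("3428\t0.241\n4125\t0.650\n8994\t0.613\n9512\t0.712\n")
# Hypothetical encoding: 4125 lies in unique sequence, the rest in repeats.
context_rank = {"3428": 0, "4125": 1, "8994": 0, "9512": 0}
ranked = rank_snps(list(scores), context_rank, scores)
```

SNP 4125 is ranked first despite having a lower score than 9512, because its context category takes precedence; the remaining three SNPs are ordered by score alone.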

-excluded (optional)

Specifies a file that contains a list of SNPs that are not to be selected. This function is provided because sometimes a SNP is difficult (or impossible) to genotype on a particular platform, so it is convenient to specify that it not be selected as a tagSNP. If alternative tagSNPs reside in the same bin as an excluded SNP, one of them will be chosen to represent the bin. SNPs that were excluded by ldSelect using the same flag do not need to be excluded again. WARNING: If all tagSNPs in a bin are excluded, that bin will not be represented in the SNP set selected by multiPopTagSelect. The file contains a single column of SNP positions in plain-text format, one per line.
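
The situation described in the warning can be checked for in advance. The Python sketch below assumes the bins have already been read into a dictionary; the function name `unrepresented_bins` and the data layout are hypothetical and are not part of multiPopTagSelect.

```python
def unrepresented_bins(bins, excluded):
    """Return the bins in which every tagSNP is excluded.

    bins     -- dict: (population, bin number) -> set of tagSNP identifiers
    excluded -- set of SNP identifiers that must not be selected
    """
    return sorted(b for b, tags in bins.items() if tags <= excluded)


bins = {
    ("pop1", 1): {"644", "887"},
    ("pop1", 2): {"3428"},
    ("pop2", 1): {"3428", "8994"},
}
lost = unrepresented_bins(bins, excluded={"3428", "8994"})
```

In this made-up example, excluding SNPs 3428 and 8994 leaves Bin 2 of pop1 and Bin 1 of pop2 with no eligible tagSNP.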


-required (optional)

Specifies a file that contains a list of SNPs that must be selected. The program processes these SNPs like all the others, with the exception that required SNPs are not allowed to be discarded in the elimination steps. This is equivalent to minimizing the number of selected tagSNPs conditional upon the inclusion of the required SNPs. If the required SNPs are part of a minimal tagSNP set, the overall number of selected SNPs will not increase; if the required SNPs are not part of a minimal set, the selected SNP set will be slightly larger than the smallest set possible. It is valid to specify required SNPs that are not tagSNPs in any of the relevant ldSelect files. These SNPs will not affect the algorithm, but they will be included in the final SNP set (by convention, such SNPs will tag the "zero bin" in every population). The file contains a single column of SNP positions in plain-text format, one per line.
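
The "zero bin" convention can be illustrated with the output format described in Section 3, in which each selected SNP carries one bin number per population. The Python sketch below is an illustration only; the function name `format_selected`, the data layout, and the identifier 9999 are hypothetical.

```python
def format_selected(snp, bin_by_pop):
    """Build an output-style entry such as '3596-3/1/0'.

    bin_by_pop -- list with one dict per population, in the same order as
                  the ldSelect files, each mapping tagSNP -> bin number
    """
    # A SNP that is not a tagSNP in a population is shown with bin 0.
    bins = [str(pop.get(snp, 0)) for pop in bin_by_pop]
    return f"{snp}-" + "/".join(bins)


bin_by_pop = [
    {"887": 1, "3596": 3},  # pop1
    {"887": 2, "3596": 1},  # pop2
    {"887": 1},             # pop3
]
tagging_snp = format_selected("3596", bin_by_pop)
required_only = format_selected("9999", bin_by_pop)  # hypothetical required SNP
```

A required SNP that is not a tagSNP in any population would therefore appear with a zero for every population, for example "9999-0/0/0".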


-optimal (optional)

Tells the program whether to seek a provably optimal solution (as opposed to a near-optimal solution) at the expense of additional computing time: 'yes' (y) or 'no' (n, the default). If the initial solution is found to be non-optimal, it will be replaced with an optimal solution and a message like the following will be shown on the screen:

	Non-optimal solution found: set reduced from 132 SNPs to 130 SNPs.

-verbose (optional)

Tells the program whether to display its progress on the screen: 'yes' (y) or 'no' (n, the default). The messages displayed during the course of a complete run are as follows:

	1% of sites sorted.
	2% of sites sorted.
	99% of sites sorted.
	Sites sorted by bin content.
	First algorithm step 1% complete.
	First algorithm step 2% complete.
	First algorithm step 99% complete.
	Second algorithm step complete.

The first set of messages documents the process of sorting all observed tagSNPs by the number of populations in which they tag a bin (in decreasing order), a step that speeds up the algorithm. The next set of messages displays the percentage of tagSNPs that have passed through the first step of the algorithm. The second step of the algorithm is typically fast even for very large problems, so its progress is not shown in detail.


3. Output

The output file contains two lines. The first reports the total number of SNPs selected, and the second lists the selected SNPs in alphabetical or numerical order (as appropriate). Each selected SNP also has a suffix showing which bins it tags in the ldSelect files used as input; these bin numbers are displayed in the same order as the ldSelect files and separated by forward slashes. For example, in the three-population example shown above, the output "3596-3/1/0" means that SNP 3596 tags Bin 3 in pop1, Bin 1 in pop2, and is not a tagSNP in pop3.

For example:

	num_tagSNPs: 3
	Selected_tagSNPs: 887-1/2/1 3596-3/1/0 15875-2/2/0

If the -context flag is used or the ldSelect output files were created using the -context flag in that program, all of the SNP identifiers are preceded by a two-letter code indicating the sequence context and the genomic context for the sites. The first letter indicates the sequence context: (U)nique sequence, (R)epeat-containing sequence, or (X) other. The second letter indicates the genomic context: (F)lanking region, 5' or 3' (U)TR, (I)ntron, (S)ynonymous cSNP, (N)onsynonymous cSNP, (T) frame shift, or (X) other.

For example:

	num_tagSNPs: 3
	Selected_tagSNPs: UU-887-1/2/1 US-3596-3/1/0 RF-15875-2/2/0
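
For downstream processing, entries in the second output line can be split into their parts. The Python sketch below handles entries with and without a context code; the function names are hypothetical. It also assumes that a context code is recognized by a hyphen in the third character position, which would be consistent with the rule that the third character of a SNP identifier cannot be a hyphen.

```python
def parse_token(token):
    """Split one output entry into (context_code, snp_id, bins)."""
    code = None
    # Assumption: a two-letter context code, when present, is followed
    # directly by a hyphen in the third character position.
    if len(token) > 2 and token[2] == "-":
        code, token = token[:2], token[3:]
    # The bin suffix follows the last hyphen; bins are slash-separated.
    snp_id, bin_field = token.rsplit("-", 1)
    bins = [int(b) for b in bin_field.split("/")]
    return code, snp_id, bins


def parse_selected_line(line):
    """Parse a 'Selected_tagSNPs: ...' line into a list of parsed entries."""
    return [parse_token(t) for t in line.split()[1:]]


with_code = parse_token("US-3596-3/1/0")
without_code = parse_token("887-1/2/1")
```

A bin number of 0 in the parsed list indicates that the SNP is not a tagSNP in the corresponding population.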

4. Download and Installation

The program is written in the Perl programming language and runs on any operating system with Perl 5.0 or later installed. Click here to download.

5. References

1. Howie BN, Carlson CS, Rieder MJ, Nickerson DA. Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet. 2006 120(1):58-68.

2. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet. 2004 74(1):106-20.