Download and Documentation for multiPopTagSelect Version 1.1
The MultiPop-TagSelect algorithm, as implemented in the program multiPopTagSelect.pl, attempts to select a near-minimal set of tagging single-nucleotide polymorphisms (tagSNPs) that account for all observed patterns of linkage disequilibrium (LD) in multiple populations. Specifically, it processes the output of tagSNP selection algorithms that designate bins of nearly equivalent SNPs, such that choosing (and typing) one SNP from each bin is sufficient to capture all associations observed in the data. Most of this documentation concerns one particular tagSNP selection method known as ldSelect, but multiPopTagSelect can also be used with other methods whose output has been converted into ldSelect format. This program can optimize SNP selection across any number of populations, and it can be applied to genomic regions ranging from small genes to entire chromosomes.
The program takes one ldSelect output file for each population of interest. The optimization proceeds in two steps. First, all observed tagSNPs are assigned to mutually exclusive clusters, and one "maximally informative" SNP (one that tags bins in the most populations) is chosen from each cluster. Second, the maximally informative tagSNPs are assembled into a list and removed one at a time. If a SNP can be removed from the list without causing any bin to lose representation, it is discarded; otherwise, it is returned to the list. The maximally informative SNPs that cannot be discarded through this process form the final set of selected SNPs.
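The second step can be illustrated with a minimal Python sketch. This is not the code of multiPopTagSelect.pl; the function name, the data layout, and the toy bin assignments (including the extra SNP 4125 sharing a bin with 3596) are assumptions made for illustration only.

```python
def prune_tag_snps(candidates, bins_tagged):
    """Backward elimination over a list of candidate tagSNPs.

    candidates:  SNP identifiers, in the order they are tried for removal.
    bins_tagged: dict mapping each SNP to the set of (population, bin)
                 pairs it tags.
    """
    kept = list(candidates)
    for snp in candidates:
        others = [s for s in kept if s != snp]
        covered = set()
        for s in others:
            covered |= bins_tagged[s]
        # Discard the SNP only if none of its bins would lose representation;
        # otherwise it stays in the list.
        if bins_tagged[snp] <= covered:
            kept = others
    return kept


# Hypothetical toy data: 4125 tags a bin that 3596 also tags.
bins_tagged = {
    "887": {("pop1", 1), ("pop2", 2), ("pop3", 1)},
    "3596": {("pop1", 3), ("pop2", 1)},
    "4125": {("pop1", 3)},
    "15875": {("pop1", 2), ("pop2", 2)},
}
selected = prune_tag_snps(["4125", "887", "3596", "15875"], bins_tagged)
print(selected)
```

In this toy case only 4125 is redundant, because its single bin is still represented by 3596 after removal. The result of such a pass depends on the order in which SNPs are tried.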
Flags in square brackets are required; flags in diamond brackets are optional.
Specifies a file containing a list of ldSelect output files obtained from multiple populations in the same genomic region (details of the ldSelect file format are given in the ldSelect documentation). Within this file, each filename should appear on its own line and be preceded by a file path if the ldSelect files do not reside in the directory from which the program will run. For example, suppose ldSelect had been used to identify tagSNPs in genotype data from the gene VCAM1 in three different populations, denoted pop1, pop2, and pop3, and that the output files had been stored in a subdirectory called MyData. In this case, the contents of the required input file might look like this:
MyData/vcam1.pop1.ldSelect.out
MyData/vcam1.pop2.ldSelect.out
MyData/vcam1.pop3.ldSelect.out
The names of ldSelect output files do not need to follow the convention used above.
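Before a run, it can be useful to confirm that every file named in the list is reachable from the working directory. The following Python sketch is a hypothetical pre-flight check, not part of multiPopTagSelect; the helper names and the list file name are assumptions.

```python
import os
import tempfile


def read_ldselect_list(list_path):
    """Return the file names in the list file, one per non-blank line."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def missing_files(names):
    """Return the listed names that do not exist relative to the current directory."""
    return [name for name in names if not os.path.isfile(name)]


# Demonstration with a throwaway list file; the entries are the
# example file names used in this documentation.
with tempfile.TemporaryDirectory() as tmp:
    list_path = os.path.join(tmp, "ldselect_files.txt")  # hypothetical name
    with open(list_path, "w") as handle:
        handle.write(
            "MyData/vcam1.pop1.ldSelect.out\n"
            "MyData/vcam1.pop2.ldSelect.out\n"
            "MyData/vcam1.pop3.ldSelect.out\n"
        )
    names = read_ldselect_list(list_path)
    absent = missing_files(names)

print(len(names), "files listed;", len(absent), "not found")
```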
Specifies a file that contains genomic and sequence context information for SNPs. The file follows a tab-delimited text format that contains the columns listed below:
SNP-identifier	genomic-context	sequence-context
644	5'-flanking	unique
887	5'-UTR	unique
3428	intron	repeat
3596	synon	unique
4125	intron	unique
8561	nonsynon	unique
8994	intron	repeat
9512	intron	repeat
11340	frame-shift	unique
15005	3'-UTR	unique
15885	3'-flanking	repeat
16211	3'-flanking	repeat
The SNP identifier must be a contiguous string of characters and, to prevent parsing errors in multiPopTagSelect, the third character in an identifier cannot be a hyphen. The program uses context information to preferentially select SNPs according to the following precedence hierarchy (in order of increasing precedence): 3'-flanking, intron, 5'-flanking, UTR (3' or 5'), synonymous coding, nonsynonymous coding, frame-shift. All sites in unique sequence are ranked ahead of those in repeat-containing sequence.
Context files of this sort can also be used to create ldSelect output files that contain context information in the form of a two-letter code preceding each SNP identifier (see Output). Either the file described above or a set of ldSelect files with context codes is sufficient for multiPopTagSelect. Note, however, that ldSelect files do not include 5' and 3' designations; the context file must be provided if this information is to be used. In the absence of a context file, SNPs in flanking regions and UTRs are treated as if they came from the 3' end of a gene.
Specifies a score (on any desirable scale) for each SNP. For example, this score might reflect the estimated probability of successfully typing the SNP in a certain assay. The program preferentially selects SNPs with higher scores, although rankings by genomic and sequence context take precedence over these scores (i.e., SNP design scores determine rankings within any existing context categories). Since the scores are used to assign relative SNP rankings, their absolute magnitudes are not important. The file follows a tab-delimited format that contains the columns listed below:
644	0.798
887	0.965
3428	0.241
3596	0.900
4125	0.650
8561	0.886
8994	0.613
9512	0.712
11340	0.000
15005	0.589
15875	0.544
16211	0.622
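The way context and score combine into a single ranking can be sketched in Python as a composite sort key. This is an illustration under stated assumptions, not the program's code: it assumes that unique-versus-repeat sequence is the primary criterion, that genomic context comes next, that the design score only breaks ties within a context category, and that a SNP without a score defaults to 0.0. The names GENOMIC_RANK and rank_key are hypothetical; the context labels follow the example context file.

```python
# Genomic-context precedence from the text, lowest to highest.
GENOMIC_RANK = {
    "3'-flanking": 0,
    "intron": 1,
    "5'-flanking": 2,
    "3'-UTR": 3,
    "5'-UTR": 3,
    "synon": 4,
    "nonsynon": 5,
    "frame-shift": 6,
}


def rank_key(snp, context, scores):
    """Return a tuple that sorts higher for more preferred SNPs."""
    genomic, sequence = context[snp]
    unique = 1 if sequence == "unique" else 0
    # Assumption: a SNP with no design score is treated as scoring 0.0.
    return (unique, GENOMIC_RANK[genomic], scores.get(snp, 0.0))


# A few rows taken from the example context and score files.
context = {
    "3596": ("synon", "unique"),
    "4125": ("intron", "unique"),
    "8561": ("nonsynon", "unique"),
    "8994": ("intron", "repeat"),
    "9512": ("intron", "repeat"),
}
scores = {"3596": 0.900, "4125": 0.650, "8561": 0.886, "8994": 0.613, "9512": 0.712}

ranked = sorted(context, key=lambda s: rank_key(s, context, scores), reverse=True)
print(ranked)
```

Note that 3596 has the highest score in this subset but still ranks below 8561, because nonsynonymous coding outranks synonymous coding; the scores only separate 9512 from 8994, which share the same context.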
Specifies a file that contains a list of SNPs that are not to be selected. This function is provided because sometimes a SNP is difficult (or impossible) to genotype on a particular platform, so it is convenient to specify that it not be selected as a tagSNP. If alternative tagSNPs reside in the same bin as an excluded SNP, one of them will be chosen to represent the bin. SNPs that were excluded by ldSelect using the same flag do not need to be excluded again. WARNING: If all tagSNPs in a bin are excluded, that bin will not be represented in the SNP set selected by multiPopTagSelect. The file contains a column of all the SNP positions in text format as below:
644
887
3428
3596
4125
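Because of the warning above, it may be worth checking an exclusion list against the bins before running the program. The Python sketch below is a hypothetical helper, not a feature of multiPopTagSelect; the bin labels and bin memberships are invented for illustration.

```python
def unrepresented_bins(bin_tags, excluded):
    """Return the labels of bins whose tagSNPs are all excluded.

    bin_tags: dict mapping a bin label to the set of tagSNP identifiers in it.
    excluded: set of SNP identifiers that must not be selected.
    """
    return sorted(
        label for label, tags in bin_tags.items() if tags and tags <= excluded
    )


# Hypothetical bins for one population.
bin_tags = {
    "pop1/bin1": {"644", "887"},
    "pop1/bin2": {"3428"},
    "pop1/bin3": {"3596", "4125", "8561"},
}
excluded = {"644", "887", "3428", "3596", "4125"}

lost = unrepresented_bins(bin_tags, excluded)
print(lost)
```

Here bin3 survives because 8561 is still available, while bin1 and bin2 would go unrepresented.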
Specifies a file that contains a list of SNPs that must be selected. The program processes these SNPs like all the others, with the exception that required SNPs are not allowed to be discarded in the elimination steps. This is equivalent to minimizing the number of selected tagSNPs conditional upon the inclusion of the required SNPs. If the required SNPs are part of a minimal tagSNP set, the overall number of selected SNPs will not increase; if the required SNPs are not part of a minimal set, the selected SNP set will be slightly larger than the smallest set possible. It is valid to specify required SNPs that are not tagSNPs in any of the relevant ldSelect files. These SNPs will not affect the algorithm, but they will be included in the final SNP set (by convention, such SNPs will tag the "zero bin" in every population). The file contains a column of all the SNP positions in text format as below:
644
887
3428
3596
4125
Tells the program whether to seek a provably optimal solution (as opposed to a near-optimal solution) at the expense of additional computing time. Accepted values are 'yes' (y) or 'no' (n). If the initial solution is found to be non-optimal, it will be replaced with an optimal solution and a message like the following will be shown on the screen:
Non-optimal solution found: set reduced from 132 SNPs to 130 SNPs.
Tells the program whether to display its progress on the screen. Accepted values are 'yes' (y) or 'no' (n). The messages displayed during the course of a complete run are as follows:
1% of sites sorted.
2% of sites sorted.
...
99% of sites sorted.
Sites sorted by bin content.
First algorithm step 1% complete.
First algorithm step 2% complete.
...
First algorithm step 99% complete.
Second algorithm step complete.
The first set of messages documents the process of sorting all observed tagSNPs by the number of populations in which they tag a bin (in decreasing order), a step that speeds up the algorithm. The next set of messages displays the percentage of tagSNPs that have passed through the first step of the algorithm. The second step of the algorithm is typically fast even for very large problems, so its progress is not shown in detail.
The output file contains two lines. The first reports the total number of SNPs selected, and the second lists the selected SNPs in alphabetical or numerical order (as appropriate). Each selected SNP also has a suffix showing which bins it tags in the ldSelect files used as input; these bin numbers are displayed in the same order as the ldSelect files and separated by forward slashes. For example, in the three-population example shown above, the output "3596-3/1/0" means that SNP 3596 tags Bin 3 in pop1, Bin 1 in pop2, and is not a tagSNP in pop3.
num_tagSNPs: 3
Selected_tagSNPs: 887-1/2/1 3596-3/1/0 15875-2/2/0
If the -context flag is used or the ldSelect output files were created using the -context flag in that program, all of the SNP identifiers are preceded by a two-letter code indicating the sequence context and the genomic context for the sites. The first letter indicates the sequence context: (U)nique sequence, (R)epeat-containing sequence, or (X) other. The second letter indicates the genomic context: (F)lanking region, 5' or 3' (U)TR, (I)ntron, (S)ynonymous cSNP, (N)onsynonymous cSNP, (T) frame shift, or (X) other.
num_tagSNPs: 3
Selected_tagSNPs: UU-887-1/2/1 US-3596-3/1/0 RF-15875-2/2/0
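For downstream processing, each entry on the second output line can be split into its context code, SNP identifier, and per-population bin numbers. The Python sketch below is a hypothetical reader, not part of the distribution. It assumes that a context code is always two letters followed by a hyphen, so a SNP identifier that itself consists of two letters would be misread.

```python
SEQUENCE_CONTEXT = {"U": "unique", "R": "repeat", "X": "other"}
GENOMIC_CONTEXT = {
    "F": "flanking", "U": "UTR", "I": "intron", "S": "synonymous",
    "N": "nonsynonymous", "T": "frame shift", "X": "other",
}


def parse_selected(token):
    """Split an entry such as '3596-3/1/0' or 'US-3596-3/1/0'.

    Returns (context_code_or_None, snp_identifier, list_of_bin_numbers).
    """
    code = None
    # A context code occupies the first two characters and is followed by a
    # hyphen, which is why the third character of a SNP identifier cannot
    # be a hyphen.
    if len(token) > 3 and token[2] == "-" and token[:2].isalpha():
        code, token = token[:2], token[3:]
    # The bin suffix follows the last hyphen; bins are separated by slashes.
    snp, _, bins = token.rpartition("-")
    return code, snp, [int(b) for b in bins.split("/")]


line = "Selected_tagSNPs: UU-887-1/2/1 US-3596-3/1/0 RF-15875-2/2/0"
entries = [parse_selected(tok) for tok in line.split()[1:]]
for code, snp, bins in entries:
    print(snp, SEQUENCE_CONTEXT[code[0]], GENOMIC_CONTEXT[code[1]], bins)
```

A bin number of 0 in the parsed list means the SNP is not a tagSNP in the corresponding population.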
The program is written in the Perl programming language and runs on any operating system with Perl 5.0 or above installed. Click here to download multiPopTagSelect.pl.
1. Howie BN, Carlson CS, Rieder MJ, Nickerson DA. Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet. 2006;120(1):58-68.
2. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet. 2004;74(1):106-20.