All rights reserved.
You have the permission to use and develop ldSelect.pl ("the
software"), provided that the following conditions are met:
This software is provided ``AS IS'' and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
The ldSelect program analyzes the patterns of linkage disequilibrium (LD) between polymorphic sites in a locus, and bins the SNPs on the basis of a threshold level of LD as measured by r2.
At each round of selection, the binning algorithm identifies the single SNP which exceeds the threshold r2 with the maximum number of other SNPs, and sets this group of SNPs as a bin. Then each SNP within the bin is analyzed to determine whether it exceeds the threshold r2 with all other SNPs in the bin. All SNPs in a bin that meet this criterion are designated as TagSNPs. Only one TagSNP needs to be typed per bin.
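The selection rounds described above can be sketched in Python. This is a minimal illustration, not the ldSelect source: the function name `greedy_bins`, the layout of the `r2` table (a dict keyed by `frozenset` pairs), the tie-breaking rule, and the treatment of values exactly equal to the threshold are all assumptions made for the example.

```python
def greedy_bins(snps, r2, threshold=0.64):
    """Greedy LD binning over a precomputed pairwise r2 table.

    snps: list of SNP identifiers.
    r2:   dict mapping frozenset({a, b}) -> r-squared (hypothetical layout).
    """
    def linked(a, b):
        # "Exceeds the threshold" is read as strictly greater (an assumption).
        return r2.get(frozenset((a, b)), 0.0) > threshold

    unbinned = set(snps)
    bins = []
    while unbinned:
        # Pick the SNP linked to the most other unbinned SNPs.
        # Ties go to the first SNP in sort order (an arbitrary choice).
        best = max(
            sorted(unbinned),
            key=lambda s: sum(linked(s, t) for t in unbinned if t != s),
        )
        members = {best} | {t for t in unbinned if t != best and linked(best, t)}
        # A tagSNP must be linked to every other member of its bin.
        tags = sorted(
            s for s in members if all(linked(s, t) for t in members if t != s)
        )
        bins.append({"tags": tags, "others": sorted(members - set(tags))})
        unbinned -= members
    return bins


# Toy input: A is linked to B and C, B and C are not linked, D is alone.
example_r2 = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("B", "C")): 0.3,
}
example_bins = greedy_bins(["A", "B", "C", "D"], example_r2)
```

In the toy input, only A is linked to every other member of its bin, so it is the bin's single tagSNP, and D ends up in a singleton bin.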
For details of the ldSelect algorithm, see Carlson et al., 2004.
ldSelect is written in Perl and runs only from the command line.
ldSelect.pl [-pb] prettybase file
    <-r2> r-squared threshold, from 0.0 to 1.0, default: 0.64
    <-context> a file containing genomic and sequence context for snps
    <-required> a list of sites required to be tagSNPs
    <-verbose> print command line args? (y or n; default in 'no')
    <-freq> minor allele frequency threshold (>0.0 and <=0.5)
    <-excluded> a list of sites that can not be tagSNPs
Flags in square brackets are required; flags in angle brackets are optional.
The -pb flag specifies the prettybase file, a tab-delimited text file that contains the SNP genotype information in the format below:
Site Sample Allele1 Allele2
000834 D001 G G
000834 D002 G G
000834 D003 G G
000834 D004 G G
000834 D005 N N
000834 E001 G G
000834 E002 G G
000834 E003 G G
000834 E004 G G
000834 E005 G G
000963 D001 T T
000963 D002 T T
000963 D003 T T
000963 D004 T T
000963 D005 N N
000963 E001 T T
000963 E002 N N
000963 E003 G T
000963 E004 G G
000963 E005 G T
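As an illustration of the four-column layout, here is a minimal Python sketch that reads prettybase rows into a nested dictionary. The function name `parse_prettybase` and the returned structure are hypothetical. The sketch assumes the file has no header row and silently skips rows that do not have exactly four fields.

```python
from collections import defaultdict


def parse_prettybase(lines):
    """Return {site: {sample: (allele1, allele2)}} from prettybase rows."""
    genotypes = defaultdict(dict)
    for line in lines:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        site, sample, allele1, allele2 = fields
        genotypes[site][sample] = (allele1, allele2)
    return dict(genotypes)


rows = ["000834\tD001\tG\tG", "000963\tE003\tG\tT", ""]
parsed = parse_prettybase(rows)
```

Site identifiers are kept as strings so that leading zeros such as `000834` are preserved.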
The -r2 flag specifies the r2 threshold of LD for binning SNPs. It can range from 0.0 to 1.0; the default value is 0.64.
The -freq flag specifies the minor allele frequency cutoff for SNPs to be clustered. It must be > 0.0 and <= 0.5.
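To make the cutoff concrete, the following Python sketch computes a minor allele frequency from the genotype pairs of one site. It is an illustration under stated assumptions, not the ldSelect code: `N` is treated as a missing call (as the prettybase example suggests), sites are assumed biallelic, and the helper name `minor_allele_frequency` is hypothetical.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """genotypes: iterable of (allele1, allele2) pairs for one site."""
    # Count called alleles only; "N" is assumed to mean "no call".
    counts = Counter(a for pair in genotypes for a in pair if a != "N")
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0  # monomorphic site, or no called alleles
    return min(counts.values()) / total


# Genotypes for site 000963 from the prettybase example above.
site_000963 = [
    ("T", "T"), ("T", "T"), ("T", "T"), ("T", "T"), ("N", "N"),
    ("T", "T"), ("N", "N"), ("G", "T"), ("G", "G"), ("G", "T"),
]
maf = minor_allele_frequency(site_000963)
```

For this site there are 12 T alleles and 4 G alleles among the 16 called alleles, so the minor allele frequency is 0.25. Whether a site whose frequency exactly equals the cutoff is kept is not stated here.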
The -context flag specifies the file that contains the genomic and sequence context for SNPs. The file is a tab-delimited text file with the columns listed below:
SNP-coordinate genomic-context sequence-context
644 5'-flanking unique
887 5'-flanking unique
644 5'-flanking unique
834 5'-flanking unique
3428 intron repeat
3524 intron repeat
3596 intron unique
4125 intron unique
40239 3'-flanking repeat
40265 3'-flanking repeat
Specifies the file that contains a list of sample names to be clustered. It is optional. The file is a text file with one sample name per line, as below:
D001
D002
D003
D004
The -required flag specifies the file that contains a set of SNPs required to be tagSNPs. It is optional. The file is a text file with one SNP position per line, as below:
644
834
3524
4125
40239
The -excluded flag specifies a file that contains a set of SNPs that are excluded as tagSNPs. This function is provided because a SNP is sometimes difficult (or impossible) to genotype on a particular platform, so it is convenient to specify that it not be selected as a tagSNP. Excluded SNPs are not used to anchor bins, and are shown in the output in parentheses (e.g. "(004645)"). If an excluded SNP falls into a singleton bin, it is below the r2 threshold with all non-excluded markers. The -excluded flag is optional. The file is a text file with one SNP position per line, as below:
644
834
3524
4125
40239
The output file is organized by bins, sorted in descending order by the total number of sites in each bin. The first line of each bin gives the total number of SNP sites in the bin and the average minor allele frequency of all the SNPs in the bin. The second line lists the tagSNPs in the bin by position. The third line lists the other SNP sites in the bin, those that are not tagSNPs, also by position.
Bin 1 total_sites: 8 average_minor_allele_frequency: 15%
Bin 1 TagSnps: 11710
Bin 1 other_snps: 12954 13573 20207 26629 29517 33639 9367
Bin 2 total_sites: 7 average_minor_allele_frequency: 44%
Bin 2 TagSnps: 38834 41744
Bin 2 other_snps: 39727 40147 43118 43170 43928
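For downstream processing, output in this three-line-per-bin form could be read back with a short Python sketch like the one below. The function name `parse_bins` and the returned dictionary layout are hypothetical. The regular expression assumes each record sits on its own line and begins with `Bin <number> <field>:` as in the example above; the exact whitespace in real output may differ.

```python
import re

# Matches "Bin <n> <field>: <rest>" for the three field names shown above.
LINE = re.compile(r"^Bin\s+(\d+)\s+(total_sites|TagSnps|other_snps):\s*(.*)$")


def parse_bins(lines):
    """Return {bin_number: {"total_sites": int, "tags": [...], "others": [...]}}."""
    bins = {}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a bin record
        number, field, rest = int(match.group(1)), match.group(2), match.group(3)
        entry = bins.setdefault(
            number, {"total_sites": None, "tags": [], "others": []}
        )
        if field == "total_sites":
            entry["total_sites"] = int(rest.split()[0])
        elif field == "TagSnps":
            entry["tags"] = rest.split()
        else:
            entry["others"] = rest.split()
    return bins


sample_output = [
    "Bin 2 total_sites: 7 average_minor_allele_frequency: 44%",
    "Bin 2 TagSnps: 38834 41744",
    "Bin 2 other_snps: 39727 40147 43118 43170 43928",
]
parsed_bins = parse_bins(sample_output)
```

Since only one tagSNP needs to be typed per bin, a caller could take the first entry of each bin's `tags` list as a candidate.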
If the -context flag is used, each SNP site is preceded by a two-letter code indicating the sequence context and the genomic context of the site. The first letter indicates the sequence context: (U)nique sequence, (R)epeat-containing sequence, (X) other. This information is useful because designing genotyping assays is generally easier for unique sequences. The second letter indicates the genomic context: (F)lanking region, 5' or 3' (U)TR, (I)ntron, (S)ynonymous cSNP, (N)onsynonymous cSNP, (T) frame shift, (X) other.
Bin 1 total_sites: 8 average_minor_allele_frequency: 15%
Bin 1 TagSnps: UI-11710
Bin 1 other_snps: US-12954 UI-13573 UI-20207 RI-26629 UI-29517 UI-33639 US-9367
Bin 2 total_sites: 7 average_minor_allele_frequency: 44%
Bin 2 TagSnps: RI-38834 RI-41744
Bin 2 other_snps: RI-39727 RI-40147 UI-43118 UI-43170 UI-43928
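The two-letter prefix can be decoded with a small lookup, sketched here in Python. The letter meanings come from the description above. The function name `decode_site` and the assumption that the code and the position are joined by a single hyphen (as in `UI-11710`) are illustrative only. The sketch does not handle excluded SNPs shown in parentheses.

```python
# First letter: sequence context. Second letter: genomic context.
SEQUENCE_CONTEXT = {
    "U": "unique sequence",
    "R": "repeat-containing sequence",
    "X": "other",
}
GENOMIC_CONTEXT = {
    "F": "flanking region",
    "U": "5' or 3' UTR",
    "I": "intron",
    "S": "synonymous cSNP",
    "N": "nonsynonymous cSNP",
    "T": "frame shift",
    "X": "other",
}


def decode_site(token):
    """Split a token such as 'RI-38834' into (position, sequence, genomic)."""
    code, _, position = token.partition("-")
    if len(code) != 2 or not position:
        raise ValueError(f"unexpected site token: {token!r}")
    return position, SEQUENCE_CONTEXT[code[0]], GENOMIC_CONTEXT[code[1]]


decoded = decode_site("RI-38834")
```

Note that `U` means "unique sequence" in the first position and "UTR" in the second, so the two letters have to be looked up in separate tables.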
ldSelect runs on any operating system with Perl 5.0 or later installed.
1. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally informative set of single-nucleotide polymorphisms for association analysis using linkage disequilibrium. Am J Hum Genet. 2004 Jan;74(1):106-20. Epub 2003 Dec 15.