Documentation for ldSelect Version 1.0

Program: ldSelect
Version: 1.0
Copyright (C) 2004-2004
by Deborah A. Nickerson, Mark Rieder, Chris Carlson, Qian Yi
University of Washington

All rights reserved.

You have the permission to use and develop ldSelect.pl ("the software"), provided that the following conditions are met:
1. The software is not published, distributed, or otherwise transferred or made available to others.
2. If utilization of the software results in outcomes which will be published, please specify the version of the software you used and cite attributions noted above.
3. You acknowledge that University of Washington ("UW") and the UW developers may develop modifications to the software that may be substantially similar to your modifications of the software and that UW and UW developers shall not be constrained in any way by you in UW's and UW developers' use or management of such modifications. You acknowledge the right of the UW and UW developers to prepare and publish modifications to the software that may be substantially similar or functionally equivalent to your modifications and improvements, and if you obtain patent protection for any modification or improvement to the software, you agree not to allege or enjoin infringement of your patent by the UW and UW developers.

This software is provided ``AS IS'' and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.


Contents

  1. Introduction
  2. Usage and Flags
  3. The Output Report
  4. Installation
  5. References


Introduction

The ldSelect program analyzes the patterns of linkage disequilibrium (LD) between polymorphic sites in a locus and bins the SNPs on the basis of a threshold level of LD, as measured by r2.

At each round of selection, the binning algorithm identifies the single SNP which exceeds the threshold r2 with the maximum number of other SNPs, and sets this group of SNPs as a bin. Then each SNP within the bin is analyzed to determine whether it exceeds the threshold r2 with all other SNPs in the bin. All SNPs in a bin that meet this criterion are designated as TagSNPs. Only one TagSNP needs to be typed per bin.
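
The selection rounds described above can be sketched in Python as follows. This is an illustrative sketch only, not the ldSelect.pl source. The function name bin_snps, the dictionary of pairwise r2 values, the use of ">=" for "exceeds the threshold", and the tie-breaking rule are all assumptions. Handling of required and excluded sites is omitted.

```python
def bin_snps(snps, r2, threshold=0.64):
    """Greedy LD binning sketch. r2 maps frozenset({a, b}) to an r-squared value."""
    def linked(a, b):
        # Assumption: "exceeds the threshold" is treated as >= here.
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = sorted(snps)
    bins = []
    while remaining:
        # Find the SNP linked to the largest number of remaining SNPs.
        # Ties go to the first SNP in sorted order (an assumption).
        best = max(remaining,
                   key=lambda s: sum(linked(s, t) for t in remaining if t != s))
        members = [best] + [t for t in remaining if t != best and linked(best, t)]
        # A tagSNP must be linked to every other member of its bin.
        tags = [s for s in members
                if all(linked(s, t) for t in members if t != s)]
        bins.append({"tags": sorted(tags),
                     "others": sorted(set(members) - set(tags))})
        # Binned SNPs take no part in later rounds.
        remaining = [s for s in remaining if s not in members]
    return bins
```

A SNP that is linked to no other SNP ends up alone in a singleton bin and is its own tagSNP.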

The details of the ldSelect algorithm are described in Carlson et al., 2004 (see References).


Usage and Flags

ldSelect is written in the Perl programming language and is run from the command line only.

ldSelect.pl
        [-pb]   prettybase file
        <-r2>   r-squared threshold, from 0.0 to 1.0, default: 0.64
        <-context>      a file containing genomic and sequence context for SNPs
        <-sample>       a list of sample names to be clustered
        <-required>     a list of sites required to be tagSNPs
        <-verbose>      print command line args? (y or n; default is 'no')
        <-freq>         minor allele frequency threshold (>0.0 and <=0.5)
        <-excluded>     a list of sites that cannot be tagSNPs

Flags in square brackets are required; flags in angle brackets are optional.
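
As a hedged illustration of how the flags combine, the Python sketch below assembles an argument list for a run and checks the documented ranges for -r2 and -freq first. The helper name build_ldselect_args and the file name gene.pb are hypothetical. Only flags from the listing above are used, and -verbose is left out.

```python
def build_ldselect_args(pb, r2=None, freq=None, context=None, sample=None,
                        required=None, excluded=None, script="ldSelect.pl"):
    """Build an argument list for running ldSelect.pl through perl."""
    # Ranges as stated in the usage listing.
    if r2 is not None and not 0.0 <= r2 <= 1.0:
        raise ValueError("-r2 must be between 0.0 and 1.0")
    if freq is not None and not 0.0 < freq <= 0.5:
        raise ValueError("-freq must be > 0.0 and <= 0.5")
    args = ["perl", script, "-pb", pb]  # -pb is the only required flag
    options = [("-r2", r2), ("-freq", freq), ("-context", context),
               ("-sample", sample), ("-required", required),
               ("-excluded", excluded)]
    for flag, value in options:
        if value is not None:  # optional flags are added only when given
            args += [flag, str(value)]
    return args


# Hypothetical input file name, for illustration only.
print(" ".join(build_ldselect_args("gene.pb", r2=0.8, freq=0.05)))
```

The resulting list can be passed to subprocess.run if the command is to be launched from Python.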

 

-pb [Required]
Specifies the prettybase file. The prettybase file is a tab-delimited text file that contains the genotype of each sample at each SNP site, one genotype per line, formatted as below:

	Site   Sample   Allele1   Allele2

for example:

	000834   D001    G       G
	000834   D002    G       G
	000834   D003    G       G
	000834   D004    G       G
	000834   D005    N       N
	000834   E001    G       G
	000834   E002    G       G
	000834   E003    G       G
	000834   E004    G       G
	000834   E005    G       G
	000963   D001    T       T
	000963   D002    T       T
	000963   D003    T       T
	000963   D004    T       T
	000963   D005    N       N
	000963   E001    T       T
	000963   E002    N       N
	000963   E003    G       T
	000963   E004    G       G
	000963   E005    G       T
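
For readers who want to inspect a prettybase file before running ldSelect, the following Python sketch reads the four columns into a nested dictionary. The function name read_prettybase is hypothetical, and splitting on any whitespace (rather than strictly on tabs) is an assumption made so that space-aligned files are also accepted.

```python
from collections import defaultdict


def read_prettybase(lines):
    """Return {site: {sample: (allele1, allele2)}} from prettybase lines."""
    genotypes = defaultdict(dict)
    for line in lines:
        fields = line.split()  # splits on tabs or runs of spaces
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        site, sample, allele1, allele2 = fields
        genotypes[site][sample] = (allele1, allele2)
    return dict(genotypes)
```

Any iterable of lines works as input, including an open file handle.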

-r2 [Optional]
Specifies the r2 threshold of LD for binning SNPs. It can range from 0.0 to 1.0. The default value is set to 0.64.

-freq [Optional]
Specifies the minor allele frequency cutoff for SNPs to be clustered. It must be > 0.0 and <= 0.5.
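
To make the quantity being thresholded concrete, the Python sketch below computes a minor allele frequency from a list of genotypes. It is not taken from ldSelect.pl. Treating the allele code N as a missing call, and ignoring such calls, is an assumption, as is the function name minor_allele_frequency. How ldSelect compares a site's frequency against the cutoff is not specified here.

```python
from collections import Counter


def minor_allele_frequency(genotypes, missing="N"):
    """Minor allele frequency over called alleles; None if nothing was called."""
    # Count every allele except the missing-data code (assumed to be "N").
    counts = Counter(a for pair in genotypes for a in pair if a != missing)
    total = sum(counts.values())
    if total == 0:
        return None
    if len(counts) < 2:
        return 0.0  # monomorphic site
    # Frequency of the second most common allele (assumes a biallelic site).
    return counts.most_common(2)[1][1] / total
```

For site 000963 in the prettybase example above, the eight called genotypes contain 4 G alleles out of 16, so the function returns 0.25.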

-context [Optional]
Specifies the file that contains the genomic and sequence context for SNPs. The file is a tab-delimited text file that contains the columns listed below:

	SNP-coordinate   genomic-context   sequence-context

for example:

	644     5'-flanking     unique
	887     5'-flanking     unique
	834     5'-flanking     unique
	3428    intron  repeat
	3524    intron  repeat
	3596    intron  unique
	4125    intron  unique
	40239   3'-flanking     repeat
	40265   3'-flanking     repeat

-sample [Optional]
Specifies the file that contains a list of sample names to be clustered. The file contains a single column of sample names in text format, as below:

	D001
	D002
	D003
	D004

-required [Optional]
Specifies the file that contains a set of SNPs required to be tagSNPs. The file contains a single column of SNP positions in text format, as below:

	644
	834
	3524
	4125
	40239

-excluded [Optional]
Specifies a file that contains a set of SNPs that are excluded from being tagSNPs. This option is provided because a SNP is sometimes difficult (or impossible) to genotype on a particular platform, so it is convenient to specify that it not be selected as a tagSNP. Excluded SNPs are not used to anchor bins and are shown in the output in parentheses (e.g. "(004645)"). If an excluded SNP falls into a singleton bin, it is below the r2 threshold with all non-excluded markers. The file contains a single column of SNP positions in text format, as below:

	644
	834
	3524
	4125
	40239


The Output Report

The output file is organized by bin, with bins sorted in descending order of the total number of sites they contain. The first line of each bin gives the total number of SNP sites in the bin and the average minor allele frequency of those SNPs. The second line lists the tagSNPs in the bin by position. The third line lists the remaining SNP sites in the bin, which are not tagSNPs, also by position.

for example:

Bin 1	total_sites: 8	average_minor_allele_frequency: 15%
Bin 1	TagSnps: 11710 
Bin 1	other_snps: 12954 13573 20207 26629 29517 33639 9367 

Bin 2	total_sites: 7	average_minor_allele_frequency: 44%
Bin 2	TagSnps: 38834 41744 
Bin 2	other_snps: 39727 40147 43118 43170 43928 

If the -context flag is used, each SNP site is prefixed with a two-letter code indicating the sequence context and the genomic context of the site. The first letter indicates the sequence context: (U)nique sequence, (R)epeat-containing sequence, (X) other. This information is useful because designing genotyping assays is generally easier for unique sequences. The second letter indicates the genomic context: (F)lanking region, 5' or 3' (U)TR, (I)ntron, (S)ynonymous cSNP, (N)onsynonymous cSNP, (T) frame shift, (X) other.

for example:

Bin 1	total_sites: 8	average_minor_allele_frequency: 15%
Bin 1	TagSnps: UI-11710 
Bin 1	other_snps: US-12954 UI-13573 UI-20207 RI-26629 UI-29517 UI-33639 US-9367 

Bin 2	total_sites: 7	average_minor_allele_frequency: 44%
Bin 2	TagSnps: RI-38834 RI-41744 
Bin 2	other_snps: RI-39727 RI-40147 UI-43118 UI-43170 UI-43928 
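
For downstream processing, a report in the layout shown above can be read back with a short Python sketch such as the one below. The function names parse_report and split_context are hypothetical, and the regular expressions are inferred from the example lines only, so real reports may need adjustments (for instance, excluded SNPs shown in parentheses are kept here as plain strings).

```python
import re

BIN_LINE = re.compile(r"^Bin\s+(\d+)\s+(.*)$")
TOTALS = re.compile(
    r"total_sites:\s*(\d+)\s+average_minor_allele_frequency:\s*([\d.]+)%")
CONTEXT = re.compile(r"^([URX])([FUISNTX])-(.+)$")


def parse_report(lines):
    """Return {bin_number: {"total_sites", "maf_percent", "tags", "others"}}."""
    bins = {}
    for line in lines:
        m = BIN_LINE.match(line.strip())
        if not m:
            continue  # blank lines and anything that is not a bin line
        entry = bins.setdefault(int(m.group(1)), {})
        rest = m.group(2)
        totals = TOTALS.match(rest)
        if totals:
            entry["total_sites"] = int(totals.group(1))
            entry["maf_percent"] = float(totals.group(2))
        elif rest.startswith("TagSnps:"):
            entry["tags"] = rest.split(":", 1)[1].split()
        elif rest.startswith("other_snps:"):
            entry["others"] = rest.split(":", 1)[1].split()
    return bins


def split_context(token):
    """Split 'UI-11710' into ('U', 'I', '11710'); plain sites give (None, None, site)."""
    m = CONTEXT.match(token)
    return m.groups() if m else (None, None, token)
```

Sites are kept as strings so that any zero-padding in the positions is preserved.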


Installation

The program is written in the Perl programming language. It can run on any operating system with Perl 5.0 or later installed.


References

1. Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a maximally
   informative set of single-nucleotide polymorphisms for association analysis using linkage
   disequilibrium. Am J Hum Genet. 2004 Jan;74(1):106-20. Epub 2003 Dec 15.