shellfish: Parallel PCA and data processing for genome-wide SNP data
Table of Contents
Introduction
shellfish
carries out a variety of tasks related to principal
component analysis of genome-wide SNP data. Unlike other available
software, PCA computations can be carried out in parallel (both on a
computing cluster running the Sun Grid Engine, and also in the simple
case of a machine with multiple processors). In addition to the PCA
calculations, it automates the process of data subsetting and
allele-matching, using plink
and gtool
for file format
interconversion where necessary. The aim is that tasks that would
otherwise require a complex series of shell commands and/or work in
R
, can be carried out with a single, straightforward,
command.
Linear algebra computations make use of standard BLAS and LAPACK
libraries, and data manipulations are performed using standard shell
commands or one of several bundled command-line utilities written in
C. shellfish
is therefore very efficient and can be used on data
sets involving over 10,000 individuals typed at hundreds of thousands
of SNPs.
Please note that shellfish is beta software and is not currently under active development. While writing and maintaining software for the scientific community is important, it's not always clear that it's particularly beneficial to one's academic career.
Setting things up
-
shellfish
lives on linux/unix/OS X machines; not Windows. It is written inpython
, so the machine must havepython
installed. - Download and compile shellfish.
- If you're going to work with plink format files (.ped, .bed, etc), you probably want to have gtool and plink installed.
git clone git@github.com:dandavison/shellfish.git cd shellfish/src make all cd ..
Data formats
shellfish
can use the following input file formats
-
{.ped, .map}
plink
files -
{.bed, .bim, .fam} binary
plink
files -
{.gen, .sample} files uses by the
chiamo/impute/gtool/etc
suite of programs -
{.gen.gz, .sample} gzipped files uses by the
chiamo/impute/gtool/etc
suite of programs -
{.geno, .map} a simple, compact genotype format used by
shellfish
Example usage
Make shellfish-format data
Shellfish
automatically converts data to the necessary format,
but this example illustrates the basic arguments, and shows how to
form a dataset for analysis containing a subset of the SNPs in the
original data file:
./shellfish --make-geno --file bigdata --file2 smalldata --out smalldata_output
Note the following:
-
The requested action is supplied with the
--make-geno
option -
The main input is specified as
--file bigdata
. Whenever you supply an argument like--file basename
, that implies that, for example {basename.gen
andbasename.sample
} or {basename.geno
,basename.map
} exist. -
--file2 smalldata
is used to specify the subset of SNPs which are required in the output data file. This option implies that a map filesmalldata.map
exists, but not necessarily any associated genotype data (e.g.smalldata.ped
).
The corollary of this is that, when keeping lists of SNPs, keep
them in plink
.map
format. This might be a lists of SNPs that
are not in strong LD with each other, or a list of SNPs with PCA
loadings. Any extra information like PCA loadings and allele
frequencies can be stuck on as extra columns, after the mandatory
.map
format columns (chrom=1, rs=2, cM=3, bp=4, allele1=5,
allele2=6).
Parallel principal component analysis
Sun Grid Engine
The following command computes the first 10 principal components on a cluster running the Sun Grid Engine, using a maximum of 200 processes. Jobs are submitted at priority level 4.
./shellfish.py --pca --numpcs 10 --sge --sge-level 4 --maxprocs 200 --file basename --out outputname
Of course, you might want to add nohup
at the beginning, redirect
the output and background the process if it's going to be running
for a long time.
Normal multiprocessor machine
./shellfish.py --pca --numpcs 10 --maxprocs 6 --file basename --out outputname
Project individuals into pre-computed principal component space
Suppose you want to investigate the locations of a collection of genotyped individuals with respect to the 3 HapMap clusters. Rather than carrying out the principal component analysis from scratch, all that is needed are the "loadings" of your SNPs (or a large subset thereof) on the two principal components that describe structure in the HapMap. These loadings are available for download as follows:
After uncompressing (use gunzip) you'll see that those are
tab-separated files that follow the conventions of plink
.map
files. Here's the first two lines of one of them:
1 rs3094315 0 792429 A G 0.295238 -6.79218 1.628052 1 rs12562034 0 808311 A G 0.75 -4.819747 -3.31086
This is a general shellfish
feature: files containing
information about SNPs always follow the .map
format used by
plink, for the first 4 columns at least. Columns are
tab-separated, and there are at least 4 columns. The columns
contain:
- Chromosome number
- rs ID
- cM location (can be anything, but the column must be present)
- physical location
- allele 1
- allele 2
- allele frequency used when computing the PCA
- PC1 loading
- PC2 loading
- …
Briefly, suppose that
-
you have genotype data for Illumina SNPs in binary
plink
format, contained in files calledgenotypes.bed
,genotypes.bim
andgenotypes.fam
- you have downloaded the HapMap PCA SNP loadings file for Illumina provided above.
-
Your
genotypes.bim
file identifes SNPs by rs ID, in the same way as the snpload-ill.map file does.
The shellfish
command for projecting those individuals into the
HapMap principal component space is:
./shellfish.py --project --numpcs 2 --file genotypes --file2 snpload-ill
Note that shellfish
commands generally follow plink
conventions:
-
In particular, you supply the file names
genotypes
andsnpload-ill
without the extensions (.bed,.map,.bim
, etc).
If you know how to put executables into a directory listed in your
$PATH variable then fine; but if you are unsure what that means,
then for now you will be best off executing all commands from the
shellfish
directory that you downloaded and unzipped. Also, you
should put plink
and gtool
in there, or else be in a position
where you can just type plink
and gtool
and the system will
know what you mean.