Use R to create a PED/MAP dataset for \(N = 500\) cases + \(N = 500\) controls and \(n = 2\) SNPs so that:
- sex is chosen randomly
- each individual is randomly assigned to one of two separate subpopulations \(A\) or \(B\) (aka stratification)
- the first SNP* conforms to the Hardy-Weinberg principle in the whole population \(A \cup B\) with a unique allele frequency \(f = 0.3\)
- the second SNP* conforms does conform to the HW principle in each separate subpopulation \(f_A = 0.3\) and \(f_B = 0.6\)
- the third locus* is not polymorphic, neither in population A nor in population B, so all loci will be the same allele.
Create a phenotype file with the population encoded as 1 (“A”) or 2 (“B”). Don’t forget the header file. This will be useful later on to calculate \(F_{st}\).
Using plink, convert the PED/MAP files to the following formats:
-
plink --file sample --make-bed --out sample
Remember that
--file sample
refers to the two files sample.ped, sample.map. Now that we have generated the BED/BIM/FAM files, we’ll hereafter use--bfile sample
. The option--out
allows the user to choose the prefix used for all output files (“plink” would be used otherwise). -
plink --file bsample --recode transpose --out sample
The
--recode
option is the general option allowing conversion from one file format to the other. Additive RAW using
plink --bfile sample --recode A --out sample
-
Describe each file and each format** and think of all the advantages and disadvantages each format has over the others.
What’s the consequence of the absence of polymorphism for the third locus as far as the formats are concerned?
Calculate the \(F_{st}\) index using the fst option
plink --bfile sample --fst case-control \ --pheno sample.pop --pheno-name Subpopulation \ --out sample
(The backaslash signs allows us to write a long command line over several lines.)
What’s the consequence of the absence of polymorphism for the third locus as far as the \(F_{st}\) is concerned?
Import the (additive) RAW file in R and the population/phenotype file and calculate the \(F_{st}\) values using R based on the formula \[ F_{st} = \frac{ \bar{p} (1 - \bar{p}) - \overline{p (1 - p)} }{\bar{p} (1 - \bar{p})} \]
\(\bar{p} (1 - \bar{p})\) is calculed using the whole population’s estimated allele frequency \(\bar{p}\);
the \(\overline{p (1 - p)}\) is the average of the \(p_i (1 - p_i)\) calculated for each population \(i\) (\(p_i\) being the allele frequencing within that population).
* Note that the alleles as well as the chromosome location are irrelevant for this application so make arbitrary choices.
** For your convenience, each format mentioned in this exercise comes with a link to the relevant part of plink’s documentation.
Addendum
See previous exercise for installation instructions.
\(F_{st}\) (aka fixation index) is one of the commonest statistics used to measure the divergence between populations.
plink uses a more complex formula to calculate the \(F_{st}\) values (Weir and Cockerham, Evolution 1984, 38(6):1358-1370).