Deriving kinship coefficients from samples and genotypes in a VCF file.

I am currently investigating a rare phenotype among a cohort of ostensibly unrelated individuals. Together with a colleague here at the University of Virginia, I have performed deep exome sequencing of each individual in our cohort in an effort to identify new genetic variation that accounts for the phenotype. In so doing, we noticed (rather by chance) that two of our samples had strikingly similar genotypes at several loci that we were interested in. This quickly raised two possible explanations. First, it might be that the two samples were closely related. Alternatively, it could be that the two sample DNAs were from the same individual; in other words, there was a laboratory mix-up that led to the same individual being sequenced twice.

Now, were the samples to have come from the same individual (or monozygotic twins), one would expect that well over 95% of the genotypes for the two samples would be identical (the remainder being discordant owing to fluctuations in sequence coverage that would lead to missed heterozygote calls in one sample or the other). Yet if the samples are related, but their relatedness is more cryptic, we need a more powerful statistic to describe the level of relatedness between our samples. The kinship coefficient (f) is a measure of relatedness that represents the probability that two alleles, one sampled at random from each sample/individual, are identical by descent. For example, take a mother and child. If you sample an allele at a given site from the child, there is a 0.50 probability that that allele came from the mother. Given that the allele came from the mother, there is a 0.50 probability that the allele chosen at random in the mother is the same as the allele already chosen in the child. These are independent probabilities, so the kinship coefficient for a parent (P) and child (C) is:

fPC = 0.50 * 0.50 = 0.25.

People are often confused by the fact that the kinship coefficient for a self-self comparison or a monozygotic twin comparison is 0.50.  Yet recall that we are sampling alleles at random (with replacement) and testing for identity by descent.  Thus, even for the same individual, we have a 0.50 probability of choosing the same parental allele twice.

The following table lists the kinship coefficient (and the coefficient of relatedness, r = 2*f) for several common cases.

Relationship        Kinship coefficient   Coefficient of relatedness
Self                0.5000                1.0000
Monozygotic twins   0.5000                1.0000
Parent-child        0.2500                0.5000
Full siblings       0.2500                0.5000
Half siblings       0.1250                0.2500
First cousins       0.0625                0.1250
Unrelated           0.0000                0.0000
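
The values in the table follow a simple halving pattern, which can be checked with a one-line awk helper. This is only a sketch: the function name expected_kinship and the "degree" argument d are my own labels (0 for self or monozygotic twins, 1 for parent-child or full siblings, 2 for half siblings, 3 for first cousins), and it assumes no inbreeding, so that f = (1/2)^(d+1) and r = 2*f.

```shell
# expected_kinship D: print the expected kinship coefficient f and the
# coefficient of relatedness r = 2*f for a relationship of degree D.
# Assumes non-inbred individuals; D is an illustrative label, where
# 0 = self/MZ twins, 1 = parent-child or full sibs, 2 = half sibs,
# 3 = first cousins.
expected_kinship() {
    awk -v d="$1" 'BEGIN {
        f = 0.5 ^ (d + 1)              # kinship halves with each degree
        printf "%.4f %.4f\n", f, 2 * f
    }'
}

# Example: expected_kinship 3   (first cousins)
```

Calling it with 0, 1, 2 and 3 reproduces the first six rows of the table.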

The kinship coefficient is extremely useful for identifying cryptic relationships among samples (e.g. population stratification, inbreeding, etc.) and as a potent means of quality control. For example, as I describe here, it can be used to identify identical samples, as well as unexpected relationships among ostensibly unrelated samples.

It turns out that my colleagues Ani Manichaikul and Wei-min Chen, both Assistant Professors in the Center for Public Health Genomics at The University of Virginia, recently published a new software package (KING) for rapidly computing the kinship coefficient (among many other useful statistics). However, KING does not directly support the VCF format as input. One can get around this problem by chaining three tools: vcftools, then PLINK, then KING. Below are the necessary steps:

First use vcftools to convert the VCF file (call it example.vcf) to PLINK .PED and .MAP formats:

$ vcftools --vcf example.vcf --plink

By default, this will create two new files, out.map and out.ped. As KING will accept PLINK’s binary PED or BED (I know, it’s confusing given the UCSC BED format for defining intervals) format, we next use PLINK to convert the PED and MAP files to a single BED file:

$ plink --file out --make-bed

Using the genotypes present in the VCF (now binary PED or BED) file, we are now ready to use KING to estimate the kinship coefficient for our samples:

$ king -b plink.bed --kinship
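
If you run this often, the three steps can be chained in a small shell function. What follows is a rough sketch, not a polished tool: the function names, the optional prefix argument and the DRY_RUN variable are my own inventions, and it assumes vcftools, plink and king are all on your PATH. The only flag beyond those shown above is --out, which both vcftools and PLINK accept for naming their output files.

```shell
# _run CMD...: execute CMD, or just print it when DRY_RUN=1.
_run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# vcf_to_kinship VCF [PREFIX]: VCF -> PED/MAP -> BED -> KING kinship.
# PREFIX (default "out") names the intermediate PLINK files.
vcf_to_kinship() {
    vcf=$1
    prefix=${2:-out}
    _run vcftools --vcf "$vcf" --plink --out "$prefix" &&
    _run plink --file "$prefix" --make-bed --out "$prefix" &&
    _run king -b "$prefix.bed" --kinship
}

# Preview the commands without running anything:
#   DRY_RUN=1; vcf_to_kinship example.vcf
```

Because --out is passed to PLINK here, the binary file is named after the prefix (out.bed by default) rather than plink.bed; KING's own output file names are unaffected.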

By screening the output (I’ve truncated it for clarity and simplicity), we can see that KING reports, as the 8th column, the kinship coefficient between the samples in the VCF file.  For example, whereas sample 7 (s7) has very little kinship with samples 11 and 13 (0.0013 and 0.0052, respectively), the kinship coefficient between samples 2 and 14 (0.4970) suggests that the two samples are from the same individual, or are monozygotic twins.  In this case, the same DNA sample was mistakenly sequenced twice.

$ cat king.kin0
FID1   ID1    FID2     ID2     N_SNP      HetHet    IBS0     Kinship
s7     s7     s11      s11     245941     0.090     0.0694   0.0013
s7     s7     s13      s13     247179     0.088     0.0713   0.0052
s2     s2     s14      s14     251994     0.234     0.0071   0.4970
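
With more than a handful of samples, eyeballing the Kinship column gets tedious, so it can help to label each pair automatically. Here is a minimal awk sketch; the function name classify_pairs and the label strings are my own. The cutoffs (0.354, 0.177, 0.0884 and 0.0442) are the inference ranges suggested in the KING documentation for duplicates/MZ twins and first-, second- and third-degree relatives, so treat them as rules of thumb rather than hard boundaries. The columns are looked up by header name, since the full KING output may contain more columns than the truncated listing above.

```shell
# classify_pairs FILE: print "ID1 ID2 Kinship label" for each pair in a
# KING .kin0-style file. Columns are located by header name.
classify_pairs() {
    awk '
    NR == 1 {
        for (i = 1; i <= NF; i++) {
            if ($i == "ID1")     a = i
            if ($i == "ID2")     b = i
            if ($i == "Kinship") k = i
        }
        next
    }
    a && b && k {
        phi = $k + 0                       # force numeric comparison
        if      (phi > 0.354)  label = "duplicate/MZ-twin"
        else if (phi > 0.177)  label = "first-degree"
        else if (phi > 0.0884) label = "second-degree"
        else if (phi > 0.0442) label = "third-degree"
        else                   label = "unrelated"
        print $a, $b, $k, label
    }' "$1"
}

# Example: classify_pairs king.kin0
```

On the listing above, this would flag the s2/s14 pair as a duplicate (or monozygotic twins) and leave the two s7 pairs as unrelated.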