I am currently investigating a rare phenotype among a cohort of ostensibly unrelated individuals. Together with a colleague here at the University of Virginia, I have performed deep exome sequencing of each individual in our cohort in an effort to identify new genetic variation that accounts for the phenotype. In so doing, we noticed (rather by chance) that two of our samples had strikingly similar genotypes at several loci that we were interested in. This led quickly to the suspicion of two possible scenarios. First, it might be that the two samples were closely related. Alternatively, it could be that the two sample DNAs were from the same individual; in other words, there was a laboratory mix-up that led to the same individual being sequenced twice.
Now, were the samples to have come from the same individual (or monozygotic twins), one would expect that well over 95% of the genotypes for the two samples would be identical (the remaining 5% being discordant owing to fluctuations in sequence coverage that would lead to missed heterozygote calls in one sample or the other). Yet if the samples are related, but their relatedness is more cryptic, we need a more powerful statistic to describe the level of relatedness between our samples. The kinship coefficient (f) is a measure of relatedness that represents the probability that two alleles, one sampled at random from each sample/individual, are identical by descent. For example, take a mother and child. If you sample an allele at a given site from the child, there is a 0.50 probability that that allele came from the mother. Given that the allele came from the mother, there is a 0.50 probability that the allele chosen at random in the mother is the same as the allele already chosen in the child. These are independent probabilities, so the kinship coefficient for a parent (P) and child (C) is:
fPC = 0.50 * 0.50 = 0.25.
People are often confused by the fact that the kinship coefficient for a self-self comparison or a monozygotic twin comparison is 0.50. Yet recall that we are sampling alleles at random (with replacement) and testing for identity by descent. Thus, even for the same individual, we have a 0.50 probability of choosing the same parental allele twice.
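The sampling argument can be checked by brute-force enumeration. The following is a minimal sketch, not part of any published tool: it assumes two unrelated, non-inbred parents whose four alleles carry distinct hypothetical labels (m1, m2, f1, f2), and the function names are my own.

```python
from fractions import Fraction
from itertools import product

# Hypothetical allele labels: each founder allele is distinct, which encodes
# the assumption that the parents are unrelated and not inbred.
mother = ("m1", "m2")
father = ("f1", "f2")


def kinship(a, b):
    """Probability that one allele drawn at random from a and one drawn at
    random from b carry the same label (i.e. are identical by descent)."""
    matches = sum(x == y for x, y in product(a, b))
    return Fraction(matches, len(a) * len(b))


def children():
    """All equally likely children: one maternal and one paternal allele."""
    return list(product(mother, father))


def mean(values):
    values = list(values)
    return sum(values, Fraction(0)) / len(values)


# Average over every equally likely transmission.
f_parent_child = mean(kinship(mother, c) for c in children())
f_self = mean(kinship(c, c) for c in children())
f_full_sibs = mean(kinship(c1, c2) for c1, c2 in product(children(), repeat=2))

print(f_parent_child, f_self, f_full_sibs)  # prints: 1/4 1/2 1/4
```

The self-self value of 1/2 falls out of the same rule as the others: of the four ways to draw two alleles from one individual with replacement, only two pick the same parental allele twice.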
The following table lists the kinship coefficient (and the coefficient of relatedness, r = 2*f) for several common cases.
Relationship        Kinship coefficient   Coefficient of relatedness
Self                0.5000                1.0000
Monozygotic twins   0.5000                1.0000
Parent-child        0.2500                0.5000
Full siblings       0.2500                0.5000
Half siblings       0.1250                0.2500
First cousins       0.0625                0.1250
Unrelated           0.0000                0.0000
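Real estimates rarely land exactly on these values, so in practice one bins them. Here is a small sketch that assigns an estimated kinship coefficient to a degree of relationship. The boundaries (0.354, 0.177, 0.0884, 0.0442) are the inference cutoffs given in the KING documentation as I understand them, sitting midway (on a log2 scale) between the expected values above; the function name and labels are my own, and the cutoffs should be adjusted if your version of KING documents different ones.

```python
# Lower bounds for each class, from most to least related. These are assumed
# to match KING's documented inference criteria (2**-1.5, 2**-2.5, ...).
KING_CUTOFFS = [
    (0.354, "duplicate or monozygotic twin"),
    (0.177, "first degree (parent-child or full siblings)"),
    (0.0884, "second degree (e.g. half siblings)"),
    (0.0442, "third degree (e.g. first cousins)"),
]


def classify(kinship):
    """Return a coarse relationship label for an estimated kinship coefficient."""
    for lower_bound, label in KING_CUTOFFS:
        if kinship > lower_bound:
            return label
    return "unrelated or more distant"


# The expected values from the table, plus one near-zero estimate.
for f in (0.5, 0.25, 0.125, 0.0625, 0.0013):
    print(f, classify(f))
```

Note that parent-child and full-sibling pairs share the same expected kinship coefficient, so this value alone cannot tell them apart.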
The kinship coefficient is extremely useful for identifying cryptic relationships among samples (e.g. population stratification, inbreeding, etc.) and as a potent means of quality control. For example, as I describe here, it can be used to identify identical samples, as well as unexpected relationships among ostensibly unrelated samples.
My colleagues Ani Manichaikul and Wei-min Chen, both Assistant Professors in the Center for Public Health Genomics at The University of Virginia, recently published a new software package (KING) for rapidly computing the kinship coefficient (among many other useful statistics). However, KING does not directly support the VCF format as input. One can get around this problem by running vcftools, then PLINK, and then KING. Below are the necessary steps:
$ vcftools --vcf example.vcf --plink
By default, this will create two new files, out.map and out.ped. As KING will accept PLINK’s binary PED or BED format (I know, it’s confusing given the UCSC BED format for defining intervals), we next use PLINK to convert the PED and MAP files to binary format. By default this writes plink.bed along with its companion files plink.bim and plink.fam:
$ plink --file out --make-bed
Using the genotypes present in the VCF (now binary PED or BED) file, we are now ready to use KING to estimate the kinship coefficient for our samples:
$ king -b plink.bed --kinship
By screening the output (I’ve truncated it for clarity and simplicity), we can see that KING reports, as the 8th column, the kinship coefficient between the samples in the VCF file. For example, whereas sample 7 (s7) has very little kinship with samples 11 and 13 (0.0013 and 0.0052, respectively), the kinship coefficient between samples 2 and 14 (0.4970) suggests that the two samples are from the same individual, or are monozygotic twins. In this case, the same DNA sample was mistakenly sequenced twice.
$ cat king.kin0
FID1  ID1  FID2  ID2  N_SNP   HetHet  IBS0    Kinship
s7    s7   s11   s11  245941  0.090   0.0694  0.0013
s7    s7   s13   s13  247179  0.088   0.0713  0.0052
...
s2    s2   s14   s14  251994  0.234   0.0071  0.4970
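With more than a handful of samples, scanning this file by eye gets tedious. Below is a minimal sketch for pulling out the most closely related pairs. It assumes a whitespace-delimited file whose header contains the ID1, ID2 and Kinship columns shown above. The helper names and the 0.1 threshold are hypothetical choices of mine, and the embedded text is just the truncated output from this post rather than a complete file.

```python
import io


def read_kin0(handle):
    """Parse KING pairwise output into a list of dicts keyed by column name."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or truncated lines
        row = dict(zip(header, fields))
        row["Kinship"] = float(row["Kinship"])
        rows.append(row)
    return rows


def related_pairs(rows, min_kinship):
    """Return (ID1, ID2, kinship) tuples at or above min_kinship, highest first."""
    hits = [
        (r["ID1"], r["ID2"], r["Kinship"])
        for r in rows
        if r["Kinship"] >= min_kinship
    ]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Stand-in for open("king.kin0"): the rows shown in this post.
example = io.StringIO(
    "FID1 ID1 FID2 ID2 N_SNP HetHet IBS0 Kinship\n"
    "s7 s7 s11 s11 245941 0.090 0.0694 0.0013\n"
    "s7 s7 s13 s13 247179 0.088 0.0713 0.0052\n"
    "s2 s2 s14 s14 251994 0.234 0.0071 0.4970\n"
)

rows = read_kin0(example)
for id1, id2, f in related_pairs(rows, min_kinship=0.1):
    print(id1, id2, f)  # prints: s2 s14 0.497
```

Looking columns up by name rather than by position keeps the sketch working if a different KING version adds or reorders columns, provided the three names above are still present.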