Development of a small match procedure

Initially John Griffiths assembled 200 DNA matches arising from a study group of 11 possible family members; John's last report #4 is here, and his observations on this are here. Julian Land later inspected John's work, specifically the multiple segment matches arising from a subset of 7 of the relatives, and the result is reported here: just 2 of 133 superficially intriguing multiple matches survived scrutiny, the test being that they had to be stronger than the types of multiples a random sample threw up. This is potentially a tough test, because there may be unknown genetic relationships within the random sample that are closer than the Baruch Lousada relationships we are seeking to confirm (see here), and indeed between the random sample and members of the Lousada sample (see here). In any case the random sample, in this phase of the work, threw up surprising multiples, including both types of what we refer to as strong triples. The first type, with 3 people matching each other, shows that triangulation with small 3cM segments does not prove family linkage; the second type has 1 person matching 3 others. The chromosome 2 and 8 matches were the most impressive, since their strong triples were augmented by linked matches, unlike the strongest random multiples in this phase of the study (which were bare strong triples).
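
The two types of strong triple can be illustrated with a short sketch. It is not the procedure used in the study: it assumes the pairwise matches have already been restricted to one overlapping chromosome region, and the function name `strong_triples` and the single-letter person labels are invented for the example.

```python
from itertools import combinations


def strong_triples(pairs):
    """Classify strong triples among pairwise matches.

    `pairs` is an iterable of (person_a, person_b) tuples, assumed (for this
    sketch) to be matches that all fall on the same overlapping region.
    Returns (triangles, stars):
      triangles - trios in which all 3 people match each other
      stars     - people who match at least 3 others, with those others
    """
    # Store each match once, ignoring order and any self-matches.
    edges = {frozenset(p) for p in pairs if p[0] != p[1]}

    neighbours = {}
    for edge in edges:
        a, b = sorted(edge)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Type 1: 3 people who all match each other.
    triangles = [
        trio
        for trio in combinations(sorted(neighbours), 3)
        if all(frozenset(e) in edges for e in combinations(trio, 2))
    ]

    # Type 2: 1 person matching 3 (or more) others.
    stars = {
        hub: sorted(others)
        for hub, others in sorted(neighbours.items())
        if len(others) >= 3
    }
    return triangles, stars


# Invented toy data: A, B, C all match each other; D matches E, F and G.
example_pairs = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("D", "F"), ("D", "G"),
]
triangles, stars = strong_triples(example_pairs)
print(triangles)
print(stars)
```

A sketch of this kind only finds the bare triples; whether a triple is augmented by linked matches, as on chromosomes 2 and 8, would still need separate inspection.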

Instructive methodological points emerged through our work. In our 7-relative study we included a sample of 7 nominally unrelated people so that we could contrast cousin-cousin matches with unrelated-unrelated matches. For that work, with a 3cM threshold, GEDmatch advised that 'segment threshold size will be adjusted dynamically between 200 and 400 SNPs'; by the time of our later 2023 8 by 8 study this had changed to 'segment threshold size will be adjusted dynamically with an average of 200 SNPs. About 2/3 will occur between 185 and 214 SNPs'. We first noticed that the newer GEDmatch settings increased the number of matches between cousins relative to matches between unrelated people, to the point where we got 5% more cousin-cousin single matches than unrelated-unrelated single matches (more, not fewer as with the prior GEDmatch setting). This 5% 'signal' is of course small amid the 'noise' of all those off-target matches and/or false positives. But we also noticed that, with the newer GEDmatch setting, the number of cousin-cousin coincident multiple matches increased strongly, and by more than the number of unrelated-unrelated coincident multiple matches did (2 times compared with 1.4 times, as reported in our 2023 8 by 8 study). This suggests that John Griffiths' intuition is realised, at least to some extent. GEDmatch advised that its Qmatch technique should give further improvement when looking at 3cM segments, as we must with our 11-generation separations; our early experience with Qmatch is discussed here. But samples of unrelated people will continue to generate coincident multiple matches, so we continue to need ways of distinguishing real matches from off-target matches and/or false positives. Our discovery of rare segment boundary coincidences (RSBCs), in which the same boundary SNP is shared by two distinct overlapping matched pairs (that is, 4 separate individuals are represented at the SNP), an event with low odds, perhaps offered a way forward.
We first saw an RSBC (at 52269392) in the Cr8 match during our 7-relative study, and reported it in 'the coffee-table book'. This original RSBC is somewhat obscured by Qmatch (3cM P=3), because the left boundary of 2 of the matches was adjusted by a small number of nucleotide positions, as shown here. But through Qmatch we discovered more RSBCs and then, perhaps more importantly, ASBs and their significance.
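
The RSBC definition can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the method used in the study or anything provided by GEDmatch: the `Match` record and `find_rsbcs` function are invented names, each match is assumed to be available as a pair of people with a chromosome and start and end positions, and only same-side coincidences (two starts or two ends at the same position) are counted, which guarantees that the two segments overlap.

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Match(NamedTuple):
    """One pairwise segment match (hypothetical record layout)."""
    person_a: str
    person_b: str
    chrom: int
    start: int  # left boundary position
    end: int    # right boundary position


def find_rsbcs(matches):
    """Return (boundary, match_1, match_2) for each candidate RSBC.

    A candidate is two matches on the same chromosome that share the same
    left boundary or the same right boundary and involve 4 separate people.
    """
    by_boundary = defaultdict(list)
    for m in matches:
        by_boundary[(m.chrom, "start", m.start)].append(m)
        by_boundary[(m.chrom, "end", m.end)].append(m)

    hits = []
    for boundary, group in by_boundary.items():
        for m1, m2 in combinations(group, 2):
            people = {m1.person_a, m1.person_b, m2.person_a, m2.person_b}
            if len(people) == 4:  # two distinct pairs, no shared person
                hits.append((boundary, m1, m2))
    return hits


# Invented toy data: positions and person labels are not from the study.
example_matches = [
    Match("P1", "P2", 8, 1000, 5000),
    Match("P3", "P4", 8, 1000, 4000),  # same start, 4 separate people
    Match("P1", "P3", 8, 1000, 6000),  # same start, but shares a person
    Match("P6", "P7", 8, 2000, 7000),  # no shared boundary
]
hits = find_rsbcs(example_matches)
for boundary, m1, m2 in hits:
    print(boundary, m1, m2)
```

Because the comparison is on exact positions, a boundary moved by even a few nucleotide positions, as happened with Qmatch for the original Cr8 RSBC, would no longer be picked up by this sketch; a tolerance could be added if that behaviour is not wanted.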

After a lengthy exploration of both RSBCs and ASBs, and with further advice from GEDmatch, we were finally able to produce a conclusion from our small match work as shown here.