Thursday, August 10, 2017

Imprecise Science Part 1: AncestryDNA


Are all your DNA matches really a match?

Debbie Kennett recently published her methodology and results from comparing her own AncestryDNA matches to her parents’ matches in order to identify false matches (either a false positive at child level, i.e. identical by chance, or a false negative at parent level, or some horrible algorithmic glitch). Debbie’s post builds on analysis done by Blaine Bettinger and other DNA genealogists, and she links to their respective results in her post.

As I have tested myself and both my parents, I thought I’d do the same. There's some extra commentary at the end regarding Wing one-place study implications, just to keep this on-topic for this blog!

Summary

Of my 14,522 matches:

  • 4,470, or 31%, are not shared with either parent
  • 4,528, or 31%, are shared with my dad (which represents 44% of his matches)
  • 5,547, or 38%, are shared with my mum (which represents 34% of her matches)
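For anyone wanting to repeat the comparison, here is a minimal Python sketch of the bookkeeping, assuming each match list has already been exported as a set of match identifiers. The function names and the sample IDs are hypothetical and not anything AncestryDNA provides. The last part re-derives the percentages above from the counts in this post. Note that the three counts add up to 23 more than the total, which would be consistent with a handful of matches being shared with both parents.

```python
def classify_matches(child, father, mother):
    """Split a child's match IDs by which parent also has the match.

    The "father" and "mother" groups can overlap when a match is
    shared with both parents.
    """
    child, father, mother = set(child), set(father), set(mother)
    return {
        "neither": child - father - mother,
        "father": child & father,
        "mother": child & mother,
    }


def percent(part, whole):
    """Share of `whole` made up by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Hypothetical match IDs, purely for illustration.
groups = classify_matches(
    child={"m1", "m2", "m3", "m4", "m5"},
    father={"m1", "m2", "x9"},
    mother={"m2", "m3", "y7"},
)
print({name: sorted(ids) for name, ids in groups.items()})

# Counts taken from the summary above.
total = 14522
counts = {"neither": 4470, "dad": 4528, "mum": 5547}
for name, count in counts.items():
    print(name, percent(count, total), "%")

# How far the three counts overshoot the total.
overlap = sum(counts.values()) - total
print("overshoot:", overlap)
```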

15 cM looks like a good cutoff point above which any match is almost certainly going to be legitimate. Sadly only 3.75% of my matches are above that level! 

Conversely, any matches below 7 cM are more likely to be identical by chance than identical by descent.

This is pretty similar to the kind of picture Debbie saw with her AncestryDNA results.
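The two cutoffs mentioned in this summary can be written down as a rough rule of thumb. The sketch below is only an illustration: the band labels and the function name are made up, the 7 and 15 cM defaults come from my own results rather than from any AncestryDNA documentation, and a match landing in a given band is a likelihood, not a guarantee.

```python
def confidence_band(total_cm, low=7.0, high=15.0):
    """Label a match by its total shared cM using two rough cutoffs.

    Above `high`: almost certainly a legitimate match.
    Below `low`: more likely identical by chance than by descent.
    Anything in between (boundaries included) is left as uncertain.
    """
    if total_cm > high:
        return "likely genuine"
    if total_cm < low:
        return "likely identical by chance"
    return "uncertain"


# Hypothetical cM totals, purely for illustration.
for cm in (6.5, 10.0, 15.0, 25.0):
    print(cm, confidence_band(cm))
```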


Data

Here are my personal matches, broken down by the cM total length:



Those 3 matches of 50cM or more are my parents plus my grandmother’s brother.

Of those 21 matches of 25cM or more, only 1 has a tree of more than a couple of generations. I’ll save that rant for another time…

Here are my matches broken down into slightly more cM bins, along with what does and doesn’t match at least one of my parents:


See that wee outlier? I have a 25 cM match that on the face of it doesn’t look like a real match, because she matches neither of my parents. I’ve done some digging into this one: her kit is administered by her daughter, and the daughter IS a match to both me and my father. The daughter and my Dad share 17cM, yet I match the daughter at 18cM (over 2 segments, whereas my Dad's match is a single segment) and the mum at 25cM! I’m still trying to get my head around that.


What Now?

I knew there were some false matches knocking around, but to be honest I’m a bit disappointed the proportion is as high as it is. It does mean that if I’m doing any further analysis of my matches, e.g. following Twigs of Yore’s visualisation exercises using NodeXL, I should start with a match list that’s had those false positives removed where possible rather than a raw match list. Not only would the list be about 30% shorter and more manageable, it would be more accurate.

Our DNA raw data has also been uploaded to FamilyTreeDNA, MyHeritage and GEDmatch, and I’m planning to run comparisons on my match results at each of those places (update: follow the links to see the comparisons). At MyHeritage, I have an alleged 94cM total match that is not a match to either of my parents! I suspect that in reality I do not have a gorgeous blonde Norwegian cousin (with whom I apparently also somehow share 0% ethnicity); I think something must be awry in the algorithms.

Wing

I currently have 8 matches at AncestryDNA who have a family tree that has someone born in Wing, Buckinghamshire in it. That doesn’t mean we’re related through Wing (or even at all), although you can bet I’m keen to find out one way or the other. In one case it looks like I can tell: one of them (who even bears one of my Wing surnames!) isn’t a match to either of my parents, so most likely he is a false positive match and we are not related. Is that a sad trombone I hear?



5 comments:

Debbie Kennett said...

Thank you Alex for doing this interesting analysis. I was intrigued by your breakdown of your large mismatch. This does appear to be a case of a false negative in the parent but I'm just as confused as you are by what's going on here. I still don't have any sense of the ratio of false positives versus false negatives in our data. This exercise has, however, demonstrated the uncertainty of the matches on smaller segments which we will need to factor in if we are using these segments to make inferences about genealogical relationships.

I'd be interested to know which version of the AncestryDNA chips you and your parents were tested on. I was tested on the v1 chip and my parents were on the v2 chip so I don't know what impact, if any, this had on my analysis.

Jim Bartlett said...

Alex,
We need to be careful about calling all of these Matches false (the ones that don't match either parent). Some of them are truly false and are not a segment from an Ancestor; but the rest of them are false negatives. That means they are really a true Match to you, and the "falseness" is just that the algorithm doesn't match them to a parent. There have been many reports that the AncestryDNA TIMBER algorithm sometimes doesn't report true matches up to 40cM or so. A false negative is really a true Match, although at the lower cM levels (say under 10-15cM), the Common Ancestor may be beyond our reach. Bottom line: our "false" Matches at AncestryDNA are not as high as they appear. It will be interesting to see what you determine through GEDmatch.

Alex Coles said...

Debbie, all three of the tests were on v1 chip so in my case there wouldn't have been any impact from chip variation.

Jim, thanks for your comments. I won't discard all those potentially-dodgy matches just yet! I'm interested to see what turns up in GEDmatch (or at FamilyTreeDNA for that matter) where other AncestryDNA testers have uploaded their kits and I can identify them both on Ancestry and GEDmatch - I'll definitely report back on whether the matching situation changes or not.

Debbie Kennett said...

Alex, It's interesting that all your family were tested on the v1 chip. I wonder if the move to the v2 chip might produce fewer non-matches between parents and children.

It will certainly be interesting to do some comparisons at GEDmatch and FTDNA. As you've tested both your parents you'll be able to use phased kits at GEDmatch. This is an exercise I still need to do.

As Jim says, a lot of the non-matches might well be false positives but at the moment we have no way of distinguishing between false positives and false negatives.

Jim, Ancestry's Timber algorithm is downweighting high-frequency segments. While these segments might be real segments in the sense that they have survived phasing, it doesn't mean that they are genealogically relevant. The frequency of a segment is just as important as its size when trying to detect recent IBD. Segments shared by large numbers of people are highly unlikely to be recent.

Jason Lee said...

It would be interesting to see how this breaks down with single segment matches.

 