Duplicate search merge problem
Files included (all zipped for faster download):
1033812.ged
Merge test 103.rmgc (duplicates color coded before merge)
Merge test 104.rmgc (results of duplicate search merge)
Written by Helen (Rootsmagic user)
I noticed that some individuals in my database with the same name and birth date, but with different record numbers, were not offered in a duplicate search merge.
This is especially a problem when you download your own or someone else’s tree from Ancestry. Their process results in many duplicates in their trees.
To check this I used a GEDCOM file which I downloaded from Ancestry. I knew it was full of duplicates. The GEDCOM is “1033812.ged”.
I put this GEDCOM into an RM4 database called “Merge test 101.rmgc”. It had :-
1555 people
391 families
7318 citations.
I did a full automatic merge. That cleaned out a few
duplicates. Properties now had :-
1203 people
354 families
6328 citations.
I color coded the remaining duplicates. That file is called “Merge test 103.rmgc”
I did a duplicate search merge, and merged a lot more of the
duplicates. Properties show :-
1168 people
347 families
5917 citations.
I called that database “Merge test 104.rmgc”.
But I found that a number of duplicates with the same birth year still existed, about 140 of them. Those duplicates were never offered in the duplicate search merge, no matter how often I repeated the search.
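One way to list such leftover candidates outside RootsMagic is to group exported records by name and birth year. The sketch below is only an illustration: it assumes each person has been reduced to a (record number, name, birth year) tuple, and the sample records and the function name `find_same_name_birth_year` are invented for the example. They are not taken from the test file or from RootsMagic itself.

```python
from collections import defaultdict


def find_same_name_birth_year(people):
    """Group record numbers by (normalized name, birth year).

    Returns only the groups that contain more than one record,
    i.e. the pairs a duplicate search might be expected to offer.
    """
    groups = defaultdict(list)
    for record_id, name, birth_year in people:
        # Normalize case and repeated spaces so trivial differences still match.
        key = (" ".join(name.lower().split()), birth_year)
        groups[key].append(record_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical sample records, not real data from 1033812.ged.
sample_people = [
    (101, "John Smith", 1850),
    (245, "john  smith", 1850),
    (300, "Mary Jones", 1872),
    (412, "John Smith", 1851),
]

leftover = find_same_name_birth_year(sample_people)
for (name, year), ids in leftover.items():
    print(name, year, ids)
```

This only matches exact names and years, so it is far cruder than a real duplicate search, but it is enough to count how many obvious pairs remain.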
When I tested the results by putting the GEDCOM “1033812.ged” into RM3, there were no duplicates left after all the merges were done, and
the properties were :-
984 people
286 families
854 citations.
It is possible to merge people manually one pair at a time,
but it is very time consuming, and I have lots of info on Ancestry that it would be nice to avoid typing in. But
I can’t use it if the merges don’t work well.

I ran a comparison test on Helen's file, using just the RM automerge functions, on the
three RM versions I currently have to hand. All three should work exactly the same, but they don't. The main observations
from this are :-
1. RM4093 is not as effective as RM326 on Automerges
2. RM4096 is not as effective on Automerges as RM4093.
This leaves more hands-on work for the user and therefore is not good; it
certainly is not progress. In the table below the best scores are highlighted green and the worst highlighted red,
so as you can see the current version (4.0.9.6) is the least effective at automatically merging duplicate
individuals.
| | Origin file | RM326 | RM4093 | RM4096 |
|---|---|---|---|---|
| People | 1555 | 1036 | 1111 | 1203 |
| Families | 391 | 302 | 325 | 354 |
| Events | 2784 | 1822 | 2013 | 2198 |
| Places | 647 | 647 | 647 | 647 |
| Sources | 1 | 1 | 1 | 1 |
| Citations | 1424 | 905 | 980 | 1072 |
| Repositories | 1 | 1 | 1 | 1 |
One idea that has been put forward is to allow some user input into what qualifies as a match, allowing
the user to effectively set where the bar should be. Although very useful, this is also dangerous, and I would
suggest that the user be forced to back up the file before any such routine is run, and also be encouraged through
a popup to examine the resulting file afterwards.
I know from experience when using Duplicate Search Merge that RM-generated scores of 45
and above are invariably true matches. I would like to be able to scan down through the presented possible matches,
decide my safe point score, set this break point and run the merge again based on that user input,
therefore removing a lot of manual work.
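The break point idea can be sketched roughly as follows. This is an illustration of the request, not of how RootsMagic works internally: the candidate list, the `score` field and the function name `split_by_score` are all hypothetical, and only the threshold of 45 comes from the experience described above.

```python
def split_by_score(candidates, safe_score):
    """Split candidate pairs at a user-chosen break point.

    Pairs scoring at or above safe_score go to the automatic merge list;
    everything else is left for manual review.
    """
    auto_merge = [c for c in candidates if c["score"] >= safe_score]
    manual_review = [c for c in candidates if c["score"] < safe_score]
    return auto_merge, manual_review


# Hypothetical candidate pairs (record numbers) with invented match scores.
candidates = [
    {"pair": (101, 245), "score": 62},
    {"pair": (300, 518), "score": 45},
    {"pair": (412, 733), "score": 31},
]

auto_merge, manual_review = split_by_score(candidates, safe_score=45)
print("merge automatically:", [c["pair"] for c in auto_merge])
print("review by hand:", [c["pair"] for c in manual_review])
```

Keeping the low-scoring pairs in a separate list, rather than discarding them, means the user can still examine them by hand after the automatic pass.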
It would seem from the results above that the speeding up of these operations, which was
introduced in version 4.0.9.5, has more to do with some logic being removed than with slicker programming.
Measured by the number of people remaining after automerge, 4.0.9.6 leaves 8.3% more than 4.0.9.3 (1203 against 1111)
and 16% more than RM3 (1203 against 1036).
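For anyone wanting to check the percentages, they follow from the People row of the table. The short calculation below uses only those three figures; the helper name `percent_more` is just for illustration.

```python
# People remaining after automerge, taken from the People row of the table.
people_after_automerge = {"RM326": 1036, "RM4093": 1111, "RM4096": 1203}


def percent_more(current, baseline):
    """Percentage by which current exceeds baseline."""
    return (current - baseline) / baseline * 100


vs_4093 = percent_more(people_after_automerge["RM4096"],
                       people_after_automerge["RM4093"])
vs_rm3 = percent_more(people_after_automerge["RM4096"],
                      people_after_automerge["RM326"])

print(f"4.0.9.6 vs 4.0.9.3: {vs_4093:.1f}% more people remaining")
print(f"4.0.9.6 vs RM3:     {vs_rm3:.1f}% more people remaining")
```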
As files get larger, programmed help in finding possible duplicates and merging them
needs to get better rather than less effective, and I do hope the developers can take note.