I have an Sql database with several tables. It has about 75,000 records in it but also has a lot of dupes. Can anyone help me with a script to isolate the most possible dupes? One of the most common things I'm seeing is mispelled names.
Thanks.Check out this post:
http://www.dbforums.com/t1001522.html|||A good starting point would be to quickly link an Access front-end to your DB and then run the Access Find Duplicates query wizard to get an idea of how it does it. Unfortunately there is absolutely no way you are ever going to remove all duplicates from a database. I have had quite a lot of experience with this - NEVER promise anyone that you can do it!
One of the problems with removing cutomer/supplier dups is that if the duplicate customer/supplier also has records in other related tables you will then want to link that data to the duplicate you are keeping if u know what I mean...
Cheers|||I've tried access and the number I am getting seems unbelieveable. Do you know of any tools that might work?|||Hi,
There is software in the marketplace that will do this sort of thing but it all depends on how conisitently your data has been entered in the first place e.g. some users might enter 'Mr Matt McDonald' into a name field whilst others might enter 'Matt McDonald' - a standard database de-duping routine obviously wouldn't pick this up.
Whilst working for a mailing house and de-duping customer data I used software developed by QAS (www.qas.com) but it is very expensive and only works well if you have consistent customer address fields as it looks for postcodes etc. - as far as I'm aware most of the de-duping software works like this.
You say that Access returns more records that you think it should - maybe you need to change the criteria to specify what really are duplicates. Have you tried double-checking some of the results to see if they really are duplicates?? Other than that the only way round this is to write your own custom procedure using a combination of code (maybe VBA in access??) and Queries using wildcard characters e.g. Like * *
I don't know if this is any help to you but de-duping is different for each organisation and for each set of data...there is no one template fits all.
Matt|||Sure...you'll get an answer very fast if you post the DDL
AND what you consider a dup to be...it's not always black and white with some people
A dup to me is everything on the row is exactly the same...
Not just the PK
If you have a PK...which you don't because then you wouldn't have a dup
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment