Wednesday, March 21, 2012

De-duping data

I have a list of companies that all have the same Company Id but have formatting issues with the names. For example I have the following:

CompanyId Name
12345 Microsoft
12345 Microsoft Inc.
12345 Microsoft, Inc.

I figured I would use a simple data flow task to insert the company id and name into a company table where the primary key is the company id (I would then ignore the errors of duplicate keys). I was expecting to see the first instance of the company inserted into the company table and the dupes ignored. That's not the case - I get an error on duplicate primary keys. Can I not do this?

jrp210 wrote:

I have a list of companies that all have the same Company Id but have formatting issues with the names. For example I have the following:

CompanyId Name
12345 Microsoft
12345 Microsoft Inc.
12345 Microsoft, Inc.

I figured I would use a simple data flow task to insert the company id and name into a company table where the primary key is the company id (I would then ignore the errors of duplicate keys). I was expecting to see the first instance of the company inserted into the company table and the dupes ignored. That's not the case - I get an error on duplicate primary keys. Can I not do this?

This is more of a transact-sql question. The observed behavior is correct. In the sense of SSIS, you can use a sort transformation to eliminate dups. Or, you can structure your source query to do more of a select companyid, max(name) from table group by companyid.|||

You are right. But, how would a sort transorm work to eliminate dupes?

|||

jrp210 wrote:

You are right. But, how would a sort transorm work to eliminate dupes?

There is an option in the sort transformation (a check box) to eliminate dups. You'd have to sort only on the companyID though.

sql

No comments:

Post a Comment