Wednesday, March 21, 2012

Deduping to multiple outputs

I know that you can dedupe with the Sort transformation, but that seems to drop the dupes completely. Is there a way to dedupe and have the uniques go to one output, while the dupes go to another?

One way that I've accomplished this is to Multicast the source. Feed one output to an Aggregate transform that groups by the duplicated fields and selects count(*). The output of the Aggregate is then fed into a Conditional Split, with one condition being count(*) == 1 and the other being count(*) > 1. You can then Merge Join each of the two outputs with another output of the original Multicast, joining on the duplicated fields, and continue on with your flow. Keep in mind that Merge Join requires both of its inputs to be sorted on the join keys.
It's a roundabout way of getting there, but it works. It's also not very performant, because each output of the Multicast requires a memory copy for each row.
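To make the count-then-split logic easier to follow, here is a minimal sketch of the same idea in Python rather than SSIS. It only illustrates the logic, not how the package is built: the function name split_by_duplication, the sample rows, and the "email" key field are all made up for the example.

```python
from collections import Counter


def split_by_duplication(rows, key_fields):
    """Split rows into (uniques, dupes) based on how often each key occurs.

    rows must be a list, since it is scanned more than once. key_fields
    names the columns that define a duplicate (a hypothetical parameter
    for this sketch).
    """
    def key_of(row):
        return tuple(row[field] for field in key_fields)

    # Plays the role of the Aggregate: count(*) grouped by the key fields.
    counts = Counter(key_of(row) for row in rows)

    # Plays the role of the Conditional Split plus the join back to the
    # original rows: route each full row by the count for its key.
    uniques = [row for row in rows if counts[key_of(row)] == 1]
    dupes = [row for row in rows if counts[key_of(row)] > 1]
    return uniques, dupes


# Made-up sample data: ids 1 and 3 share the same email.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},
]
uniques, dupes = split_by_duplication(rows, ["email"])
```

Note that, as in the data flow described above, every row whose key appears more than once lands in the dupes output, including the first occurrence.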
Larry
