Wednesday, March 21, 2012

De-dupe question

Hi,
I need some help to de-duping an orders table. I have an orderitems table
containing OrderID, OrderItemID, (plus a few other columns), and we need to
clean
the data prior to migration. We have quite a few duplicates of orders, and
in moving to a template, or set based model, we want to create typical group
s
from the data.
If I can provide an OrderID as a parameter, can anyone let me know how I can
retrieve the set of all precisely matching candidates. The problem I'm
running into is that
I keep getting partial matches back, for example if OrderID 1 has OrderItems
1, 2 and 3, and OrderID 2 has OrderItems 2 and 3, I get OrderID 2 back as a
match for OrderID 1.
The query I'm using is as follows:
select * from OrderHistory where OrderItemID in (select OrderItemID from
OrderHistory where OrderID = 846863) and OrderID <> 846863.
This seems to match on full or subset rather than on complete match.
The columns of interest are (OrderID int, OrderItemID int)
Any help is greatly appreciated.
Thanks
mkCould you post DDL, sample data, and sample output?
http://www.aspfaq.com/5006
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--
"mk" <mk@.discussions.microsoft.com> wrote in message
news:E85B6C9F-EC19-4632-9E79-40ABE570AED3@.microsoft.com...
> Hi,
> I need some help to de-duping an orders table. I have an orderitems table
> containing OrderID, OrderItemID, (plus a few other columns), and we need
to
> clean
> the data prior to migration. We have quite a few duplicates of orders,
and
> in moving to a template, or set based model, we want to create typical
groups
> from the data.
> If I can provide an OrderID as a parameter, can anyone let me know how I
can
> retrieve the set of all precisely matching candidates. The problem I'm
> running into is that
> I keep getting partial matches back, for example if OrderID 1 has
OrderItems
> 1, 2 and 3, and OrderID 2 has OrderItems 2 and 3, I get OrderID 2 back as
a
> match for OrderID 1.
> The query I'm using is as follows:
> select * from OrderHistory where OrderItemID in (select OrderItemID from
> OrderHistory where OrderID = 846863) and OrderID <> 846863.
> This seems to match on full or subset rather than on complete match.
> The columns of interest are (OrderID int, OrderItemID int)
> Any help is greatly appreciated.
> Thanks
> mk|||CREATE TABLE [dbo].[OrderHistory] (
[OrderID] [int] NOT NULL ,
[OrderItemID] [int] NOT NULL
) ON [defgrp]
GO
There are other metadata columns but they are not relevant to this question.
A sample of the data I'm using is:
OrderID OrderItemID
846863 191731
846863 201746
846864 191731
846864 201746
846864 201747
846865 191731
846865 201746
846866 191732
846866 201747
846867 191732
846867 201747
846868 191732
846868 201747
My lastest SQL (getting more bloated and convoluted):
declare @.tot int
select @.tot = count(*) from _IntOrder where OrderID = 846863
select distinct i.OrderID from _IntOrder i where i.OrderItemID in (select
OrderItemID from _IntOrder where OrderID = 846863) and i.OrderID <> 846863
group by i.OrderID
having count(i.OrderItemID) = count(*)
I was hoping that this query would match the sets of Orders with the same
number of orderitems as 846863, where the sets were in 864863's membership,
i.e. identical orderitems, identical numbers of orderitems. As you can see
from the results this is not the case, I get 846864 as a match even though i
t
actually has an extra OrderItem - 201747.
Anyway, any help you can provide is greatly appreciated.
Kind regards,
mk
"Adam Machanic" wrote:

> Could you post DDL, sample data, and sample output?
> http://www.aspfaq.com/5006
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.sqljunkies.com/weblog/amachanic
> --
>
> "mk" <mk@.discussions.microsoft.com> wrote in message
> news:E85B6C9F-EC19-4632-9E79-40ABE570AED3@.microsoft.com...
> to
> and
> groups
> can
> OrderItems
> a
>
>|||I think this should do it:
select o2.orderid
from orderhistory o1
full join orderhistory o2 on o1.orderid = 846863
and o2.orderitemid = o1.orderitemid
and o2.orderid <> o1.orderid
group by o2.orderid
having count(o1.orderitemid) = count(o2.orderitemid)
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--
"mk" <mk@.discussions.microsoft.com> wrote in message
news:5A0AFFDF-86D3-423F-A03A-A9FD40AB7448@.microsoft.com...
> CREATE TABLE [dbo].[OrderHistory] (
> [OrderID] [int] NOT NULL ,
> [OrderItemID] [int] NOT NULL
> ) ON [defgrp]
> GO
> There are other metadata columns but they are not relevant to this
question.
> A sample of the data I'm using is:
> OrderID OrderItemID
> 846863 191731
> 846863 201746
> 846864 191731
> 846864 201746
> 846864 201747
> 846865 191731
> 846865 201746
> 846866 191732
> 846866 201747
> 846867 191732
> 846867 201747
> 846868 191732
> 846868 201747
> My lastest SQL (getting more bloated and convoluted):
> declare @.tot int
> select @.tot = count(*) from _IntOrder where OrderID = 846863
> select distinct i.OrderID from _IntOrder i where i.OrderItemID in (select
> OrderItemID from _IntOrder where OrderID = 846863) and i.OrderID <> 846863
> group by i.OrderID
> having count(i.OrderItemID) = count(*)
> I was hoping that this query would match the sets of Orders with the same
> number of orderitems as 846863, where the sets were in 864863's
membership,
> i.e. identical orderitems, identical numbers of orderitems. As you can
see
> from the results this is not the case, I get 846864 as a match even though
it
> actually has an extra OrderItem - 201747.
> Anyway, any help you can provide is greatly appreciated.
> Kind regards,
> mk
> "Adam Machanic" wrote:
>
table
need
orders,
I
I'm
as
from|||Hi,
Thanks for the response. Unfortunately this approach returns 846863 and
846865 as matches for 846864. The solution below is actually working for al
l
combinations, its just damn ugly, if anyone has any suggestions to clean it
up and improve performance or efficiency I'd very much appreciate it:
declare @.Order int, @.num int
declare @.vals table(ID1 int)
select @.Order = 846866
insert @.vals (ID1) select orderitemid from orderhistory where orderid =
@.Order
select @.num = count(*) from @.vals
select o.orderid from orderhistory o full outer join @.vals v on
o.orderitemid = v.id1
where o.orderid <> @.Order
group by o.orderid having count(o.orderitemid) = @.num and count(id1) = @.nu
m
Regards,
mk
"Adam Machanic" wrote:

> I think this should do it:
>
> select o2.orderid
> from orderhistory o1
> full join orderhistory o2 on o1.orderid = 846863
> and o2.orderitemid = o1.orderitemid
> and o2.orderid <> o1.orderid
> group by o2.orderid
> having count(o1.orderitemid) = count(o2.orderitemid)
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.sqljunkies.com/weblog/amachanic
> --
>
> "mk" <mk@.discussions.microsoft.com> wrote in message
> news:5A0AFFDF-86D3-423F-A03A-A9FD40AB7448@.microsoft.com...
> question.
> membership,
> see
> it
> table
> need
> orders,
> I
> I'm
> as
> from
>
>|||Here is a method I believe works... I had to re-read some texts on
Relational Division to get back into the right mindset. It's definitely a
rough thing to get your head around! I recommend you check out this post by
Joe Celko:
http://groups-beta.google.com/group...b1b2c
bb
To solve your problem, I plugged in the Exact Division pattern he provides:
SELECT o1.orderid
FROM orderhistory AS o1
LEFT OUTER JOIN
orderhistory AS o2
ON o1.orderitemid = o2.orderitemid
and o2.orderid = 846863
GROUP BY o1.orderid
HAVING COUNT(o1.orderid) = (SELECT COUNT(orderitemid) FROM orderhistory
where orderid=846863)
AND COUNT(o2.orderitemid) = (SELECT COUNT(orderitemid) FROM orderhistory
where orderid=846863)
Adam Machanic
SQL Server MVP
http://www.sqljunkies.com/weblog/amachanic
--
"mk" <mk@.discussions.microsoft.com> wrote in message
news:286E524E-8D3C-4FDF-A972-E069B76AB01E@.microsoft.com...
> Hi,
> Thanks for the response. Unfortunately this approach returns 846863 and
> 846865 as matches for 846864. The solution below is actually working for
all
> combinations, its just damn ugly, if anyone has any suggestions to clean
it
> up and improve performance or efficiency I'd very much appreciate it:
> declare @.Order int, @.num int
> declare @.vals table(ID1 int)
> select @.Order = 846866
> insert @.vals (ID1) select orderitemid from orderhistory where orderid =
> @.Order
> select @.num = count(*) from @.vals
> select o.orderid from orderhistory o full outer join @.vals v on
> o.orderitemid = v.id1
> where o.orderid <> @.Order
> group by o.orderid having count(o.orderitemid) = @.num and count(id1) =
@.num
>
> Regards,
> mk
>
> "Adam Machanic" wrote:
>
(select
846863
same
can
though
orderitems
we
typical
how
problem
back
OrderItemID
match.|||Thanks Adam, you're a star. That worked a treat (with a minor mod to preven
t
return of original OrderID), and thankfully avoids the hideous, horrendous
and shamefully embarassing use of the temporary table datatype!!!
Thanks also for the link, its certainly not easy to get good information on
these kind of subjects.
regards,
mk
"Adam Machanic" wrote:

> Here is a method I believe works... I had to re-read some texts on
> Relational Division to get back into the right mindset. It's definitely a
> rough thing to get your head around! I recommend you check out this post
by
> Joe Celko:
> http://groups-beta.google.com/group...b1b
2cbb
> To solve your problem, I plugged in the Exact Division pattern he provides
:
>
> SELECT o1.orderid
> FROM orderhistory AS o1
> LEFT OUTER JOIN
> orderhistory AS o2
> ON o1.orderitemid = o2.orderitemid
> and o2.orderid = 846863
> GROUP BY o1.orderid
> HAVING COUNT(o1.orderid) = (SELECT COUNT(orderitemid) FROM orderhistory
> where orderid=846863)
> AND COUNT(o2.orderitemid) = (SELECT COUNT(orderitemid) FROM orderhistor
y
> where orderid=846863)
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.sqljunkies.com/weblog/amachanic
> --
>
> "mk" <mk@.discussions.microsoft.com> wrote in message
> news:286E524E-8D3C-4FDF-A972-E069B76AB01E@.microsoft.com...
> all
> it
> @.num
> (select
> 846863
> same
> can
> though
> orderitems
> we
> typical
> how
> problem
> back
> OrderItemID
> match.
>
>

No comments:

Post a Comment