Below, I have some T-SQL to create a table, add some sample records and do a dedupe query. This all works but performance is poor on large data sets and I was wondering if someone had any optimization tips.
Thanks to blindman for helping me develop this version.
Some background: the staging table has undeduped records about users. I want to dedupe and get the record with the most fields. I don't want to mix fields from different records. The staging table is currently unindexed but that can be changed.
CREATE TABLE StagingRecords
(
EmailAddress VARCHAR(75) NULL,
FirstName VARCHAR(255) NULL,
LastName VARCHAR(255) NULL,
StreetAddress VARCHAR(255) NULL,
City VARCHAR(255) NULL,
State VARCHAR(255) NULL,
ZipCode VARCHAR(255) NULL,
RecordID INT IDENTITY NOT NULL
)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@.hotmail.com', 'John', 'Doe', NULL, NULL, NULL, NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@.hotmail.com', 'Abe', 'Abelman', NULL, NULL, 'MI', NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@.hotmail.com', 'Zach', 'Zedcynsky', NULL, NULL, 'TX', NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@.hotmail.com', 'Mary', 'Jane', NULL, NULL, NULL, NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@.yahoo.com', 'Cletus', 'Van Damme', NULL, NULL, NULL, NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@.yahoo.com', 'Alfonse', 'Ackbar', NULL, NULL, 'AL', '12345')
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@.yahoo.com', 'Zoom', 'Zuckerman', NULL, NULL, 'NJ', '54321')
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@.yahoo.com', 'Mary', 'Jane', NULL, 'Springfield', NULL, NULL)
SELECT
NULLCountTable.MinNULLCount, IDTable.TargetRecordID,
StagingRecords.EmailAddress, StagingRecords.FirstName, StagingRecords.LastName, StagingRecords.StreetAddress, StagingRecords.City, StagingRecords.State, StagingRecords.ZipCode
FROM
(SELECT EmailAddress
, MIN(CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS MinNULLCount
FROM StagingRecords
GROUP BY EmailAddress) AS NULLCountTable
INNER JOIN
(SELECT Min(RecordID) AS TargetRecordID, EmailAddress
, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS NULLCount
FROM StagingRecords
GROUP BY EmailAddress, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END)) AS IDTable
ON (NULLCountTable.EmailAddress = IDTable.EmailAddress AND NULLCountTable.MinNULLCount = IDTable.NULLCount)
INNER JOIN StagingRecords ON (StagingRecords.EmailAddress = IDTable.EmailAddress AND StagingRecords.RecordID = IDTable.TargetRecordID)This has been running on an undeduped table of 170 million records for over 2.5 hours and has yet to output a single row.
Recently, people in this forum said I should use set based solutions such as this over cursors. I should get dramatic/exponential performance gains. And if I wasn't seeing that, then I wasn't doing it right. Well, I must not be doing this right so I'm asking for help.
This seems like it must do extra logic and sorting that the cursor based code doesn't have to do. For example, the cursor code doesn't need unique staging record IDs and never has to join on them or perform a fraction of the joining of this query.
The provided SQL should be say to run and experiment with on a tempdb.
Any help is much appreciated.|||There are probably more efficient solutions, but this approach is about 4 times faster than blindmans more complex query
if you build the view and indexes (which may make the whole thing a wash in the end)
Also, you suggested that your cursor solution appeared faster. It may return some rows faster than blindmans query, but
his suggestion is still fairly efficient. It will read the table 2-3 times but iterating through each row with a cursor,
even if you read each row only once, will still be MUCH SLOWER to complete the entire process.
SET CONCAT_NULL_YIELDS_NULL ON
SET ARITHABORT ON
--Create a view with a calculated colum for count of null records
CREATE VIEW v_StagingRecords WITH SCHEMABINDING AS
SELECT EmailAddress,
CASE WHEN FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN ZipCode IS NULL THEN 1 ELSE 0 END NullCount, RecordID
FROM dbo.StagingRecords
--3 seconds. 1700 reads
CREATE UNIQUE CLUSTERED INDEX vSR_IDX on v_StagingRecords (RecordID)
CREATE INDEX v_EN_IDX on v_StagingRecords (EmailAddress, NullCount)
--100,000 row table. 0.6 seconds, 800 reads vs 2.8 seconds, 2200 reads for original query.
SELECT * FROM v_StagingRecords WHERE RecordID IN
(
SELECT (SELECT TOP 1 RecordID FROM v_StagingRecords WITH (NOEXPAND)
WHERE EmailAddress = o.EmailAddress Order BY NullCount ASC) RecordID
FROM v_StagingRecords o WITH (NOEXPAND)
GROUP BY EmailAddress
)
ORDER BY EmailAddress|||I loaded 17 million rows into a test table and built the view, indexes, and ran the query. It took 16 minutes in total, 101 seconds for the query itself. I tried blindmans query on the same data it it took 113, seconds so scrap my suggestion. I might play with it a little more and see if I can come up with anything better.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment