
Thursday, March 29, 2012

DEFAULT keyword performance

I have a function which performs a query and returns a table. The one
parameter that can get passed in is a date which defaults to NULL.
There is an IF statement in the function that will set the parameter to
an actual date if null. If I call the function while passing in a date
the function comes back a second or two later. But if I pass in DEFAULT
to the function, the same query takes 8 minutes. See the code and
sample call below.

CREATE FUNCTION fCalculateProfitLossFromClearing (
@TradeDate DATETIME = NULL
)
RETURNS @t TABLE (
[TradeDate] DATETIME,
[Symbol] VARCHAR(15),
[Identity] VARCHAR(15),
[Exchange] VARCHAR(5),
[Account] VARCHAR(10),
[Value] DECIMAL(18, 6)
)
AS
BEGIN
-- Use previous trading date if none specified
IF @TradeDate IS NULL
SET @TradeDate = Supporting.dbo.GetPreviousTradeDate()

-- Make the query
INSERT @t
SELECT
@TradeDate,
tblTrade.[Symbol],
tblTrade.[Identity],
tblTrade.[Exchange],
tblTrade.[Account],
SUM((CASE tblTrade.[Side] WHEN 'B' THEN -ABS(tblTrade.[Quantity])
ELSE ABS(tblTrade.[Quantity]) END) * (tblPos.[ClosingPrice] -
tblTrade.[Price])) AS [Value]
FROM
Historical.dbo.ClearingTrade tblTrade
LEFT JOIN Historical.dbo.ClearingPosition tblPos ON (@TradeDate =
tblPos.[TradeDate] AND tblTrade.[Symbol] = tblPos.[Symbol] AND
tblTrade.[Identity] = tblPos.[Identity])
WHERE
([TradeTimestamp] >= @TradeDate AND [TradeTimestamp] < DATEADD(DAY,
1, @TradeDate))
GROUP BY tblTrade.[Symbol],tblTrade.[Identity],tblTrade.[Exchange],tblTrade.[Account]

RETURN
END

If I call the function as

SELECT * FROM fCalculateProfitLossFromClearing('09/25/2003')

it returns in 2 seconds.

If I call the function as

SELECT * FROM fCalculateProfitLossFromClearing(DEFAULT)

in which GetPreviousTradeDate() will set @TradeDate to 09/25/2003, it
returns in 8 minutes.

[posted and mailed, please reply in news]

Jason (JayCallas@hotmail.com) writes:
> I have a function which performs a query and returns a table. The one
> parameter that can get passed in is a date which defaults to NULL.
> There is an IF statement in the function that will set the parameter to
> an actual date if null. If I call the function while passing in a date
> the function comes back a second or two later. But if I pass in DEFAULT
> to the function, the same query takes 8 minutes. See the code and
> sample call below.

The query seems familiar. :-)

The reason for this is that when SQL Server builds the query plan,
it considers the value of the input parameter. When you provide an
explicit date, SQL Server can consult the statistics for the table
and see that the value you provided is very selective, and use the
index.

But if you provide NULL, SQL Server will build the query plan on that
assumption. Obviously a NULL value would return no rows, but SQL Server
never makes any assumptions that could yield incorrect results. Since
your WHERE condition is a range, SQL Server appears to prefer scanning
the table to using a non-clustered index.
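
One practical consequence (a sketch of the idea, not tested against this schema): if the caller resolves the default itself, the function never sees NULL and its plan can be built for a real, selective date. Erland formalizes the same idea with a wrapper procedure later in this thread.

DECLARE @TradeDate DATETIME
SET @TradeDate = Supporting.dbo.GetPreviousTradeDate()
-- The function's query now compiles against an actual date, not NULL
SELECT * FROM fCalculateProfitLossFromClearing(@TradeDate)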

--
Erland Sommarskog, SQL Server MVP, sommar@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

> The query seems familiar. :-)
> The reason for this is that when SQL Server builds the query plan,
> it considers the value of the input parameter. When you provide an
> explicit date, SQL Server can consult the statistics for the table
> and see that the value you provided is very selective, and use the
> index.
> But if you provide NULL, SQL Server will build the query plan on that
> assumption. Obviously a NULL value would return no rows, but SQL Server
> never makes any assumptions that could yield incorrect results. Since
> your WHERE condition is a range, SQL Server appears to prefer scanning
> the table to using a non-clustered index.

I hate to restart this thread but I have hit a brick wall...

I am at a loss over whether to create functions or stored procedures (or
even views).

The below questions/issues are based on a need to return information
on a particular date for one to many symbols.

For my example let's say 15 symbols. You could do a query like Symbol =
'a' OR Symbol = 'b' OR Symbol... but it would make more sense to do
Symbol IN ('a','b',...).

I would also like my functions and stored procedures to use a
default date if none is specified. I created a function named
GetPreviousTradeDate() which does this based on a calendar.

So here is how I see it.

Stored procedures seem to be the fastest in terms of returning data
back. But they seem to be limited in the sense that they can return
ONE row or ALL the rows, since you cannot pass in a variable list of
symbols. You also cannot use the SP as part of a query. You could just
return all the rows back to the client and do the filtering or searching
on that end, but that does not seem efficient or professional.

A function also does not let you pass in a variable list of symbols
but at least you can use it in a query. You could do something like
SELECT * FROM TheFunction() WHERE Symbol IN ('a','b',...). All this
happens on the server side and only the needed rows get sent back.

But functions seem to have MAJOR performance problems when default
values are passed in. When I pass in a specific date the query takes a
few seconds but when I pass in DEFAULT and set the date to the results
of the GetPreviousTradeDate() function the query takes anywhere from 8
minutes to 15 minutes. (This even happens if I do not use the
GetPreviousTradeDate() function and set my parameter to a hard-coded
value)

Any thoughts or comments would be appreciated.

Jason (JayCallas@hotmail.com) writes:
> A function also does not let you pass in a variable list of symbols
> but at least you can use it in a query. You could do something like
> SELECT * FROM TheFunction() WHERE Symbol IN ('a','b',...). All this
> happens on the server side and only the needed rows get sent back.

I happen to have an article on my web site that discusses this case
in detail. You don't have to read all of it, but you can just get
the function you need at
http://www.algonet.se/~sommar/array...html#iterative.
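
For readers without access to the article, here is a minimal sketch of an iterative list-to-table function in the same spirit (this is an illustration, not Erland's exact code; the name, delimiter, and sizes are assumptions):

CREATE FUNCTION dbo.SplitList (@list VARCHAR(8000))
RETURNS @t TABLE (value VARCHAR(50) NOT NULL) AS
BEGIN
   DECLARE @pos INT, @next INT
   SET @pos = 1
   WHILE @pos <= LEN(@list)
   BEGIN
      -- Find the next comma, or pretend there is one just past the end
      SET @next = CHARINDEX(',', @list, @pos)
      IF @next = 0 SET @next = LEN(@list) + 1
      INSERT @t (value)
      VALUES (LTRIM(RTRIM(SUBSTRING(@list, @pos, @next - @pos))))
      SET @pos = @next + 1
   END
   RETURN
END

Used together with a table function like yours, the filter stays on the server:

SELECT * FROM TheFunction() WHERE Symbol IN (SELECT value FROM dbo.SplitList('a,b,c'))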

> But functions seem to have MAJOR performance problems when default
> values are passed in. When I pass in a specific date the query takes a
> few seconds but when I pass in DEFAULT and set the date to the results
> of the GetPreviousTradeDate() function the query takes anywhere from 8
> minutes to 15 minutes. (This even happens if I do not use the
> GetPreviousTradeDate() function and set my parameter to a hard-coded
> value)

The difference is not always that big, but apparently your query is
sensitive to this. I suggest that you split the procedure in two:

CREATE PROCEDURE outer_sp @date datetime = NULL AS
IF @date IS NULL
   SELECT @date = dbo.yourfunctionfordefault()
EXEC inner_sp @date

And then inner_sp includes the actual query.

For a long treatise on this subject, search on Google news for articles
by Bart Duncan (an escalation engineer at Microsoft) and the phrase
"parameter sniffing".

--
Erland Sommarskog, SQL Server MVP, sommar@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

JayCallas@hotmail.com (Jason) wrote in message news:<f01a7c89.0310141510.28e9c846@posting.google.com>...
> I hate to restart this thread but I have hit a brick wall...
> I am at a loss over whether to create functions or stored procedures (or
> even views).
> The below questions/issues are based on a need to return information
> on a particular date for one to many symbols.
> For my example let's say 15 symbols. You could do a query like Symbol =
> 'a' OR Symbol = 'b' OR Symbol... but it would make more sense to do
> Symbol IN ('a','b',...).
> I would also like my functions and stored procedures to use a
> default date if none is specified. I created a function named
> GetPreviousTradeDate() which does this based on a calendar.
> So here is how I see it.
> Stored procedures seem to be the fastest in terms of returning data
> back. But they seem to be limited in the sense that they can return
> ONE row or ALL the rows, since you cannot pass in a variable list of
> symbols. You also cannot use the SP as part of a query. You could just
> return all the rows back to the client and do the filtering or searching
> on that end, but that does not seem efficient or professional.
> A function also does not let you pass in a variable list of symbols
> but at least you can use it in a query. You could do something like
> SELECT * FROM TheFunction() WHERE Symbol IN ('a','b',...). All this
> happens on the server side and only the needed rows get sent back.
> But functions seem to have MAJOR performance problems when default
> values are passed in. When I pass in a specific date the query takes a
> few seconds but when I pass in DEFAULT and set the date to the results
> of the GetPreviousTradeDate() function the query takes anywhere from 8
> minutes to 15 minutes. (This even happens if I do not use the
> GetPreviousTradeDate() function and set my parameter to a hard-coded
> value)
> Any thoughts or comments would be appreciated.

Since the stored procedure has both the speed and the ability to use
default values without a performance hit, would it be normal practice or
efficient to send separate queries for each symbol to the stored
procedure? This could result in anywhere from a few to several hundred
calls at a time.

Jason (JayCallas@hotmail.com) writes:
> Since the stored procedure has both the speed and the ability to use
> default values without a performance hit, would it be normal practice or
> efficient to send separate queries for each symbol to the stored
> procedure? This could result in anywhere from a few to several hundred
> calls at a time.

That does not seem like a good idea. It is certainly more efficient to
get data for all symbols at once. See my previous post for suggestions.

--
Erland Sommarskog, SQL Server MVP, sommar@algonet.se

Books Online for SQL Server SP3 at
http://www.microsoft.com/sql/techin.../2000/books.asp

>> A function also does not let you pass in a variable list of symbols
but at least you can use it in a query ... Any thoughts or comments
would be appreciated. <<

Ever try putting the list of symbols into a one-column table and using
an "IN (SELECT parm FROM Parmlist)" instead?

Wednesday, March 21, 2012

Dedupe Query

Below, I have some T-SQL to create a table, add some sample records and do a dedupe query. This all works but performance is poor on large data sets and I was wondering if someone had any optimization tips.

Thanks to blindman for helping me develop this version.

Some background: the staging table has undeduped records about users. I want to dedupe, keeping for each email address the record with the fewest NULL fields. I don't want to mix fields from different records. The staging table is currently unindexed but that can be changed.

CREATE TABLE StagingRecords
(
EmailAddress VARCHAR(75) NULL,
FirstName VARCHAR(255) NULL,
LastName VARCHAR(255) NULL,
StreetAddress VARCHAR(255) NULL,
City VARCHAR(255) NULL,
State VARCHAR(255) NULL,
ZipCode VARCHAR(255) NULL,
RecordID INT IDENTITY NOT NULL
)

INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@hotmail.com', 'John', 'Doe', NULL, NULL, NULL, NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@hotmail.com', 'Abe', 'Abelman', NULL, NULL, 'MI', NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@hotmail.com', 'Zach', 'Zedcynsky', NULL, NULL, 'TX', NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('usera@hotmail.com', 'Mary', 'Jane', NULL, NULL, NULL, NULL)

INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@yahoo.com', 'Cletus', 'Van Damme', NULL, NULL, NULL, NULL)
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@yahoo.com', 'Alfonse', 'Ackbar', NULL, NULL, 'AL', '12345')
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@yahoo.com', 'Zoom', 'Zuckerman', NULL, NULL, 'NJ', '54321')
INSERT INTO StagingRecords (EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode)
VALUES ('zzz@yahoo.com', 'Mary', 'Jane', NULL, 'Springfield', NULL, NULL)

SELECT
NULLCountTable.MinNULLCount, IDTable.TargetRecordID,
StagingRecords.EmailAddress, StagingRecords.FirstName, StagingRecords.LastName, StagingRecords.StreetAddress, StagingRecords.City, StagingRecords.State, StagingRecords.ZipCode
FROM
(SELECT EmailAddress
, MIN(CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS MinNULLCount
FROM StagingRecords
GROUP BY EmailAddress) AS NULLCountTable
INNER JOIN
(SELECT Min(RecordID) AS TargetRecordID, EmailAddress
, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END) AS NULLCount
FROM StagingRecords
GROUP BY EmailAddress, (CASE WHEN StagingRecords.FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StagingRecords.ZipCode IS NULL THEN 1 ELSE 0 END)) AS IDTable
ON (NULLCountTable.EmailAddress = IDTable.EmailAddress AND NULLCountTable.MinNULLCount = IDTable.NULLCount)
INNER JOIN StagingRecords ON (StagingRecords.EmailAddress = IDTable.EmailAddress AND StagingRecords.RecordID = IDTable.TargetRecordID)

This has been running on an undeduped table of 170 million records for over 2.5 hours and has yet to output a single row.

Recently, people in this forum said I should use set-based solutions such as this over cursors. I should get dramatic/exponential performance gains. And if I wasn't seeing that, then I wasn't doing it right. Well, I must not be doing this right, so I'm asking for help.

This seems like it must do extra logic and sorting that the cursor-based code doesn't have to do. For example, the cursor code doesn't need unique staging record IDs and never has to join on them, or perform a fraction of the joining of this query.

The provided SQL should be easy to run and experiment with in tempdb.

Any help is much appreciated.

There are probably more efficient solutions, but this approach is about 4 times faster than blindman's more complex query
if you build the view and indexes (which may make the whole thing a wash in the end).

Also, you suggested that your cursor solution appeared faster. It may return some rows faster than blindman's query, but
his suggestion is still fairly efficient. It will read the table 2-3 times, but iterating through each row with a cursor,
even if you read each row only once, will still be MUCH SLOWER to complete the entire process.

SET CONCAT_NULL_YIELDS_NULL ON
SET ARITHABORT ON

--Create a view with a calculated column for the count of NULL fields
CREATE VIEW v_StagingRecords WITH SCHEMABINDING AS
SELECT EmailAddress,
CASE WHEN FirstName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN LastName IS NULL THEN 1 ELSE 0 END
+ CASE WHEN StreetAddress IS NULL THEN 1 ELSE 0 END
+ CASE WHEN City IS NULL THEN 1 ELSE 0 END
+ CASE WHEN State IS NULL THEN 1 ELSE 0 END
+ CASE WHEN ZipCode IS NULL THEN 1 ELSE 0 END NullCount, RecordID
FROM dbo.StagingRecords

--3 seconds. 1700 reads
CREATE UNIQUE CLUSTERED INDEX vSR_IDX on v_StagingRecords (RecordID)
CREATE INDEX v_EN_IDX on v_StagingRecords (EmailAddress, NullCount)

--100,000 row table. 0.6 seconds, 800 reads vs 2.8 seconds, 2200 reads for original query.
SELECT * FROM v_StagingRecords WHERE RecordID IN
(
SELECT (SELECT TOP 1 RecordID FROM v_StagingRecords WITH (NOEXPAND)
WHERE EmailAddress = o.EmailAddress Order BY NullCount ASC) RecordID
FROM v_StagingRecords o WITH (NOEXPAND)
GROUP BY EmailAddress
)
ORDER BY EmailAddress

I loaded 17 million rows into a test table and built the view, indexes, and ran the query. It took 16 minutes in total, 101 seconds for the query itself. I tried blindman's query on the same data and it took 113 seconds, so scrap my suggestion. I might play with it a little more and see if I can come up with anything better.
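
If the server is SQL Server 2005 or later, a single-pass alternative worth trying (a sketch with the same tie-breaking as the original query, untested at this scale) is ROW_NUMBER():

SELECT EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode
FROM (
   SELECT EmailAddress, FirstName, LastName, StreetAddress, City, State, ZipCode,
      ROW_NUMBER() OVER (
         PARTITION BY EmailAddress
         ORDER BY CASE WHEN FirstName IS NULL THEN 1 ELSE 0 END
                + CASE WHEN LastName IS NULL THEN 1 ELSE 0 END
                + CASE WHEN StreetAddress IS NULL THEN 1 ELSE 0 END
                + CASE WHEN City IS NULL THEN 1 ELSE 0 END
                + CASE WHEN State IS NULL THEN 1 ELSE 0 END
                + CASE WHEN ZipCode IS NULL THEN 1 ELSE 0 END,
                RecordID) AS rn
   FROM StagingRecords
) AS ranked
WHERE rn = 1

This reads the table once, ranks each email address's records by NULL count (lowest RecordID wins ties, as in the original), and keeps only the top row per address.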

Monday, March 19, 2012

DecryptByCert performance

I have a bunch of encrypted rows in a table and a stored procedure to select those rows.
It looks like this:
SELECT CAST(DecryptByCert(Cert_ID('CertId'), field1) AS VARCHAR) AS f1,
CAST(DecryptByCert(Cert_ID('CertId'), field2) AS VARCHAR) AS f2,
CAST(DecryptByCert(Cert_ID('CertId'), field3) AS VARCHAR(255)) AS f3
FROM [table]

This stored procedure takes a really long time even with only hundreds of rows, so I suspect that I am doing something wrong. Is there any way to optimize this stored procedure?

Encryption/decryption by an asymmetric key (i.e. a certificate) is much slower than when using a symmetric key. The recommended way to encrypt data is to use a symmetric key (or a set of keys if you prefer) to protect the data (i.e. an AES or 3DES key), and protect those keys using the certificate.
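
A minimal sketch of that pattern, reusing the CertId certificate from the question (the key name is a placeholder, and TRIPLE_DES is chosen because AES requires OS support):

-- One-time setup: a symmetric key protected by the existing certificate
CREATE SYMMETRIC KEY DataKey
    WITH ALGORITHM = TRIPLE_DES
    ENCRYPTION BY CERTIFICATE CertId

-- The key must be open before EncryptByKey/DecryptByKey can use it
OPEN SYMMETRIC KEY DataKey DECRYPTION BY CERTIFICATE CertId

-- Re-encrypt the existing data once with the symmetric key
UPDATE [table]
SET field1 = EncryptByKey(Key_GUID('DataKey'), CAST(DecryptByCert(Cert_ID('CertId'), field1) AS VARCHAR(255)))

-- Selects then use the much cheaper symmetric decryption
SELECT CAST(DecryptByKey(field1) AS VARCHAR(255)) AS f1 FROM [table]

CLOSE SYMMETRIC KEY DataKey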

Another limitation you should consider is that asymmetric key encryption in SQL Server 2005 is limited to only one block of data. This means that the maximum amount of plaintext you can encrypt is limited by the modulus of the private key. In the case of RSA 1024 (in case you are using SQL Server 2005-generated certificates), the maximum plaintext is 117 bytes.
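
A quick way to see this limit (assuming an RSA 1024 certificate, as discussed above) is that EncryptByCert returns NULL rather than raising an error when the plaintext is too long:

SELECT EncryptByCert(Cert_ID('CertId'), REPLICATE('x', 117))  -- succeeds: 128-byte modulus minus 11 bytes of padding
SELECT EncryptByCert(Cert_ID('CertId'), REPLICATE('x', 118))  -- returns NULL: one byte over the limit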

I am also including an additional thread from this forum that talks about this topic:

http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=114314&SiteID=1

Thanks a lot,

-Raul Garcia

SDE/T

SQL Server Engine

Decreased performance from using blobs

I'm trying to store files such as PDFs in SQL Server as blobs for a particular .NET application, and I'm noticing a rather large performance hit compared to when the PDF is simply stored in the web server's file system.

I have a 7 MB PDF. If it is on a web server it takes roughly 9 seconds to download. When I store the same PDF in my database and then write it onto the page using the code below, it takes roughly 40 seconds to display. Can anyone help me improve this performance difference? Should I not be using this method to store large files?

Dim PdfCol As Integer = 0 ' the column # of the BLOB field
Dim dr As SqlDataReader = cmd.ExecuteReader()

If dr.Read() Then
Dim mypdf(dr.GetBytes(PdfCol, 0, Nothing, 0, Integer.MaxValue) - 1) As Byte
dr.GetBytes(PdfCol, 0, mypdf, 0, mypdf.Length)
End If

Response.Clear()
Response.Buffer = True
Response.ClearContent()
Response.ClearHeaders()
Response.ContentType = "application/pdf"
Response.BinaryWrite(mypdf)
Response.Flush()
Response.Close()

* Turn off output caching, so that the first data can be sent immediately.
* Do NOT load the whole thing into memory and send it out. Instead grab little chunks and send them. Start with 8k, then take something like 64k a trip.
Currently you only start sending data once the whole 7 MB is loaded from the database, and this is ridiculously inefficient.

By output cache, do you mean the output in the Page/UC directive?
Also, could you show me an example of grabbing chunks and sending them out? I'm not real experienced with using bytes.

I tried the following but I get a dialogue window that says "The file is damaged and cannot be repaired."

If dr.Read() Then
Dim intBufferSize As Integer = 64000
Dim longStartIndex As Long = 0
Dim longRetVal As Long
Dim b(intBufferSize) As Byte

'Read through the first chunk of the document
longRetVal = dr.GetBytes(PdfCol, 0, b, 0, intBufferSize)

'Reset the start index.
longStartIndex = 0

Response.Clear()
Response.Buffer = True
Response.ClearContent()
Response.ClearHeaders()
Response.ContentType = "application/pdf"

'Continue reading and writing while there are bytes beyond the size of the buffer.
While longRetVal = intBufferSize
Response.BinaryWrite(b)
Response.Flush()

'Reposition the start index to the end of the last buffer and fill the buffer.
longStartIndex += intBufferSize
longRetVal = dr.GetBytes(PdfCol, longStartIndex, b, 0, intBufferSize)
End While

'Write the remaining buffer.
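'NOTE (likely cause of the "file is damaged" error): BinaryWrite(b) sends the
'entire buffer, but only longRetVal bytes of it came from the final read; the
'rest are leftover bytes from the previous chunk. (Also, Dim b(intBufferSize)
'in VB allocates intBufferSize + 1 bytes.) Writing just the valid bytes, e.g.
'Response.OutputStream.Write(b, 0, CInt(longRetVal)), avoids the corruption.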
Response.BinaryWrite(b)
Response.Flush()
Response.Close()
End If

Ok, I've managed to write my blob in chunks to a specified path on my hard drive using the code from: http://support.microsoft.com/default.aspx?kbid=317043
But whenever I try to use a Response.BinaryWrite so that the PDF is loaded onto the web page with ContentType set to "application/pdf" or "application/octet-stream", I get the same error that the file is damaged and cannot be repaired.

Thona, you are awesome! I forgot to mention that earlier :)

I did finally get this to work. Apparently if my buffer size is too large the file gets corrupted. I set my buffer size to 24000 and everything seems to work OK, but if the buffer size is 29000 then the file gets corrupted. Anyone know why this is? Can bytes only be a certain size?