Recently we had really nice meetup gathering here in Auckland NZ. Was a beautiful evening, there were 20 people talking about SQL server, was perfect. One of the topics was “String functions in the where clause and bad estimates”
For this demo I used SQL Server 2014 Developer Edition.
As a database I used StackOverflow, you can download this DB from Brent Ozar web page, just follow the instructions
How to Download the Stack Overflow Database via BitTorrent
Using string function in the WHERE clause can cause bad estimates.
I will query TAB Badges (has just clustered index on ID column). In the WHERE clause I will use STRING FUNCTION LEFT. LEFT Returns the left part of a character string with the specified number of characters.
We can see that we have 155202 logical reads, and query returning 2511 rows
(2511 row(s) affected)
Table ‘Badges’. Scan count 1, logical reads 155202, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Since we do not have Non-Clustered index on that table optimizer use Clustered index to scan whole table
If you want to check size of your indexes you can use following query
Lets now check the estimates for that Clustered Index Scan
For 2511 Actual Number of Rows we have Estimated 2261260, over two million rows, and since we have a scan we are reading every single row in that table 25125061, over 25 million rows
Exact number of rows for Badges table.
Lets try now next query
It is the same query but this time without LEFT function within WHERE clause
Execution plan is still the same.
But estimates are more accurate now. Estimated number of rows is now 6748, actual number of rows is the same as before as well as number of rows read, since we still do not have non-clustered index on that table.
If we change our WHERE clause little bit more and use ‘=’ instead of ‘like’ we will have equal number of estimates and actual rows number. Number of rows read stay the same until we cover this query with non-clustered index
So lets add that index
If we run the same query again we will have following plan.
We can see that query optimizer using non-clustered index seek to retrieve data from the table and our estimates, actual number of rows and actual number of rows read are the same, 2511.
Now we have just 15 logical reads, instead of 155202 logical reads like in the first query.
(2511 row(s) affected)
Table ‘Badges’. Scan count 1, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
In the following example you can see how the bad estimates can cause excessive memory grant.
with column statistics, Sql server uses column histogram to estimate number of rows
I agree with you, this was just an example how estimated number of rows can be off with an actual.