Thursday, June 2, 2011

SQL Server Indexes – An Insider Look - Part 3


Non-Clustered Index

What we saw till now is a very ideal situtation where we have a well defined key and can go ahead to create a clustered index. But what if the data is not so organized but we still need to have an index on the column. Enter the world of Non-Clustered Index. Where actually the data is not organized or ordered in any pre-existing condition, but still satisfy the purposes. The major difference is leaf level does not contain data. It includes pointers to the relevant data and the pointer is a simple RID column. RID columns include the information on extent, page and row offset.  Using this RID column it will go and fetch the data from different pages.
Another issue with with non-clustered index is it behaves differently for heap and clustered table. Here again we will see two scenarios. One Non-clustered index created on a heap and then created on a clustered table.

Scenario 3: Non-Clustered Index on Heap Table

Let us create a table called PRODUCTS as follows:
CREATE TABLE PRODUCTS (PRODUCTNUMBER NUMERIC(6), PRODUCTNAME VARCHAR(30), PRODUCTDESCRIPTION VARCHAR(50), PRODUCTTYPE VARCHAR(1))
Now let us create a non-clustered index as follows:
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNUMBER, PRODUCTNAME)
The following script will insert the data into this table.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=1
WHILE (@COUNTER <= 50000)
            BEGIN
                        INSERT INTO PRODUCTS VALUES
                        (CAST(@COUNTER AS VARCHAR),
                        'PRODUCTS ' + CAST(@COUNTER AS VARCHAR),
                        'DESCRIPTION ' + CAST(@COUNTER AS VARCHAR),
                        (case (@COUNTER % 2) when 1 then 'I' when 0 then 'R' else NULL end))
                        SET @COUNTER=@COUNTER+1
            END
END
Now let us use a simple select clause and observe the execution plan.
SELECT * FROM PRODUCTS WHERE PRODUCTNUMBER=2089
If you see this execution plan carefully, the following can be observed:
a)      Since we do not have clustered index in this table, it has used RID lookup to retrieve other two columns PRODUCTDESCRIPTION and PRODUCTTYPE
b)      It has used Index Seek to retrieve indexed columns PRODUCTNUMBER and PRODUCTNAME
c)       Internally what happens is the optimizer uses NESTED LOOPS to join the data from the results of the two operations viz Index Seek and RID Lookup.
Is this kind of execution plan is bad? Depends on the business needs and execution time. The above query took couple of seconds to execute. Let us now select only the indexed columns and see what is the execution plan for the same query.
SELECT PRODUCTNUMBER, PRODUCTNAME FROM PRODUCTS WHERE PRODUCTNUMBER=2089
Now the execution plan is changed, no RID lookup because we have selected only the indexed columns. What if we wanted to avoid RID lookup and at the same time most of the columns in the table to be selected? Use Include option along with the definition of the index creation as follows:
DROP INDEX PRODUCTS.PRODIDX
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNUMBER, PRODUCTNAME) INCLUDE PRODUCTDESCRIPTION, PRODUCTTYPE
The following select statement provides an execution plan
SELECT PRODUCTNUMBER, PRODUCTNAME FROM PRODUCTS WHERE PRODUCTNUMBER=2089
Please note that we did not get RID lookup, which is because of the columns included in the index creation. However, please be careful with INCLUDE column option. Please do benchmark before using it as it occupies more space in LEAF level and there on reducing the number of rows and increasing the number of IOs.

Scenario 4: Non-Clustered Index on Clustered Table

Now, in the same PRODUCTS table I have dropped the indexes and created two indexes as follows:
DROP INDEX PRODUCTS.PRODIDX
CREATE CLUSTERED INDEX PRODCLUSTIDX ON PRODUCTS(PRODUCTNUMBER)
CREATE NONCLUSTERED INDEX PRODIDX ON PRODUCTS(PRODUCTNAME)

Again we have the same number of records viz 50,000 records in the table. The following select statement produces the below execution plan:
Select * from products where productname=’PRODUCTS 9201’
 
If you read the execution plan, the following can be observed:
a)      Since we have clustered index in this table, it has used Key Lookup to retrieve other two columns PRODUCTDESCRIPTION and PRODUCTTYPE
b)      It has used Index Seek to retrieve indexed columns PRODUCTNUMBER and PRODUCTNAME
c)       Internally what happens is the optimizer uses NESTED LOOPS to join the data from the results of the two operations viz Index Seek and Key Lookup.
Further please note that in case if the KEY LOOKUP is to be avoided, use INCLUDE option while creating the index.
Non-Clustered index can be used as part of heap table or as part of clustered table. There is no hard and fast rule that it has to be used only with heap or clustered table. Before using the same we need to benchmark and test it completely.

Further Reading
Please don’t forget to visit this site to read my next articles on
Ø       Joins and Sets
Ø       Hierarchical Queries





SQL Server Indexes – An Insider Look - Part 2


Now let us start loading the data using the following script.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=300000
WHILE (@COUNTER <= 500000)
            begin
            INSERT INTO ACCOUNTS VALUES (@COUNTER, ('ACCOUNT ' + CAST(@COUNTER AS VARCHAR)), ('ACCOUNT DESC ' + CAST(@COUNTER AS VARCHAR)), (CASE (CAST(@COUNTER AS VARCHAR) % 4) WHEN 1 THEN 'A' WHEN 2 THEN 'L' WHEN 3 THEN 'E' WHEN 4 THEN 'I' ELSE NULL END))
            set @COUNTER=@COUNTER+1               
            end
END
It took 1 minute to execute the above script and Page Splits per second increased from 18,562 to 19,604. It took just 1042 page splits per second to load 150,000 records. Fragment details are provided below:
 
Now let us start working on the second set of data and observe the page splits and fragmentation.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=300000
WHILE (@COUNTER <= 500000)
            begin
            INSERT INTO ACCOUNTS VALUES (@COUNTER, ('ACCOUNT ' + CAST(@COUNTER AS VARCHAR)), ('ACCOUNT DESC ' + CAST(@COUNTER AS VARCHAR)), (CASE (CAST(@COUNTER AS VARCHAR) % 4) WHEN 1 THEN 'A' WHEN 2 THEN 'L' WHEN 3 THEN 'E' WHEN 4 THEN 'I' ELSE NULL END))
            set @COUNTER=@COUNTER+1               
            end
END
It took 1 minute and 39 seconds to execute the above script and Page Splits per second increased from 19,604 to 21,030. It took just 1426 page splits per second to load 200,000 records. Fragment details are provided below:
 
The most interesting part is the third set, which actually inserts the data in between the existing data. This will increase the page splits further and cause the cascading effects on the remaining data as well. This should not cause any problem if the clustering key is defined properly. Let us execute the following script to insert the missing data.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=150001
WHILE (@COUNTER <= 299999)
            begin
            INSERT INTO ACCOUNTS VALUES (@COUNTER, ('ACCOUNT ' + CAST(@COUNTER AS VARCHAR)), ('ACCOUNT DESC ' + CAST(@COUNTER AS VARCHAR)), (CASE (CAST(@COUNTER AS VARCHAR) % 4) WHEN 1 THEN 'A' WHEN 2 THEN 'L' WHEN 3 THEN 'E' WHEN 4 THEN 'I' ELSE NULL END))
            set @COUNTER=@COUNTER+1               
            end
END

It took 1 minute to execute the above script and Page Splits per second increased from 21,030 to 22,099. It took just 1069 page splits per second to load 150,000 records. Fragment details are provided below:
 

Scenario 2: Not So Well Defined Clustered Key

Now let us change the scenario and use a calculated column and index the same. By calculated column, I mean a column that is derived from another column or a combination of values that can be put in one column. For this, I will create another table called PurchaseMaster as follows:
CREATE TABLE [dbo].[purchasemaster](
            [purchasenumber] [varchar](20) NULL,
            [purchasedescription] [varchar](30) NULL,
            [vendorcode] [varchar](15) NULL,
            [dateofpurchase] [datetime] NULL
)
A clustered index was also created on the column purchasenumber as follows:
CREATE CLUSTERED INDEX PURCHASEMASTERIDX ON PURCHASEMASTER(PURCHASENUMBER)
The column PURCHASENUMBER is derived on the basis of date of purchase. For example, the following is the sample data available in the PURCHASENUMBER column.
purchasenumber
1\2011-2012
10\2011-2012
100\2011-2012
1000\2011-2012
101\2011-2012
102\2011-2012
103\2011-2012
104\2011-2012
105\2011-2012
106\2011-2012
Please note that each purchase number is appended with the financial year. The purchase number 100 has the financial year 2011-2012 and the same for the other purchase numbers. As soon as the person who is raising the purchase order looks at this, the person will know the financial year details. This is indeed very useful and infact very handy as well. Also, when we issue the following statement, we did not see any anomaly.
SELECT * FROM PURCHASEMASTER WHERE PURCHASENUMBER='1000\2011-2012'
The execution plan for the above select statement is provided below:
 
Now, if we see a range filter based query as follows, the results were not proper.
SELECT * FROM PURCHASEMASTER WHERE PURCHASENUMBER>'1\2011-2012' AND PURCHASENUMBER <'101\2011-2012'
purchasenumber            purchasedescription        vendorcode       dateofpurchase
10\2011-2012    Purchase 10                   Vendor 10         2011-05-30 19:08:48.653
100\2011-2012  Purchase 100                 Vendor 100        2011-05-30 19:08:48.687
1000\2011-2012 Purchase 1000               Vendor 1000      2011-05-30 19:08:49.217
Now there are no records in the table and entire data was truncated. We will now study the page splits and the fragmentation details for this clustered index. Initially the page splits is calculated using the following query and was observed to be 9325.
SELECT * FROM SYS.SYSPERFINFO WHERE COUNTER_NAME = 'PAGE SPLITS/SEC' AND OBJECT_NAME LIKE '%ACCESS METHODS%'
Also, the fragmentation details for this table and index is illustrated below:
 
Now we will insert two sets of data, purchase number from 1 to 150,000 and from 300,000 to 500,000 with a gap of 150,000 between the sets of data. After inserting every set of data, we will observe the corresponding page splits and fragment.
The following script will be used to generate these sets of data.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=1
WHILE (@COUNTER <= 150000)
            BEGIN
                        INSERT INTO PURCHASEMASTER VALUES
                        (CAST(@COUNTER AS VARCHAR) + '\' + cast(YEAR(GETDATE()) as varchar) + '-' + cast(YEAR(GETDATE())+1 as varchar),
                        'PURCHASE ' + CAST(@COUNTER AS VARCHAR),
                        'VENDOR ' + CAST(@COUNTER AS VARCHAR),
                        GETDATE())
                        SET @COUNTER=@COUNTER+1
            END
END
Inserting 150,000 records took just 1 minute and 3 seconds. The page splits/second and fragmentation after this are provided below:
From 9325, page splits/second has increased to 11523 which caused 2198 page splits per second during this insert operation.
Also the fragment details are provided below:
 
Please note that the fragmentaion is observed to be high in this case. Now, let us consider the next set, which is inserting next set of 200,000 records from 300,000 to 500,000.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=300000
WHILE (@COUNTER <= 500000)
            BEGIN
                        INSERT INTO PURCHASEMASTER VALUES
                        (CAST(@COUNTER AS VARCHAR) + '\' + cast(YEAR(GETDATE()) as varchar) + '-' + cast(YEAR(GETDATE())+1 as varchar),
                        'PURCHASE ' + CAST(@COUNTER AS VARCHAR),
                        'VENDOR ' + CAST(@COUNTER AS VARCHAR),
                        GETDATE())
                        SET @COUNTER=@COUNTER+1
            END
END
Inserting 200,000 records took just 1 minute and 18 seconds. The page splits/second and fragmentation after this are provided below:
From 11,523, page splits/second has increased to 14,947 which caused 3,424 page splits per second during this insert operation. Also the fragment details are provided below:
 
Wow, that is a great increase in fragmentation.
Now the final is what more important to observe both Page Splits and Fragmentation. We are now going to insert the missing records between 150,000 and 300,000 and observe the page splits. I will also explain diagramatically how this would happen. Now let us execute the same script by changing the values.
BEGIN
DECLARE @COUNTER AS INT
SET @COUNTER=150001
WHILE (@COUNTER <= 299999)
            BEGIN
                        INSERT INTO PURCHASEMASTER VALUES
                        (CAST(@COUNTER AS VARCHAR) + '\' + cast(YEAR(GETDATE()) as varchar) + '-' + cast(YEAR(GETDATE())+1 as varchar),
                        'PURCHASE ' + CAST(@COUNTER AS VARCHAR),
                        'VENDOR ' + CAST(@COUNTER AS VARCHAR),
                        GETDATE())
                        SET @COUNTER=@COUNTER+1
            END
END
The third batch took just 57 seconds to complete. Page Splits per second was observed to be 17,520 which caused 2,573 page splits per second during this insert operation. Also the fragment details are provided below:
 
Now having said this on page splits and fragmentation, how do we know if this is good or bad? Let us compare the scenario 1 and 2 in the following table and observe the results.
Description
Execution Time (mm:ss)
Page Splits
Fragmentation (AVG_FRAGMENTATION_ IN_PERCENT
FOR INDEX_LEVEL=0)
Scenario 1
1-150,000
01:00
1,042
0.68
300,000-500,000
01:39
1,426
0.57
150,001-299,999
01:00
1,069
0.62
Scenario 2
1-150,000
01:03
2,198
35.21
300,000-500,000
01:18
3,424
41.87
150,001-299,999
00:57
2,573
43.36
This shows in Scenario 2, as the cluster key is not well defined, the inserts to fill the missing gap (150,001 to 299,999) is observed to be high. Also, the fragmentation and Page Splits are quite high since the cluster key is not well defined. Whenever we need to define a clustered index on a key, please make sure the following:
a)      Key should be well defined, satisfying the business requirements and as well as the providing a greater selectivity.
b)      Key should not be defined on derived table and calculated column
c)       Other columns in the table should be completely dependent on the cluster key