Flashcards in Chapter 15 Implementing Indexes and Statistics Deck (61)
What is a page?
SQL Server internally organizes data in a data file in pages. A page is an 8KB unit and belongs to a single object; for example, to a table or an index. A page is the smallest unit of reading and writing.
Pages are physical structures.
What is an extent?
An extent consists of eight consecutive/contiguous pages. Pages from an extent can belong to a single object (uniform extent) or to multiple objects (mixed extent). SQL Server stores the first 8 pages of an object in mixed extents. When an object exceeds 8 pages, SQL Server allocates additional uniform extents for this object. With this type of organization, small objects waste less space and big objects are less fragmented.
What are the two types of logical structures SQL Server uses organize data in pages?
Heaps or Balanced Trees (B-Trees)
What is a heap?
The default structure for organizing pages is called a heap. Heaps occur when a B-Tree is not used to organize the data pages in a table. Specifically, if you do not create a clustered index explicitly (or implicitly through a PK or UNQ constraint), then a table is organized as a heap.
SQL Server can find data in a heap only by scanning the whole heap. SQL Server uses IAM pages to scan heaps in physical order or allocation order. Even if your query wants to retrieve only a single row, SQL Server has to scan the entire heap.
SQL Server stores new rows anywhere in the heap. They are inserted without an enforced order. It can store a new row in an existing page if the pages is not full or allocate a new page or extent for the object.
SQL Server does not allocate any pages for a table when you create it. It allocates the first page and also the first IAM page, when you insert the first row in the table.
What is an IAM page?
IAM stands for Index Allocation Map. These pages allow SQL Server to trace which pages and extents belong to an object. Every table or index has at least one IAM page called "first IAM". A single IAM page can point to approximately 4 GB of space. Large objects can have more than one IAM page. IAM pages are organized as a doubly linked list: each page has a pointer to its descendent and antecedent.
What does the sys.indexes catalog view provide?
sys.indexes provides general information about tables and indexes. The type column stores a value of 0 for heaps, 1 for clustered tables (indexes), and 2 for nonclustered indexes.
What does the sys.dm_db_index_physical_stats DMV provide?
These objects tell you how many pages are allocated for an object, page space (KB), row counts, internal/external fragmentation, etc.
For clustered indexes, it shows the index depth, index level, and page count per level.
What is internal fragmentation?
Internal fragmentation means that pages are not full. The more rows you have stored on a single page, the fewer pages SQL Server must read to retrieve these rows, and the less memory it uses for cached pages for the same number of rows.
In heaps you do not get much internal fragmentation because SQL Server stores new rows in existing pages if there is enough room.
However, in clustered indexes, when you insert a new row into a full page, SQL Server has to split the page into two pages and move half of the rows into the second page. This happens because SQL Server needs to maintain the logic order of rows. This way, you get some internal fragmentation.
What is external fragmentation?
Also, known as logical fragmentation, external fragmentation occurs when pages are physically out of order.
External fragmentation can slow down full or partial scans in logical order.
What is a balanced tree?
You organize a table as a balanced tree when you create a clustered index. The structure is called a balanced tree because it resembles an inverse tree. Every balanced tree has a single root page and at least one or more leaf pages. In addition, it can have zero or more intermediate levels.
All data in a clustered table is stored in leaf pages. Data is stored in logical order of the clustering key and is not physically ordered. SQL Server still uses IAM pages to follow the physical allocation.
Pages above the leaf level point to leaf-level pages. A row in a page above the leaf level contains a clustering key value and a pointer to a page where this value starts in logically ordered leaf level. SQL Server creates intermediate level pages which point to leaf level pages as necessary when it can no longer reference all leaf level pages.
Pages on the same level are organized as a doubly linked list; therefore, SQL Server can find the previous and the next page in logical order.
What is a clustering key?
A clustering key can consist of a single column or multiple columns. If the key consists of multiple columns, then this is a composite key.
You can have up to 16 columns in a key; the size of all columns together in a composite key must not exceed 900 bytes.
Can you use non-unique values as the key of a clustered index?
Yes. You can use a column or columns with unique or non-unique values for a key of a clustered index. However, SQL Server internally always maintains uniqueness of the clustering key. It adds a uniquifier value which is a sequential integer to the repeating values. The first value is stored without the uniquifier; the first repeating value gets a uniquifier with a value of 1, the second 2, and so on.
How can you control internal and external fragmentation?
You can control the internal fragmentation with the FILLFACTOR option for the leaf-level pages and with the PAD_INDEX option for the higher-level pages of the CREATE INDEX statement.
You can also rebuild or reorganize an index to get rid of the external fragmentation by using the ALTER INDEX...REORGANIZE or ALTER INDEX...REBUILD statements.
Why should you use a short clustering key?
A short clustering key means that more rows can fit on pages above the leaf level. Therefore, fewer levels are potentially needed. Fewer levels means a more efficient index because SQL Server needs to read fewer pages to find a row. A uniquifier extends the key; therefore, having a short and unique key is preferred for seeks.
What are some of the advantages of clustered indexes over heaps?
A clustered index scan can be done in logical order or, when the logical order is not needed, in physical order or allocation order. In addition, SQL Server can perform a partial scan if sequential rows in the order of the clustering key are requested by your query.
What are some disadvantages of clustered indexes compared to heaps?
When using a clustered index, when you insert a new row into a full page, SQL Server has to split the page into two pages and move half of the rows into the second page. This happens because SQL Server needs to maintain the logical order of rows. This way, you get some internal fragmentation which you cannot get in a heap.
What is a scan and a seek?
An index scan retrieves all rows from the table. Index seek retrieves selective rows from the table.
What type of clustering key should you select for different business scenarios such as OLTP and DW?
For OLTP applications, selecting a sequential integer as the clustering key is typically a very good choice.
However, in DW scenarios, many queries read huge amounts of data, typically ordered. For example, many DW queries search for rows in order of a date or datetime column. If this is the case, then you might prefer to support such a partial scan and create a clustered index on the date column.
How do randomly generated GUIDs (NEWID) measure up as clustering keys?
GUIDs for clustering keys can lead to quite inefficient indexes.
There will be multiple page splits and no page will physically be in the correct logical order, e.g. external fragmentation. The external fragmentation mainly slows down scans which should not be that frequent in OLTP environments; however, they are very important to data warehousing area.
Internal fragmentation is a problem in both scenarios because the table is much bigger than it would be with a sequential key.
What is the difference between reorganizing an index vs rebuilding an index?
You can get rid of fragmentation if you rebuild or reorganize the index.
Reorganizing an index is a slower but less intrusive process than rebuilding an index.
Reorganizing physically reorganizes the leaf nodes of the index while a rebuild drops the existing Index and recreates the index.
Rebuilding uses more CPU and it locks the database resources.
As a general guideline, you should reorganize an index when the external fragmentation is less than 30 percent and rebuild it if it is greater than 30 percent.
What type of clustering key would you select for an OLTP environment?
For an OLTP environment, a short, unique, and sequential clustering key might be the best choice.
What is a nonclustered index?
A nonclustered index does not contain all of the data and serves as a pointer to table rows for quick seeks.
Nonclustered indexes have a very similar structure to clustered ones. Actually, the root and intermediate levels look the same as in a clustered index. The leaf level is different because it does not hold all of the data. What is stored on the leaf level of a nonclustered index depends on the underlying table organization whether it is organized as a heap or as a balanced tree.
The leaf level of a nonclustered index contains the index keys and "row locators". A row locator points to a row in the underlying table.
If the table is a heap, then the row locator is called a row identifier (RID). This is an 8-byte pointer containing the database file ID, page ID of the target row, and the target row ID on that page. The operation of retrieving the row from the heap is called "RID lookup".
If the table is organized as a balanced tree, then the row locator is the clustering key. This means that when SQL Server seeks for a row, it has to traverse all levels on a nonclustered index and then also all levels of a clustered index. This operation is called a "key lookup".
What is a filtered non clustered index?
There are atypical situations where including all of the records in a table is less than ideal.To assist in these scenarios, nonclustered indexes can be filtered to reduce the number of records they contain.
Filtered nonclustered indexes are useful when some values in a column occur rarely whereas other values occur frequently. In such cases, you would create a filtered index over the rare values only. SQL Server uses this index for seeks of rare values, but performs scans for frequent values.
Filtered indexes are inexpensive to maintain, because SQL Server has to update them for changes in the rare values only.
You create a filtered index by adding a WHERE clause to the CREATE INDEX statement.
How can you use a filtered nonclustered index to enforce a filtered uniqueness?
For example, imagine that a column has NULLs in multiple rows; however, known values must be unique. You cannot create a filtered PK or UQ constraint; however, you could create a filtered unique nonclustered index from known values only which would allow multiple NULLs and reject duplicate known values.
What is a columnstore index?
In addition to regular row storage, SQL Server 2012 can store index data column by column in what's called a columnstore index. Columnstore indexes can speed up data warehousing queries by a large factor, from 10 to even 100 times.
A columnstore index is just another noclustered index on a table. The SQL Server Query Optimizer considers using the columnstore index during the query optimization phase just as it does any other index. All you have to do to take advantage of this feature is create a columnstore index on a table.
A columnstore index is stored compressed. The compression factor can be up to 10 times the original size of the index. When a query references a single column that is a part of a columnstore index, then SQL Server fetches only that column from disk; it doesn't fetch entire rows as with row storage. This also reduces disk I/O and memory cache consumption. Columnstore indexes use their own compression algorithm; you cannot use Row or Page compression on a columnstore index.
On the other hand, SQL Server has to return rows. Therefore, rows must be reconstructed when you execute a query. This row reconstruction takes some time and uses some CPU memory resources.
How does the performance of a columnstore index measure up?
Columnstore indexes accelerate data warehouse queries, not OLTP workloads.
Because of the row reconstruction issues and other overhead when you update compressed data, tables containing a columnstore index become read-only. If you want to update a table containing a columnstore index, you must first drop the index.
How are columnstore indexes organized?
The columnstore index is divided into units called segments. Segments are stored as large objects and consist of multiple pages. Segments are the unit of transfer from disk to memory. Each segment has metadata that stores the min and max value of each column for that segment. This enables early segment elimination in the storage engine. SQL Server loads only those segments requested by a query into memory.
What are indexed views?
You can optimize queries that aggregate data and perform multiple joins by permanently storing the aggregated and joined data.
You can create a view with a query that joins and aggregates data. Then you can index the view to get an indexed view. With indexing, you are materializing a view.
Note that the view must be created with the SCHEMABINDING option if you want to index it. In addition, you must use the COUNT_BIG aggregate function.
Nevertheless, after you create the view and the index, execute the aggregate query again and measure the I/O.
How many clustered indexes can you create on a table?
There can only be 1 clustered index because this is the table itself organized as a balanced tree.