Most modern database systems use cost-based optimisers. Put simply, these estimate the cost or effort required to run queries in different ways. They choose the way that requires the least effort, and then proceed with it. The optimiser will choose not to use your index if it thinks that doing so will be more 'expensive' than not using it.
Since you are reading this post, I can only assume that you believe your index should be used. It follows, therefore, that there must be a reason why not; there are basically two alternatives:
- You are wrong and it is right: the optimiser has correctly ascertained that it is more expensive to use your index than not.
- It is wrong and you are right, so the problem is that something is making it believe that it will be more expensive to use the index.
Let's have a look at some of the reasons why the optimiser correctly or incorrectly decides not to use a particular index. The following sections should give you some starting points for troubleshooting indexing problems.
Functions on search arguments
Suppose we have a query like so:
SELECT Surname, FirstName FROM Contact WHERE toupper( Surname ) = 'SMITH'
I have seen cases where developers have created an index on "Surname", thinking that it will be useful for this query. Unfortunately that is not the case: the index is not useful when we are performing some sort of function on the column we are searching on. In the above example, the surname is being converted to upper case first, so any of the following could become "SMITH" once converted:
- "Smith"
- "SMITH"
- "smith"
- "SmItH"
Clearly, an index on this column would result in these entries being located in many different places, so the index is unhelpful for finding relevant rows. It is important to note that unless you are using function-based indexes, any manipulation of the column (by means of functions, addition, sub-strings etc.) is useless from an indexing point of view.
There are two main solutions to this issue:
- If your database supports function-based indexes, create an index on the expression instead of the column. In the example above, we would create an index on "toupper( Surname )" rather than on just "Surname".
- Create an additional column containing the result of the expression, and index that instead. For example, we could create a column called "UpperCaseSurname" that always holds the surname converted to upper case. We would then create an index on "UpperCaseSurname" rather than "Surname". When doing this, always take care to keep this column correct after any INSERTs or UPDATEs. Some databases allow you to create derived (computed) columns that take care of this for you.
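As an illustration of the first solution, here is a minimal sketch using SQLite (via Python's sqlite3 module), one of the databases that supports indexes on expressions. The table and column names follow the article's example; the data is invented. EXPLAIN QUERY PLAN shows whether the planner can use an index for the query:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Contact (Surname TEXT, FirstName TEXT)")
db.executemany("INSERT INTO Contact VALUES (?, ?)",
               [("Smith", "Anna"), ("SMITH", "Bob"), ("Jones", "Carol")])

# A plain index on Surname is ignored when the column is wrapped in a function.
db.execute("CREATE INDEX idx_surname ON Contact (Surname)")
plan_before = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Surname, FirstName FROM Contact WHERE upper(Surname) = 'SMITH'"
).fetchall()
print(plan_before)  # the plan reports a full table scan

# An index on the expression itself makes the same query indexable.
db.execute("CREATE INDEX idx_surname_upper ON Contact (upper(Surname))")
plan_after = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Surname, FirstName FROM Contact WHERE upper(Surname) = 'SMITH'"
).fetchall()
print(plan_after)  # now the planner can search idx_surname_upper
```

Note that SQLite will only match the expression index if the query writes the expression the same way the index defines it.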
The table is too small
This one is often overlooked. Admittedly, it is not going to give you a performance problem per se (unless you have a very high-transaction environment): accessing a very small table (i.e. 1 - 2 pages of data) is going to be fast whether you use an index or a table scan. It can, however, be puzzling that the index is not used.
Let's say our data all fits on 1 page. To perform a table scan requires just 1 logical page read. If we create an index on the data then, depending on the structure of your index, we are most likely to have a single root index page 'on top' of the data page. Now, of course, going to the data via the index requires 2 logical page reads: 1 for the index page, and 1 for the data page. 2 logical page reads > 1 logical page read; the optimiser is therefore correctly choosing the least-cost access method by performing a table scan.
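The arithmetic above can be written down as a toy cost model (pure Python, not a real optimiser's formula) to make the comparison explicit:

```python
# Toy model: logical page reads for each access method.
def scan_cost(data_pages):
    # A table scan reads every data page once.
    return data_pages

def index_cost(index_levels, matching_data_pages):
    # An index lookup reads one page per index level,
    # then each matching data page.
    return index_levels + matching_data_pages

# One-page table with a single root index page 'on top' of it:
print(scan_cost(1))      # 1 logical read for the scan
print(index_cost(1, 1))  # 2 logical reads via the index
```

With a one-page table the scan is strictly cheaper, which is exactly why the optimiser picks it.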
The query returns many rows, but is not covered
Let's look at a variation on the query above:
SELECT Surname, FirstName FROM Contact WHERE Age >= 18
Suppose we had a non-clustered index on "Age"; would the optimiser choose to use it? Well the answer in this case would be: "it depends". What it depends on is the number of rows that this query actually returns, compared to the total number of rows in the table.
In the query shown above, we are returning the "Surname" and "FirstName" columns for every contact who is at least 18 years old. The way that this is likely to be serviced if the non-clustered index is used is as follows:
- The index is used to find the first index entry (i.e. the first entry with an age of 18).
- For each index entry, the corresponding data page will be read to get the surname and first name, and return it.
- The index page that we came from is read again, and we move on to the next index entry.
As you can see, because we are constantly moving backwards and forwards between the index leaf level and the data, the total number of page reads will in fact be around double the number of matching rows. So as the number of rows returned by the query increases, the optimiser will at some point correctly decide that a table scan is cheaper than using the index.
If we were to use a different age in our search criteria (e.g. ">= 90"), then the index probably will be used, since the number of matching entries is likely to be very small.
So there are a couple of ways of dealing with this scenario:
- Create a clustered index (index only table) instead of a non-clustered one. Now the data and the leaf index level are the same thing, so there is no switching back and forth. This assumes, of course, that you don't already have a clustered index that is required by a different query.
- Expand the non-clustered index to include both "Surname" and "FirstName" (keeping "Age" as the first column). Now the leaf index level contains all the information that is required by the query, i.e. 'covers' it, which means that the data pages do not need to be read.
The table join order is not what you expect
Even some of the most hardened developers suddenly break into a cold sweat when you start talking about join order. One developer I worked with even admitted that his approach to dealing with tricky queries was to just keep trying different indexes until one worked; he never even considered that he should be worrying about the join order.
Nested-loop joins are the bread-and-butter join strategy for databases. There are other, more fanciful types such as merge joins, but more often than not the optimiser in your database will choose a standard nested-loop join. The way these are processed is fairly straightforward: for every row that matches the search criteria in the first table, it will find every row that matches in the second table, and then the third, etc. When it joins from one table to the next, it will use all the information that it knows from the preceding tables.
This means that the index will only be used for a join if it is useful, based on the information that is already known from the search arguments and previous tables in the join order. Quite often, however, we make assumptions about the join order which turn out to be incorrect. If the index is useless for the actual join order, then it will not be used.
So verify that the join order chosen by the optimiser is the one you want, and that the indexes on your tables are suitable for that join order.
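Most databases will show you the chosen join order in the query plan. As a sketch, here is how to read it out of SQLite (the table names are invented for the example); the plan lists one line per table, in the order the planner visits them, with the inner table probed via an index:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Customer (ID INTEGER PRIMARY KEY, Name TEXT)")
db.execute("CREATE TABLE Orders (ID INTEGER PRIMARY KEY, "
           "CustomerID INTEGER, Total REAL)")
db.execute("CREATE INDEX idx_orders_cust ON Orders (CustomerID)")

plan = db.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT c.Name, o.Total FROM Customer c "
    "JOIN Orders o ON o.CustomerID = c.ID"
).fetchall()
for row in plan:
    # One table is SCANned (the outer table of the nested loop),
    # the other is SEARCHed through an index (the inner table).
    print(row)
```

If the SEARCH line is not using the index you expected, the join order is not the one you designed for.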
The statistics are out of date
I often come across developers who seem to think that there is some kind of magic going on inside database engines, or that somehow optimisers have a powerful artificial intelligence that allows them to creatively think of unique and novel ways to access their data.
The fact is, optimisers don't have a crystal ball, so in all but the most trivial cases they can never know what the actual cost of running a query is going to be until after they've actually run it. The optimiser can only estimate the amount of effort required to execute a query. So how does it do that? It uses statistics.
Statistics provide various bits of information about the data itself; the following are examples of the type of statistics that may be kept by your database engine:
- The total number of rows in a table
- The total number of data pages occupied by a table
- The degree of fragmentation of a table (how contiguous the data is on disc)
- The total number of pages in the leaf (lowest) level of an index
- The degree of fragmentation of the leaf level of an index
- The number of levels in an index
The more sophisticated the optimiser, the more advanced its techniques for cost estimation. Some, for example, will also factor in the likely ratio of cached to non-cached pages for a particular access method. These statistics can be expensive to create and maintain, so many of them will not be maintained on an on-going (real-time) basis. This is fine for relatively static data (look-up tables etc.), but for data that changes often, it is easy for statistics to become out of date.
It is imperative, therefore, to update your statistics regularly, preferably with an automated job that runs during quiet times such as overnight or at weekends. Some databases allow you to determine when the data has changed by a particular amount (e.g. 10%), which is useful if you don't want to update the statistics unnecessarily. Some even allow you to load the statistics in manually; this means that you can take a back-up of your database, load it into a secondary server, update the statistics on that one, and then load them back into your primary database.
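The exact command varies by engine (UPDATE STATISTICS, ANALYZE TABLE, and so on); as a concrete sketch, in SQLite the ANALYZE statement gathers statistics into the sqlite_stat1 table, and this is the step a scheduled job would run during quiet times:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Contact (Surname TEXT, Age INTEGER)")
db.executemany("INSERT INTO Contact VALUES (?, ?)",
               [("Smith", a) for a in range(100)])
db.execute("CREATE INDEX idx_age ON Contact (Age)")

db.execute("ANALYZE")  # refresh the statistics
stats = db.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # row counts and per-index selectivity the planner will now use
```

Until ANALYZE has run, the planner is guessing; afterwards it has real row counts and index selectivity to cost its plans with.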
Out-of-date statistics are a classic cause of indexes being ignored (or even the wrong indexes being used), especially if the amount of data in a table has changed significantly since the index was first created. So whatever method you use to do it, always make sure your statistics are up to date.
The statistics are unhelpful because the key distribution is not even
One of the statistics that database systems often keep is the distribution of values in an index key. So in the example I used previously, if we have an index on "Age" the optimiser keeps statistics that allow it to estimate how many rows of a table will be returned for various search criteria such as "Age > 18", "Age between 20 and 30" etc.
These statistics are generally stored as a set of samples of the data, and the optimiser will perform various calculations and interpolations to estimate how many rows match the search criteria. This method works best if the key values are uniformly distributed: if dates, unique ids, strings, etc. are spread evenly throughout the range of values.
Occasionally, however, this does not happen. An excellent example of non-uniform distribution of key values I once came across was that of international mobile telephone numbers. These were prefixed with the country codes: 44 (UK), 34 (Spain), 49 (Germany), etc., and so there were 'islands' of values clustered around the mobile STD codes within those countries.
This made it difficult for the optimiser to correctly interpolate the number of rows that matched any particular query. Indeed, its estimates were way off most of the time, resulting in it failing to choose the correct index. There are a couple of ways around this problem:
- Some database systems allow you to increase the number of sample points (or histograms) that are used for estimating key value distribution, thus providing better estimates for skewed data. This was the solution I ultimately employed for the telephone number problem - in fact I had to raise it from 20 to 200 before the estimates were close enough for it to select the correct index.
- Try changing the index so that even with estimates that are incorrect, the optimiser still chooses to use them: create either a clustered or a covering non-clustered one.
Always compare the optimiser's estimate of the number of matching rows to the actual number. If they are drastically different, then this may be the reason that your index is not being used.
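To see why clustered keys defeat uniform interpolation, here is a toy sketch (pure Python, not a real optimiser) using two 'islands' of phone-number-like keys with a large empty gap between the country prefixes:

```python
# Two clusters of keys: 1000 numbers with prefix 44 (UK),
# 1000 with prefix 49 (Germany), and nothing in between.
uk = [440_000_000_000 + i for i in range(1000)]
de = [490_000_000_000 + i for i in range(1000)]
keys = uk + de

lo, hi = min(keys), max(keys)
q_lo, q_hi = 450_000_000_000, 489_999_999_999  # a range falling in the gap

# Uniform-distribution estimate: assume keys are spread evenly over [lo, hi].
estimate = len(keys) * (q_hi - q_lo) / (hi - lo)
actual = sum(q_lo <= k <= q_hi for k in keys)
print(round(estimate), actual)  # roughly 1600 estimated vs 0 actual rows
```

An estimate of ~1600 rows for a range that actually matches none is exactly the kind of error that sends the optimiser to the wrong plan; more sample points shrink the gaps between samples and tame it.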
The statistics are unhelpful because there is a join
The key value distribution statistics mentioned in the previous section are great when we know the actual value we are looking for, but not so great when it comes to joins.
If I were to join to the "Contact" table mentioned previously using "Age" for example, the optimiser has no idea how many rows the join is likely to return, since it won't know the actual age until the query is running. This is obviously a problem, and different RDBMSs get around it in different ways:
- Some will work out the average number of rows that would be returned using an equality join. Obviously this is not much use if the join uses a range expression.
- Some will use default values as estimates. For example: an equality join matches 10% of the table, a closed range (e.g. BETWEEN clause) matches 25% etc. Clearly, these are massive generalisations that can be extremely inaccurate.
This is not an easy problem to fix, but two possible things you can try are:
- If your database calculates an average number of matching rows, try to ensure an even distribution of key values and keep your statistics up to date.
- Try making your index compelling for the optimiser even with any inaccurate estimates, again by creating a clustered or covering non-clustered index.
The index is wrong
Well it's possible: you may have completely misunderstood the query and/or indexing and created something which is completely useless. Probably best to get someone else to check it, just to be sure!
I hope you find these insights useful when troubleshooting your indexing problems. There are probably some more obscure issues that I have not mentioned here, but these are certainly the most common ones that I have encountered. Good luck!