You're very pleased with yourself: you've created your index, and are feeling supremely confident that your query is now going to run much faster. But something's wrong. The query is still slow, and your index is not being used at all. What's going on? It can be confusing, frustrating, and sometimes downright annoying, but there are plenty of simple reasons why your bright shiny index gets ignored.
Most modern database systems use cost-based optimisers. Put simply, these estimate the cost or effort required to run queries in different ways. They choose the way that requires the least effort, and then proceed with it. The optimiser will choose not to use your index if it thinks that doing so will be more 'expensive' than not using it.
Since you are reading this post, I can only assume that you believe your index should be used. It follows, therefore, that there must be a reason why it is not, and there are basically two alternatives: either the optimiser is right, and the index genuinely is not the cheapest way to service the query; or the optimiser is wrong, and it has mis-costed the alternatives (usually because the information it is working from is inaccurate).
Let's have a look at some of the reasons why the optimiser correctly or incorrectly decides not to use a particular index. The following sections should give you some starting points for troubleshooting indexing problems.
Suppose we have a query like so:
SELECT Surname, FirstName FROM Contact WHERE toupper(Surname) = 'SMITH'
I have seen cases where developers have created an index on "Surname", thinking that it will be useful for the query. Unfortunately that is not the case: the index is not useful when we are performing some sort of function on the column we are searching on. In the above example, the surname is being converted to upper case first. So any of the following could become "SMITH" once converted to upper case: "Smith", "smith", "SMITH", "sMiTh", and so on.
Clearly, an index on this column would result in these entries being located in many different places, so the index is unhelpful for finding relevant rows. It is important to note that unless you are using function-based indexes, any manipulation of the column (by means of functions, addition, sub-strings etc.) is useless from an indexing point of view.
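To see this in action, here is a minimal sketch using SQLite from Python (the table and index names are illustrative, not from any real schema); EXPLAIN QUERY PLAN reveals whether the index is used:

```python
import sqlite3

# A minimal sketch in SQLite (table/index names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Surname TEXT, FirstName TEXT)")
conn.execute("CREATE INDEX idx_surname ON Contact (Surname)")
conn.executemany("INSERT INTO Contact VALUES (?, ?)",
                 [("Smith", "Anna"), ("SMITH", "Bob"), ("Jones", "Carol")])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row describes the access path
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the indexed column: full scan, index ignored
print(plan("SELECT Surname, FirstName FROM Contact WHERE upper(Surname) = 'SMITH'"))
# Bare column comparison: the index can be used
print(plan("SELECT Surname, FirstName FROM Contact WHERE Surname = 'Smith'"))
```

The first plan reports a scan of the table; only the second mentions the index.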
There are two main solutions to this issue: either remove the need for the function by cleaning the data (for example, store surnames in a known case, or keep a separate column holding the upper-case value and index that); or, if your DBMS supports them, create a function-based index on toupper(Surname), so that the manipulated value itself is what gets indexed.
This one is often overlooked. OK, so it is not going to give you a performance problem per se (unless you have a very high-transaction environment): accessing a very small table (i.e. 1 - 2 pages of data) is going to be fast whether you use an index or a table scan, but it can still be confusing to see your index being ignored.
Let's say our data all fits on 1 page. To perform a table scan requires just 1 logical page read. If we create an index on the data then depending on the structure of your index, we are most likely to have a single root index page 'on top' of the data page. Now, of course, going to the data via the index requires 2 logical page reads: 1 for the index page, and 1 for the data page. 2 logical page reads > 1 logical page read; the optimiser is therefore correctly choosing the least cost access method by performing a table scan.
Let's look at a variation on the query above:
SELECT Surname, FirstName FROM Contact WHERE Age >= 18
Suppose we had a non-clustered index on "Age"; would the optimiser choose to use it? Well the answer in this case would be: "it depends". What it depends on is the number of rows that this query actually returns, compared to the total number of rows in the table.
In the query shown above, we are returning the "Surname" and "FirstName" columns for every contact who is at least 18 years old. If the non-clustered index is used, the query is likely to be serviced as follows: find the first matching entry in the index leaf level; follow its pointer to the data page to fetch the row; return to the index leaf level for the next matching entry; and repeat until the entries no longer match.
As you can see, because we are constantly moving backwards and forwards between the index leaf level and the data, the total number of page reads will in fact be around double the number of matching rows. So as the number of rows returned by the query increases, the optimiser will at some point correctly decide that a table scan is in fact cheaper than using the index.
If we were to use a different age in our search criteria (e.g. ">= 90"), then the index probably will be used, since the number of matching entries is likely to be very small.
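The trade-off above can be sketched as simple arithmetic; the 2-reads-per-row figure and the page count below are illustrative assumptions, not measurements:

```python
# Rough back-of-envelope model of the optimiser's choice, assuming 2 logical
# reads per matching row via the index (leaf page + data page) and one read
# per page for a full table scan. All numbers are illustrative.
def index_cost(matching_rows):
    return 2 * matching_rows   # index leaf read + data page read per row

def scan_cost(table_pages):
    return table_pages         # one logical read per page

table_pages = 10_000
print(index_cost(300) < scan_cost(table_pages))     # True: highly selective, index wins
print(index_cost(50_000) < scan_cost(table_pages))  # False: low selectivity, scan wins
```

The break-even point in this toy model is reached as soon as the number of matching rows exceeds half the number of pages in the table.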
So there are a couple of ways of dealing with this scenario: create a covering index that includes "Surname" and "FirstName" as well as "Age", so that the query can be satisfied entirely from the index leaf level without visiting the data pages; or simply accept that, for low-selectivity predicates like this one, a table scan really is the cheapest plan.
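As a sketch of the covering-index approach in SQLite (index and column names are illustrative), note the COVERING INDEX in the plan, which means the data pages are never touched:

```python
import sqlite3

# Sketch of a covering index in SQLite (index/column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Surname TEXT, FirstName TEXT, Age INTEGER)")
conn.execute("CREATE INDEX idx_age_name ON Contact (Age, Surname, FirstName)")

rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Surname, FirstName FROM Contact WHERE Age >= 18").fetchall()
detail = " ".join(r[-1] for r in rows)
print(detail)  # the plan reports a COVERING INDEX: no data pages are read
```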
Even some of the most hardened developers suddenly break into a cold sweat when you start talking about join order. One developer I worked with even admitted that his approach to dealing with tricky queries was to just keep trying different indexes until one worked; he never even considered that he should be worrying about the join order.
Nested-loop joins are the bread-and-butter join strategy for databases. There are other, more fanciful types such as merge joins, but more often than not the optimiser in your database will choose a standard nested-loop join. The way these are processed is fairly straightforward: for every row that matches the search criteria in the first table, it will find every row that matches in the second table, and then the third, etc. When it joins from one table to the next, it will use all the information that it knows from the preceding tables.
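The mechanism can be sketched in a few lines of Python; real engines operate on pages and indexes rather than in-memory lists, so this is purely illustrative:

```python
# A minimal sketch of a nested-loop join over in-memory rows.
contacts = [("Smith", 1), ("Jones", 2)]
orders = [(1, "Book"), (1, "Pen"), (2, "Lamp")]

def nested_loop_join(outer, inner, match):
    results = []
    for o in outer:        # for every qualifying row in the first table...
        for i in inner:    # ...find every matching row in the second table
            if match(o, i):
                results.append(o + i)
    return results

joined = nested_loop_join(contacts, orders, lambda c, o: c[1] == o[0])
print(joined)
# [('Smith', 1, 1, 'Book'), ('Smith', 1, 1, 'Pen'), ('Jones', 2, 2, 'Lamp')]
```

An index on the inner table's join column is what lets a real engine replace that inner scan with a cheap lookup, which is why the join order determines which indexes are useful.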
This means that the index will only be used for a join if it is useful, based on the information that is already known from the search arguments and previous tables in the join order. Quite often, however, we make assumptions about the join order which turn out to be incorrect. If the index is useless for the actual join order, then it will not be used.
So verify that the join order chosen by the optimiser is the one you want, and that the indexes on your tables are suitable for that join order.
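One way to check is to ask the optimiser directly. Here is a sketch using SQLite's EXPLAIN QUERY PLAN (table and index names are illustrative); the outer table of the nested loop appears first in the plan, and the inner table is probed via the index:

```python
import sqlite3

# Sketch: ask SQLite which join order it chose (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (Id INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Orders (CustomerId INTEGER, Item TEXT)")
conn.execute("CREATE INDEX idx_orders_cust ON Orders (CustomerId)")

plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT Name, Item FROM Customer JOIN Orders ON Orders.CustomerId = Customer.Id"
).fetchall()
for row in plan_rows:
    print(row[-1])  # one table is scanned (outer), the other probed (inner)
```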
I often come across developers who seem to think that there is some kind of magic going on inside database engines, or that somehow optimisers have a powerful artificial intelligence that allows them to creatively think of unique and novel ways to access their data.
The fact is, optimisers don't have a crystal ball, so in all but the most trivial cases they can never know what the actual cost of running a query is going to be until after they've actually done it. The optimiser can only estimate the amount of effort required to execute a query. So how does it do it? It uses statistics.
Statistics provide various bits of information about the data itself; the following are examples of the type of statistics that may be kept by your database engine: the number of rows and pages in each table; the number of distinct values in a key (its 'density'); and the distribution of values within an index key.
The more sophisticated the optimiser, the more advanced the techniques for performing cost estimation. Some, for example, will also factor in the likely ratio of cached to non-cached pages for a particular access method. These statistics can be expensive to create and maintain, so many of them will not be maintained on an on-going (real-time) basis. This is fine for relatively static data: look-up tables etc., but for data that changes often, it is easy for statistics to become out of date.
It is imperative, therefore, to update your statistics regularly, preferably with an automated job that runs during quiet times such as overnight or at weekends. Some databases allow you to determine when the data has changed by a particular amount (e.g. 10%), which is useful if you don't want to update the statistics unnecessarily. Some even allow you to load the statistics in manually; this means that you can take a back-up of your database, load it into a secondary server, update the statistics on that one, and then load them back into your primary database.
Out-of-date statistics are a classic cause of indexes being ignored (or even the wrong indexes being used), especially if the amount of data in a table has changed significantly since the index was first created. So whatever method you use to do it, always make sure your statistics are up to date.
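As a concrete sketch, SQLite's equivalent of 'update statistics' is the ANALYZE command, which writes its findings into the sqlite_stat1 table (table and index names below are illustrative):

```python
import sqlite3

# Sketch: SQLite gathers optimiser statistics with ANALYZE and stores them
# in the sqlite_stat1 table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Contact (Surname TEXT)")
conn.execute("CREATE INDEX idx_surname ON Contact (Surname)")
conn.executemany("INSERT INTO Contact VALUES (?)",
                 [("Smith",), ("Jones",), ("Smith",)])

conn.execute("ANALYZE")
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)  # one row per index: (table, index, row-count statistics)
```

Scheduling a statement like this to run during quiet periods is exactly the kind of automated job described above.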
One of the statistics that database systems often keep is the distribution of values in an index key. So in the example I used previously, if we have an index on "Age" the optimiser keeps statistics that allow it to estimate how many rows of a table will be returned for various search criteria such as "Age > 18", "Age between 20 and 30" etc.
These statistics are generally stored as a set of samples of the data, and the optimiser will perform various calculations and interpolations to estimate how many rows match the search criteria. This method works best if the key values are uniformly distributed: if dates, unique ids, strings, etc. are spread evenly throughout the range of values.
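The interpolation idea can be sketched with a toy equi-height histogram in Python; the sample boundaries and row counts below are made up for illustration, and the estimate is only accurate when values really are spread uniformly:

```python
import bisect

# Toy equi-height histogram: 'samples' are sorted boundary values, and each
# bucket is assumed to hold an equal share of the rows (made-up numbers).
def estimate_rows_at_least(samples, total_rows, value):
    # Fraction of samples below 'value' approximates the fraction of rows below it
    fraction_below = bisect.bisect_left(samples, value) / len(samples)
    return round(total_rows * (1 - fraction_below))

age_samples = [5, 18, 25, 33, 42, 51, 60, 74, 81, 90]
print(estimate_rows_at_least(age_samples, 100_000, 18))  # 90000: most rows qualify
print(estimate_rows_at_least(age_samples, 100_000, 90))  # 10000: few rows qualify
```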
Occasionally, however, this does not happen. An excellent example of non-uniform distribution of key values I once came across was that of international mobile telephone numbers. These were prefixed with the country codes: 44 (UK), 34 (Spain), 49 (Germany), etc., and so there were 'islands' of values clustered around the mobile STD codes within those countries.
This made it difficult for the optimiser to correctly interpolate the number of rows that matched any particular query. Indeed, its estimates were way off most of the time, resulting in it failing to choose the correct index. There are a couple of ways around this problem: increase the number of samples (or histogram steps) kept in the distribution statistics, if your DBMS allows it; or, as a last resort, use optimiser hints to force the index that you know to be correct.
Always compare the optimiser's estimate of the number of matching rows to the actual number. If they are drastically different, then this may be the reason that your index is not being used.
The key value distribution statistics mentioned in the previous section are great when we know the actual value we are looking for, but not so great when it comes to joins.
If I were to join to the "Contact" table mentioned previously using "Age", for example, the optimiser has no idea how many rows the join is likely to return, since it won't know the actual age until the query is running. This is obviously a problem, and different RDBMSs get around it in different ways; most keep some form of 'density' statistic (the average number of rows per distinct key value) and use that as the estimate when the actual value is unknown.
This is not an easy problem to fix, but two possible things you can try are: keeping your statistics (and in particular the density information) up to date, so that the average-case estimates are at least accurate; and, where your DBMS supports them, using hints to force the join order or index choice that you know performs well.
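The 'density' idea can be sketched as trivial arithmetic; the numbers below are illustrative:

```python
# Toy 'density' statistic: with no idea of the actual join value, estimate
# the fan-out as total rows divided by distinct key values.
def density_estimate(total_rows, distinct_keys):
    return total_rows / distinct_keys

# e.g. 1,000,000 contacts spread across 80 distinct ages:
print(density_estimate(1_000_000, 80))  # 12500.0 rows expected per joined age
```

This average is a reasonable guess for uniformly distributed keys, but for skewed data (like the telephone numbers above) it can be badly wrong for any individual value.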
Well it's possible: you may have completely misunderstood the query and/or indexing and created something which is completely useless. Probably best to get someone else to check it, just to be sure!
I hope you find these insights useful when troubleshooting your indexing problems. There are probably some more obscure issues that I have not mentioned here, but these are certainly the most common ones that I have encountered. Good luck!