Testing Character Types With Look-up Tables

Look-up tables are an ideal way of testing whether characters are of a particular type, and are commonly used in C for functions such as isupper, isdigit etc. You can also use them for your own custom classification tests.

The ASCII character set is particularly amenable to look-up tables, since it is relatively small and requires a table of just 128 entries.

Supposing we are writing an application in C, and want to imitate the isxdigit function, which returns true (i.e. a non-zero value) if the character is a valid hexadecimal digit. The simplest way of performing this test is as follows:

/* ... */ char c; /* ... */ if ( ( c >= '0' && c <= '9' ) || ( c >= 'A' && c <= 'F' ) || ( c >= 'a' && c <= 'f' ) ) { /* The character is a valid hexadecimal digit - do something as a result */ } /* ... */ 

If we are being good modular coders, then we would want to do something slightly more elegant, and potentially wrap it up in a function:

/* ... */ int ishexdigit( char ); /* ... */ char c; /* ... */ if ( ishexdigit( c ) ) { /* The character is a valid hexadecimal digit - do something as a result */ } /* ... */ /** * Returns: *   true, if the supplied character is a valid hexadecimal digit, *   false, otherwise. */ int ishexdigit( char c ) { if ( ( c >= '0' && c <= '9' ) || ( c >= 'A' && c <= 'F' ) || ( c >= 'a' && c <= 'f' ) ) return ( 1 ); return ( 0 ); } /* ... */ 

If you want to avoid the overhead of a function call, the temptation may be to create a macro to effectively in-line the test. Be aware, however, that this will cause problems with pre and post operators:

/* ... */ #define ISXDIGIT(c) ( ( (c) >= '0' && (c) <= '9' ) || \ ( (c) >= 'A' && (c) <= 'F' ) || \ ( (c) >= 'a' && (c) <= 'f' ) ? 1 : 0 ) /* ... */ char *pC; /* * The following test is WRONG; it is equivalent to: * *   ( ( (*(pC++)) >= '0' && (*(pC++)) <= '9' ) || *     ( (*(pC++)) >= 'A' && (*(pC++)) <= 'F' ) || *     ( (*(pC++)) >= 'a' && (*(pC++)) <= 'f' ) ? 1 : 0 ) * */ if ( ISXDIGIT( *( pC++ ) ) ) { /* Who knows what will happen now?! */ } /* ... */ 

One way to increase the performance of a function-based solution is to in-line the function (e.g. using the inline keyword and/or getting the optimiser do it). However, another way to improve the speed of the test, without getting problems with pre or post operators, is to use look-up tables.

In order to use a look-up table, we must first create a 128-entry table - one for each ASCII character.

/* ... */ const int isXDigitLookupTable[128] = { /* decimal  character    is hex digit? */ /*     0       NUL    */      0,  /*     1       SOH    */      0, /*   2 - 10           */         0, 0, 0, 0, 0, 0, 0, 0, 0, /*  11 - 20           */      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  21 - 30           */      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  31 - 40           */      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  41 - 46           */      0, 0, 0, 0, 0, 0, /*    47        /     */      0, /*    48        0     */      1, /*    49        1     */      1, /*    50        2     */      1, /*    51        3     */      1, /*    52        4     */      1, /*    53        5     */      1, /*    54        6     */      1, /*    55        7     */      1, /*    56        8     */      1, /*    57        9     */      1, /*    58        :     */      0, /*  59 - 60           */                              0, 0, /*  61 - 63           */      0, 0, 0, /*    64        @     */      0, /*    65        A     */      1, /*    66        B     */      1, /*    67        C     */      1, /*    68        D     */      1, /*    69        E     */      1, /*    70        F     */      1, /*    71        G     */      0, /*  72 - 80           */         0, 0, 0, 0, 0, 0, 0, 0, 0, /*  81 - 90           */      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  91 - 95           */      0, 0, 0, 0, 0, /*    96        `     */      0, /*    97        a     */      1, /*    98        b     */      1, /*    99        c     */      1, /*   100        d     */      1, /*   101        e     */      1, /*   102        f     */      1, /*   103        g     */      0, /* 104 - 110          */               0, 0, 0, 0, 0, 0, 0, /* 111 - 120          */      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 121 - 125          */      0, 0, 0, 0, 0, /*   126        ~     */      0, /*   127       DEL    */      0 }; #define ISXDIGIT(c) ( isXDigitLookupTable[ c ] ) /* ... */ char *pC; /* ... */ if ( ISXDIGIT( *( pC++ ) ) ) { /* The character is a valid hexadecimal digit - do something as a result */ } /* ... */ 

Now, instead of performing three separate range tests we are performing a single table look-up. Remember the key point about look-up tables is that they are useful if the cost of a memory look-up is lower than the cost of performing the actual calculation (in this case the three separate range tests), and when there is little impact of allocating the memory used by the table. The latter consideration is generally only relevant for very large look-up tables, or resource-constrained (e.g. embedded) systems.

If we had a number of different classifications that we wanted tests for, we could make the table more efficient. Currently every entry takes up a whole integer (usually 32 bits) but only one bit is used. We can combine categories by building up a bit mask for each entry. As an example, here is a table that allows us to test for either decimal or hexadecimal digits:

/* ... */ #define DIGIT_BIT  1 #define XDIGIT_BIT 2 const int charLookupTable[128] = { /* decimal char   classification          */ /*    0    NUL */         0 |          0, /* Neither decimal nor hex digit */ /*    1    SOH */         0 |          0, /*   2 - 10    */    0, 0, 0, 0, 0, 0, 0, 0, 0, /*  11 - 20    */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  21 - 30    */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  31 - 40    */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  41 - 46    */ 0, 0, 0, 0, 0, 0, /*   47     /  */         0 |          0, /*   48     0  */ DIGIT_BIT | XDIGIT_BIT, /* Both decimal and hex digit */ /*   49     1  */ DIGIT_BIT | XDIGIT_BIT, /*   50     2  */ DIGIT_BIT | XDIGIT_BIT, /*   51     3  */ DIGIT_BIT | XDIGIT_BIT, /*   52     4  */ DIGIT_BIT | XDIGIT_BIT, /*   53     5  */ DIGIT_BIT | XDIGIT_BIT, /*   54     6  */ DIGIT_BIT | XDIGIT_BIT, /*   55     7  */ DIGIT_BIT | XDIGIT_BIT, /*   56     8  */ DIGIT_BIT | XDIGIT_BIT, /*   57     9  */ DIGIT_BIT | XDIGIT_BIT, /*   58     :  */         0 |          0, /*  59 - 60    */                         0, 0, /*  61 - 63    */ 0, 0, 0, /*   64     @  */         0 |          0, /*   65     A  */         0 | XDIGIT_BIT, /* Just hex digit */ /*   66     B  */         0 | XDIGIT_BIT, /*   67     C  */         0 | XDIGIT_BIT, /*   68     D  */         0 | XDIGIT_BIT, /*   69     E  */         0 | XDIGIT_BIT, /*   70     F  */         0 | XDIGIT_BIT, /*   71     G  */         0 |          0, /*  72 - 80    */    0, 0, 0, 0, 0, 0, 0, 0, 0, /*  81 - 90    */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /*  91 - 95    */ 0, 0, 0, 0, 0, /*   96     `  */         0 |          0, /*   97     a  */         0 | XDIGIT_BIT, /* Just hex digit */ /*   98     b  */         0 | XDIGIT_BIT, /*   99     c  */         0 | XDIGIT_BIT, /*  100     d  */         0 | XDIGIT_BIT, /*  101     e  */         0 | XDIGIT_BIT, /*  102     f  */         0 | XDIGIT_BIT, /*  103     g  */         0 |          0, /* 104 - 110   */          0, 0, 0, 0, 0, 0, 0, /* 111 - 120   */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, /* 121 - 125   */ 0, 0, 0, 0, 0, /*  126     ~  */         0 |          0, /*  127    DEL */         0 |          0 }; #define ISDIGIT(c)  ( charLookupTable[ c ] & DIGIT_BIT  ) #define ISXDIGIT(c) ( charLookupTable[ c ] & XDIGIT_BIT ) /* ... */ char *pC; /* ... */ if ( ISDIGIT( *pC ) ) { /* The character is a valid decimal digit - do something as a result */ } if ( ISXDIGIT( *pC ) ) { /* The character is a valid hexadecimal digit - do something as a result */ } /* ... */ 

Remember that, as always, there is a trade-off: the more efficient table comes at the price of an extra operation - the logical AND ('&') when we are testing for the bit we are interested in. If you wanted to squeeze every ounce of speed out of the test and had plenty of memory to spare, you could create distinct look-up tables for each classification.

Note also that you will need to ensure that your table element - here the character value c - is in range so that you don't try and access memory outside the table. The same caveat applies if you are developing in Java, which uses a Unicode character set: you can use look-up tables if you are sure (or ensure) that the characters are restricted to just the ASCII ones, and thus do not stray out of the range 0 - 127. Other things to remember for Java look-up tables are:

  • The if clause requires a boolean result, so if the table is for a single character type each element must be of type boolean.
  • If you are combining multiple types into a single table, you must test for a non-zero result; you might use an expression such as: ( charLookupTable[ c ] & DIGIT_BIT ) != 0. C, of course, automatically treats a non-zero value as being "true".
  • Macros aren't available, so the look-up must be done either in-line or in a function.
  • You will get an array out of bounds exception if your character value is out of range, rather than a random result or segmentation fault as you would in C.

Back to top

Subscribe

To receive updates, and to get your free guide:

SQL Tips and Tricks