Sorting Lists, Ignoring the Definite Article 2

Posted by James Wilford Wed, 08 Dec 2004 10:28:00 GMT

Contributed by Dan Eastwell

You'd think that often people would need to sort lists ignoring the definite article 'the'. Film titles, band names, Sherlock Holmes cases. But a search of the web brings precious little, especially as you're searching for 'the'. Which is practically everywhere.

How do you order a list, e.g. of bands, whilst ignoring 'The': The Hives, The Strokes, etc, whilst keeping order with other bands such as Pram and Blur. I searched high and low for it, but with special thanks to James, we've come up with an answer.

A simple response to the problem would be to fetch the results and then sort using SQL having stripped the starting 'the'. However, in the project I'm working on currently, there's a need to allow for paging.

This means using SQL to strip the 'the' and using LIMIT to select only the current page of results.

You can order MySQL expressions according to other expressions, so, instead of a PHP function, instead of a new column with the same entry but with the 'the' removed (both of which I found as answers), you make use of the expression clause in ORDER BY.

The expression I first attempted was using REGEXP, which allows you to search or order, etc according to a regular expression. To start, I tried to order based on whether or not an artist contains, simply a letter. I had

SELECT 'Blur' REGEXP 'b';

That's true, hence => 1; is the results set (Note: REGEXP is not case sensitive for strings (i.e. non-binary). In order to do this, you need to cast the search parameter as binary, see the MySQL manual).

SELECT 'The Hives' REGEXP 'The ';

Is also true. When I put it in my ORDER BY expression


SELECT *
FROM artists
ORDER BY (
artist REGEXP 'The '
);

I got this result:


Blur
Pram
The Hives
The Strokes

The problem here is that the ORDER BY expresion orders first by things that don't match, then those that do match the regular expression. Since REGEXP only ever returns 0 or 1, you won't get the sorted list you're looking for. The master stroke comes in thinking that we just want to order the names of the bands, no matter if there's a 'The ' in there or not. This means, that for the purposes of the ORDER BY, we can trim the beginning 'The '.

This is where James' master-stroke comes in. We can use MySQL's string functions to trim only the parameter of the ORDER BY clause leaving the results intact! Happily, a function to do this exists in MySQL: TRIM.

The code:


SELECT artist 
FROM artists 
ORDER BY TRIM(
LEADING 'The ' FROM artist);

(Note: don't leave a space between TRIM and the open parentheses: you'll get a syntax error!)

What this code does is a simple SELECT, then orders the results by a function acting on 'artist'. TRIM will either slice off the LEADING or TRAILING or BOTH examples of a removal string: in this case 'The '. (For the record, if you don't specify LEADING or TRAILING or BOTH and the removal string, TRIM will trim the whitespace on either side. see the MySQL manual)



This gives us what we're looking for:


Blur
The Hives
Pram
The Strokes

And devotees of the bands are happy!

Dan Eastwell.