Natural Sort Order in C @ jjacky.com

Natural Sort Order in C

As Jeff Atwood said back in 2007, "The default sort functions in almost every programming language are poorly suited for human consumption."

This was true then, and remains true today. Working on a project of mine, I needed to sort strings, filenames to be precise, using the famous natural sort order. Because that's how we expect things to work, "foo2" comes before "foo15"

Unfortunately, it's not as simple as one might hope. Most sort functions are indeed pretty bas at this, because while they might do a lot of great things, when it comes to comparing strings containing numbers they still have no understanding of natural order, and use what's referred to as "ASCII order".

After some research, knowing that I'm working in C and using GLib, here's what I ended up with :

Either you write your own comparison function,
Or you need to try and "workaround" the existing functions that are locale-based (e.g. strcoll(3)).

The former is the best option in terms of the control you'll have over things, obviously, but it also needs the more work. And if implementing natural sort order is easy enough, it gets a whole more complicated if you want to do locale-based comparisons.

So the later option is usually taken, since most of the work (the locale-based stuff) is done for you.

GLib, collation keys, and natural order

Using GLib, one can use g_utf8_collate_key() to get a collation key they will take into account all that the current locale needs. So it works well when it comes to handling characters with accents, or German's "ß" which should be equivalent to a double "s" or something.

That's great, but unfortunately there are limitations :

you have no control over things. For instance, it will be case sensitive or not, based on the locale. And there's no way to change it. (Well, one could force a g_utf8_casefold() before, to force case insensitivity, but (1) as the documentation says, "calling g_utf8_casefold() followed by g_utf8_collate() is only an approximation to the correct linguistic case insensitive ordering, though it is a fairly good one. Getting this exactly right would require a more sophisticated collation function that takes case sensitivity into account" and (2) this would only allow you to force case insensitivity, but you'd still have no way to force case sensitivity.
it doesn't do natural sort order, so "foo15" comes before "foo2" (even though I'm guessing in all locales, people see 2 as being lower than 15).

So you can try and work around it, as is done for instance in g_utf8_collate_key_for_filename() which aims to do things such as add natural sort order.

The way it does it that it will basically "split" the strings compared into chunks, and use the usual g_utf8_collate_key() on each chunk individually. This allows to split before a number, and make sure said numbers will be treated so that 2 comes before 15.

It does get a natural order, but it has a cost: Anything that the locale-based function would have done by treating the string as a whole is gone. As a simple example, consider the following:

Test1
tesT1
test1
tEst2
test2
test3
test15
test23

They are ordered as one would expect. But that's not what you would get with g_utf8_collate_key_for_filename() (I'm assuming here that your locale makes it be case insensitive) because due to its "splitting" of the string, the comparison basically sees all those as the same string "test" and therefore takes case sensitivity into account to perform the ordering, and that before realizing that the string isn't over, and there are numbers that follow. So what you get, is this odd order :

Test1
tEst2
tesT1
test1
test2
test3
test15
test23

Yes, there's (some) natural order going on, but that's not exactly what would be expected, is it?

Now as I said, we can also workaround this by calling g_utf8_casefold() ahead of time, thus forcing case insensitivity. But then all the "test1" strings would be considered absolutely the same, and their order would therefore be unpredictable (i.e. based on the order strings are compared/were originally ordered), which isn't necessarily ideal either.

Natural order done manually

Unless you're satisfied with the above, you'll now be looking back at option 1: doing it yourself. Originally, I went with option 2 myself, simply writing my own alternative to g_utf8_collate_key_for_filename() so that I could have options to e.g. enable or not natural order.

But as described, this still isn't entirely satisfying, and eventually I looked back at that former option.

Implementing a comparison function to handle (amongst other things) natural order is relatively easy, since all you need to do is compare numbers as such, instead of looking at their character value.

So I did create my own function, and it offers options for natural order, case sensitivity, and a few things that might be useful when you deal with filenames (e.g. putting dot files first, or mixed).

Because we don't actually "split" the strings in chunks, or because we do it without losing focus the entire string as a whole, we can deal properly with the case described above, where we don't see all those strings as "test" and a number, but a few ones all being "test1" and only for those do we fall back on case sensitivity, and so on.

And if you're dealing with basically English strings, as in mostly made of ASCII characters, it looks perfect. Problems arise when you're in a different locale, and you have things like accents to consider. I'd like to now talk about the solution I found, only I did not found one.

This is quite a complicated thing: With the former option, we fall back on g_utf8_collate_key() which itself just falls back on the good old strxfrm() to handle this. It does the job well, but we can't sneak it to add options, or natural order support.

And having to handle the whole locale-based thing on your own is just too much for me I'm afraid. (Some kind of basic character replacement could be done, e.g. replace all of "é", "ê", "ë", "è", and the others with "e" before comparing. That might give better results for (most?) easy cases, but there will surely be lots of cases where it just doesn't work, so it's really not a solution.)

Conclusion?

It's not as easy as it looks, and I don't have a solution, unfortunately. So far here's what I'm gonna go with (for now?) : both options.

It helps that I did do both already, I guess, but I just don't see one as clearly better than the other, it might just depend on what you need and which files/strings you're dealing with.

Using the collate key functions from GLib will probably give results that are pretty good most of the times, where using my own implementation of the natural sort order might give better results unless you're using a locale that isn't ASCII oriented, basically.

Anyways, that's what I gathered from my looking into this, hopefully this may have been helpful to some of you.

Natural Sorting utility

In case you are yourself dealing with this, or looking for a solution, let me share with you my implementation of the natural sort order. As I said, it's not locale-based, but it does support natural order, if strings are otherwise equal, it will use the number of leading zeros (strings with less come first, just like shorter strings come first).

In the case of case-insensitive comparison, if strings are equals then the result of case-sensitive comparison will be returned, to keep order predictable.

As I said, I'm mostly dealing with filenames myself, so there are also a couple of options that probably don't make sense outside of this, e.g. having strings starting with a dot listed first, or have them mixed together (instead of the default, which would have them between strings starting with numbers and those starting with (upper case) letters).

I'm not gonna bore you with the code here, it's on github as part of a little utility: natsort

It's a small CLI tool that will take strings from its standard input, sort them based on specified options (if any), and print them on its standard output. (And while I said I was working with GLib, this is actually not dependent on GLib.)

Have fun with all this, and if you do implement/have a better solution, as in one that both does what natsort does, but also supports locale-based comparisons, please let me know!

∞ Posted 22 March 2013

natsort

Top of Page