Regex – How does Stack Overflow generate its SEO-friendly URLs

friendly-urllanguage-agnosticregexseoslug

What is a good complete regular expression or some other process that would take the title:

How do you change a title to be part of the URL like Stack Overflow?

and turn it into

how-do-you-change-a-title-to-be-part-of-the-url-like-stack-overflow

that is used in the SEO-friendly URLs on Stack Overflow?

The development environment I am using is Ruby on Rails, but if there are some other platform-specific solutions (.NET, PHP, Django), I would love to see those too.

I am sure I (or another reader) will come across the same problem on a different platform down the line.

I am using custom routes, and I mainly want to know how to alter the string to all special characters are removed, it's all lowercase, and all whitespace is replaced.

Best Solution

Here's how we do it. Note that there are probably more edge conditions than you realize at first glance.

This is the second version, unrolled for 5x more performance (and yes, I benchmarked it). I figured I'd optimize it because this function can be called hundreds of times per page.

/// <summary>
/// Produces optional, URL-friendly version of a title, "like-this-one". 
/// hand-tuned for speed, reflects performance refactoring contributed
/// by John Gietzen (user otac0n) 
/// </summary>
public static string URLFriendly(string title)
{
    if (title == null) return "";

    const int maxlen = 80;
    int len = title.Length;
    bool prevdash = false;
    var sb = new StringBuilder(len);
    char c;

    for (int i = 0; i < len; i++)
    {
        c = title[i];
        if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
        {
            sb.Append(c);
            prevdash = false;
        }
        else if (c >= 'A' && c <= 'Z')
        {
            // tricky way to convert to lowercase
            sb.Append((char)(c | 32));
            prevdash = false;
        }
        else if (c == ' ' || c == ',' || c == '.' || c == '/' || 
            c == '\\' || c == '-' || c == '_' || c == '=')
        {
            if (!prevdash && sb.Length > 0)
            {
                sb.Append('-');
                prevdash = true;
            }
        }
        else if ((int)c >= 128)
        {
            int prevlen = sb.Length;
            sb.Append(RemapInternationalCharToAscii(c));
            if (prevlen != sb.Length) prevdash = false;
        }
        if (i == maxlen) break;
    }

    if (prevdash)
        return sb.ToString().Substring(0, sb.Length - 1);
    else
        return sb.ToString();
}

To see the previous version of the code this replaced (but is functionally equivalent to, and 5x faster), view revision history of this post (click the date link).

Also, the RemapInternationalCharToAscii method source code can be found here.

Related Question