Python – Turn a string into a valid filename

filenamespythonsanitizeslug

I have a string that I want to use as a filename, so I want to remove all characters that wouldn't be allowed in filenames, using Python.

I'd rather be strict than otherwise, so let's say I want to retain only letters, digits, and a small set of other characters like "_-.() ". What's the most elegant solution?

The filename needs to be valid on multiple operating systems (Windows, Linux and Mac OS) – it's an MP3 file in my library with the song title as the filename, and is shared and backed up between 3 machines.

Best Solution

You can look at the Django framework for how they create a "slug" from arbitrary text. A slug is URL- and filename- friendly.

The Django text utils define a function, slugify(), that's probably the gold standard for this kind of thing. Essentially, their code is the following.

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

And the older version:

def slugify(value):
    """
    Normalizes string, converts to lowercase, removes non-alpha characters,
    and converts spaces to hyphens.
    """
    import unicodedata
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
    value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
    value = unicode(re.sub('[-\s]+', '-', value))
    # ...
    return value

There's more, but I left it out, since it doesn't address slugification, but escaping.