PHP – Routine for removing ALL junk from incoming strings

php sanitization string

Sometimes when a user is copying and pasting data into an input form we get characters like the following:

didn’t,“ for beginning quotes and †for end quote, etc …

I use this routine to sanitize most input on web forms (I wrote it a while ago but am also looking for improvements):

function fnSanitizePost($data) // escapes, strips and trims all members of the POST array
{
    if (is_array($data))
    {
        $areturn = array();
        foreach ($data as $skey => $svalue)
        {
            $areturn[$skey] = fnSanitizePost($svalue);
        }
        return $areturn;
    }
    else
    {
        if (!is_numeric($data))
        {
            // With magic quotes on, the input gets escaped twice, so we have to strip those slashes.
            // Leaving slashes in the data stored in your database is a bad idea.
            if (get_magic_quotes_gpc()) // gets the current configuration setting of magic quotes
            {
                $data = stripslashes($data);
            }
            $data = pg_escape_string($data); // escapes a string for insertion into the database
            $data = strip_tags($data); // strips HTML and PHP tags from a string
        }
        $data = trim($data); // trims whitespace from the beginning and end of a string
        return $data;
    }
}

I really want to prevent characters like the ones I mention above from ever being stored in the database. Do I need to add some regex replacements to my sanitizing routine?

Thanks,

- Nicholas

Best Solution

didn’t,“ for beginning quotes and †for end quote

That's not junk; those are legitimate “smart quote” characters that were passed to you encoded as UTF-8 but read, incorrectly, as ISO-8859-1.
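To see how this happens, here is a minimal sketch that reproduces the garbling. It assumes the mbstring extension is available and that the misreading side behaves like Windows-1252, which is how browsers treat a page labelled ISO-8859-1. The function name showMisread is made up for illustration.

```php
<?php
// Hypothetical helper: take bytes that are really UTF-8, pretend they are
// Windows-1252, and convert that misreading to UTF-8 so it can be displayed.
function showMisread($utf8Bytes)
{
    return mb_convert_encoding($utf8Bytes, 'UTF-8', 'Windows-1252');
}

// U+2019 RIGHT SINGLE QUOTATION MARK is the three bytes E2 80 99 in UTF-8.
$apostrophe = "\xE2\x80\x99";

// Each of the three bytes is shown as its own character, so the single
// apostrophe turns into three junk characters.
echo showMisread("didn" . $apostrophe . "t"), "\n";
```

The bytes themselves were never damaged; only the label used to read them was wrong, which is why fixing the declared encoding is a better cure than filtering the characters out.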

You can try to get rid of them or try to parse them into plain old Latin-1 using utf8_decode, but if you do you'll have an application that won't let you type anything outside ASCII, which in this day and age is a pretty poor show.

Better, if you can manage it, is to have all your pages served as UTF-8, all your form submissions coming in as UTF-8, and all your database contents stored as UTF-8. Ideally, your application would work internally with all Unicode characters, but unfortunately PHP as a language doesn't have native Unicode strings. So it's usually a case of holding all your strings as UTF-8 too, and taking the risk of occasionally truncating a UTF-8 sequence and getting a �, unless you want to grapple with mbstring.
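As a rough sketch of what that looks like in practice: declare UTF-8 on output, check that input really is valid UTF-8, and cut strings by character rather than by byte. This assumes mbstring is loaded; the helper utf8OrNull and the sample word are invented for the example.

```php
<?php
// Declare the page as UTF-8; browsers normally submit forms in the page's encoding.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper: return the string if it is well-formed UTF-8, otherwise null.
function utf8OrNull($s)
{
    return mb_check_encoding($s, 'UTF-8') ? $s : null;
}

$word = "na\xC3\xAFve"; // "naïve": the ï takes two bytes in UTF-8

$byteCut = substr($word, 0, 3);             // byte-based: cuts through the middle of ï
$charCut = mb_substr($word, 0, 3, 'UTF-8'); // character-based: keeps ï whole

var_dump(utf8OrNull($byteCut)); // NULL, the byte cut is no longer valid UTF-8
var_dump(utf8OrNull($charCut) === $charCut); // bool(true)
```

On the Postgres side, pg_set_client_encoding($conn, 'UTF8') is the matching call for the connection, assuming $conn comes from pg_connect().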

$data = pg_escape_string($data); //escapes a string for insertion into the database

$data = strip_tags($data); //strips HTML and PHP tags from a string

You don't want to do that as a sanitisation measure coming into your application. Keep all your strings in plain text form for handling them, then pg_escape_string() only on the way out to a Postgres query, and htmlspecialchars() only on the way out to an HTML page.

Otherwise you'll get weird things like SQL escapes appearing on variables that have passed straight through the script to the output page, and no-one will be able to use a plain less-than character.
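A minimal sketch of that escape-on-the-way-out approach follows. It assumes an open pgsql connection in $conn and a made-up table named comments; the helper names h and buildInsert are hypothetical.

```php
<?php
// Hypothetical helper: escape plain text for an HTML page, at the point of output.
function h($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Hypothetical helper: escape plain text for a Postgres query, at the point of querying.
// $conn is assumed to be an open connection from pg_connect(); not called in this sketch.
function buildInsert($conn, $text)
{
    return "INSERT INTO comments (body) VALUES ('" . pg_escape_string($conn, $text) . "')";
}

// Inside the application the value stays as plain text, less-than sign and all.
$comment = "1 < 2 & it's \"fine\"";

// Only the copy written into the page is HTML-escaped.
echo '<p>', h($comment), '</p>', "\n";
```

Because $comment itself is never modified, the same variable can go to the database, a plain-text email, or the page, each with the escaping that destination needs.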

One thing you can usefully do as a sanitisation measure is to remove any control codes in strings (other than newlines, \n, which you might conceivably want).

$data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $data);
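Wrapped up as a replacement for the recursive routine in the question, that might look like the sketch below. It assumes the intent is to drop the whole C0 control range (\x00 to \x1F) plus DEL while keeping \n, so the upper bound is written as \x1F. The name stripControlCodes is invented here.

```php
<?php
// Hypothetical helper: strip control codes (except \n) and trim, recursing into arrays.
function stripControlCodes($data)
{
    if (is_array($data)) {
        return array_map('stripControlCodes', $data);
    }
    // Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence,
    // so this byte-level pattern cannot damage non-ASCII characters.
    $data = preg_replace('/[\x00-\x09\x0B-\x1F\x7F]/', '', $data);
    return trim($data);
}

$clean = stripControlCodes(array(
    'name' => " Nicholas\x00\r\n",
    'tags' => array("a\tb", "c\x1Fd"),
));
print_r($clean);
```

Note that this does no SQL or HTML escaping at all; it only removes bytes that have no business in a text field.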