PHP utf8 problem

php utf-8

I have a problem comparing a UTF-8 character against an array of Norwegian characters.

All characters except the special Norwegian characters (æ, ø, å) work fine.

function isNorwegianChar($Char)
{
    $aNorwegianChars = array('a', 'A', 'b', 'B', 'c', 'C', 'd', 'D', 'e', 'E', 'f', 'F', 'g', 'G', 'h', 'H', 'i', 'I', 'j', 'J', 'k', 'K', 'l', 'L', 'm', 'M', 'n', 'N', 'o', 'O', 'p', 'P', 'q', 'Q', 'r', 'R', 's', 'S', 't', 'T', 'u', 'U', 'v', 'V', 'w', 'W', 'x', 'X', 'y', 'Y', 'z', 'Z', 'æ', 'Æ', 'ø', 'Ø', 'å', 'Å', '=', '(', ')', ' ', '-');
    $iArrayLength = count($aNorwegianChars);

    for($iCount = 0; $iCount < $iArrayLength; $iCount++)
    {
        if($aNorwegianChars[$iCount] == $Char)
        {
            return true;
        }
    }

    return false;

}

If anyone has any idea what I can do, please let me know.

Update:

The reason I need this is that I'm trying to parse a text file that contains lines with Norwegian and Chinese words, like a dictionary. I want to split each line into two strings, one containing the Norwegian word and one containing the Chinese. These will later be inserted into a database. Example lines:

impulsiv 形 衝動的

imøtegå 動 反對,反駁

imøtekomme 動 符合

alkoholmisbruk(er) 名 濫用酒精 (名 濫用酒精的人)

alkoholpåvirket 形 受酒精影響的

alkotest 名 呼吸性酒精測試

alkymi(st) 名 煉金術 (名 煉金術士)

all, alt, alle, 形 全部, 所有

As you can see, there may be spaces between the words, so I cannot use something simple like explode to split the Norwegian words from the Chinese ones. Instead, I use isNorwegianChar and loop through the line until I find a character that is not in the array.

The problem is that æ, ø and å are not recognized as Norwegian characters, so the code thinks the Chinese word has started.

Here is the code:

// Open the file.
$rFile = fopen("norsk-kinesisk.txt", "r");

// Loop through the file.
$Count = 0;
while(!feof($rFile))
{
    if(40== $Count)
    {
        break;
    }

    $sLine = fgets($rFile);

    if(0 == $Count)
    {
        $sLine = mb_substr($sLine, 3);
    }

    $iLineLength        = strlen($sLine);
    $bChineseHasStarted = false;
    $sNorwegianWord     = '';
    $sChineseWord       = '';
    for($iCount2 = 0; $iCount2 < $iLineLength; $iCount2++)
    {
        $char = mb_substr($sLine, $iCount2, 1);

        if(($bChineseHasStarted === false) && (false == isNorwegianChar($char)))
        {
            $bChineseHasStarted = true;
        }

        if(false === $bChineseHasStarted)
        {
            $sNorwegianWord .= $char;
        }
        else
        {
            $sChineseWord .= $char;
        }

        //echo $char;
    }

    $sNorwegianWord = trim($sNorwegianWord);
    $sChineseWord = trim($sChineseWord);

    $Count++;
}

fclose($rFile);

Best Solution

If your PHP script file is saved in an ANSI encoding (such as ISO-8859-1 or Windows-1252) instead of UTF-8, then the Norwegian characters in your array are stored as different bytes than the same characters in UTF-8 input. Since PHP is a byte-processing language, not a text-processing language, it duly compares the byte sequences and concludes they don't match.
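To make the byte-level difference concrete, here is a minimal sketch. It assumes the "ANSI" encoding is ISO-8859-1, and it writes the character ø with byte escapes so the result does not depend on how the example file itself is saved. The variable names are only for illustration.

```php
<?php
// "ø" as it would be stored by a UTF-8 file versus an ISO-8859-1 file.
// The encoding of your own script is an assumption here; check it in your editor.
$utf8Char  = "\xC3\xB8"; // ø in UTF-8: two bytes
$latinChar = "\xF8";     // ø in ISO-8859-1: one byte

echo bin2hex($utf8Char), "\n";  // c3b8
echo bin2hex($latinChar), "\n"; // f8

// PHP compares the raw bytes, so the "same" character does not match.
var_dump($utf8Char == $latinChar); // bool(false)
```

Running bin2hex on a character taken from your file and on the matching entry in the array is a quick way to see whether the two sides really use the same encoding.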

To resolve this, either save your PHP script in the same encoding as the data you're comparing against, or use the iconv or mbstring libraries to convert one side to the encoding of the other.
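As a rough illustration of the conversion route, the sketch below converts a single ISO-8859-1 byte to UTF-8 with both libraries. It assumes the mbstring and iconv extensions are enabled and that the source encoding really is ISO-8859-1; substitute whatever encoding your script or data actually uses.

```php
<?php
// ø as a single ISO-8859-1 byte (the source encoding is an assumption).
$latinChar = "\xF8";

// mbstring: mb_convert_encoding(string, to_encoding, from_encoding)
$viaMbstring = mb_convert_encoding($latinChar, 'UTF-8', 'ISO-8859-1');

// iconv: iconv(from_encoding, to_encoding, string)
$viaIconv = iconv('ISO-8859-1', 'UTF-8', $latinChar);

// Both results are now the two-byte UTF-8 form of ø.
var_dump($viaMbstring === "\xC3\xB8"); // bool(true)
var_dump($viaIconv === "\xC3\xB8");    // bool(true)
```

Converting once, at the point where the data enters the script, tends to be simpler than converting inside every comparison.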

Also, if you haven't read it, read this: http://www.joelonsoftware.com/articles/Unicode.html

Update:
Another point to take into account: make sure that what you're passing into this function is what you think it is. If you're looping across a string one character at a time with the array indexing operator, it won't work, because a UTF-8 string may use two or more bytes (two or more array index positions) to store one character. There are functions in mbstring to copy out text from strings based on character positions, not byte positions.
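A minimal sketch of the difference between byte positions and character positions, assuming the mbstring extension is available. The sample word is taken from the example lines in the question and is written with byte escapes so the example does not depend on the file's own encoding.

```php
<?php
// "imøtegå" in UTF-8; ø and å take two bytes each.
$line = "im\xC3\xB8teg\xC3\xA5";

echo strlen($line), "\n";             // 9 (bytes)
echo mb_strlen($line, 'UTF-8'), "\n"; // 7 (characters)

// Byte indexing returns only the first half of ø.
echo bin2hex($line[2]), "\n"; // c3

// Character-based loop: bound and extraction both use mbstring,
// with the encoding passed explicitly instead of relying on defaults.
$chars  = [];
$length = mb_strlen($line, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $chars[] = mb_substr($line, $i, 1, 'UTF-8');
}

var_dump($chars[2] === "\xC3\xB8"); // bool(true): the whole character ø
```

Note that the loop bound comes from mb_strlen rather than strlen, so the number of iterations matches the number of characters that mb_substr can return.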