Java – Recognizing extended characters using JavaCC

I'm creating a grammar using JavaCC and have run across a small problem. I'm trying to allow any valid character in the extended ASCII set to be recognized by the resulting compiler. After looking at some JavaCC examples (primarily the example showing the JavaCC grammar itself), I set up the following token to recognize my characters:

< CHARACTER:

  (   (~["'"," ","\\","\n","\r"])
    | ("\\"
        ( ["n","t","b","r","f","\\","'","\""]
        | ["0"-"7"] ( ["0"-"7"] )?
        | ["0"-"3"] ["0"-"7"] ["0"-"7"]
        )
      )
  )

>

If I'm understanding this correctly, it should match the octal representation of every ASCII character, from 0 to 377 (which covers all 256 characters in the extended ASCII set). This performs as expected for all keyboard characters (a-z, 0-9, ?, ., / etc.) and even for most special characters (©, ¬, ®).
However, whenever I attempt to parse the 'trademark' symbol (™) my parser continually throws an End of File exception, indicating that it is unable to recognize the symbol. Is there some obvious way that I can enhance my definition of a character to allow the trademark symbol to be accepted?

Best Solution

I had a similar issue recognizing special symbols in a text file (either CP1252 or ISO-8859-1 encoded) that was read into a String before parsing. My solution was to add the UNICODE_INPUT option to the grammar header:

options {
  UNICODE_INPUT=true;
}

Worked like a breeze.
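
A possible explanation for why ™ fails while ©, ¬ and ® work (this is my reading, not something taken from the JavaCC documentation): once the text is in a Java String, ©, ¬ and ® all have code points below 256, while ™ is U+2122, far outside the 0 to 377 octal range that the CHARACTER token in the question was built around. Below is a minimal Java sketch that only prints those values. The class and method names are made up for illustration and are not part of JavaCC.

```java
class ExtendedCharCheck {

    // Largest value an octal escape of up to three digits (max 377) can express.
    static final int MAX_OCTAL_ESCAPE = Integer.parseInt("377", 8); // 255

    // True when the code point fits in a single 8-bit value (0 to 255).
    static boolean fitsInEightBits(int codePoint) {
        return codePoint >= 0 && codePoint <= MAX_OCTAL_ESCAPE;
    }

    public static void main(String[] args) {
        // Code points of: copyright sign, not sign, registered sign, trade mark sign.
        int[] symbols = { 0x00A9, 0x00AC, 0x00AE, 0x2122 };
        for (int cp : symbols) {
            System.out.printf("U+%04X = %d, fits in 8 bits: %b%n",
                    cp, cp, fitsInEightBits(cp));
        }
    }
}
```

Running it reports 169, 172 and 174 for the first three symbols (all within 8 bits) and 8482 for the trademark sign (not within 8 bits), which would be consistent with ™ being the only one of the four that needed UNICODE_INPUT.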

More information on JavaCC options: http://javacc.java.net/doc/javaccgrm.html