JavaMan wrote:
I am trying to parse C++/Java style source files and would like to isolate comment, string literal, and whitespace as tokens.
Shouldn't you match char literals as well? Consider:
char c = '"';
The double quote should not be considered as the start of a string literal!
JavaMan wrote:
In short, the 2 pairs /**/ and "" will interfere with one another.
Err, no. If a /* is "seen" first, it would consume all the way to the first */. For input like:
/* comment...."comment that looks like a string literal"...more comment */
this would mean the double quotes are also consumed. The same for string literals: when a double quote is seen first, the /* and/or */ would be consumed until the next (un-escaped) " is encountered.
Or did I misunderstand?
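To make the "whichever delimiter is seen first wins" point concrete, here is a minimal hand-written sketch in plain Java. It is not what ANTLR generates; the class name FirstDelimiterWins and the method scan are made up for illustration, and it deliberately ignores char literals and // comments:

```java
import java.util.ArrayList;
import java.util.List;

public class FirstDelimiterWins {
    // Splits the input into COMMENT, STRING and OTHER chunks. Whichever opening
    // delimiter is reached first decides how the following characters are read.
    public static List<String> scan(String src) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            int start = i;
            if (src.startsWith("/*", i)) {
                // Inside a comment, quotes are just ordinary characters.
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? src.length() : end + 2;
                out.add("COMMENT " + src.substring(start, i));
            } else if (src.charAt(i) == '"') {
                // Inside a string, /* and */ are just ordinary characters.
                i++;
                while (i < src.length() && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\') {
                        i++; // skip the escaped character as well
                    }
                    i++;
                }
                i = Math.min(i + 1, src.length());
                out.add("STRING " + src.substring(start, i));
            } else {
                // Everything up to the next opening delimiter.
                while (i < src.length() && !src.startsWith("/*", i) && src.charAt(i) != '"') {
                    i++;
                }
                out.add("OTHER " + src.substring(start, i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String chunk : scan("/* a \"b\" c */ \"x /* y */ z\"")) {
            System.out.println(chunk);
        }
    }
}
```

For that sample input the scanner yields one COMMENT chunk, one OTHER chunk holding the single space, and one STRING chunk, so the two delimiter pairs never interfere.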
Note that you can drop the options {greedy=false;}: from your grammar before .* or .+, which are ungreedy by default.
Here's a way:
grammar T;

parse
  :  (t=.
       {
         if($t.type != OTHER) {
           System.out.printf("\%-10s >\%s<\n", tokenNames[$t.type], $t.text);
         }
       }
     )+
     EOF
  ;

ML_COMMENT
  :  '/*' .* '*/'
  ;

SL_COMMENT
  :  '//' ~('\r' | '\n')*
  ;

STRING
  :  '"' (STR_ESC | ~('\\' | '"' | '\r' | '\n'))* '"'
  ;

CHAR
  :  '\'' (CH_ESC | ~('\\' | '\'' | '\r' | '\n')) '\''
  ;

SPACE
  :  (' ' | '\t' | '\r' | '\n')+
  ;

OTHER
  :  . // fall-through rule: matches any char if none of the above matched
  ;

fragment STR_ESC
  :  '\\' ('\\' | '"' | 't' | 'n' | 'r') // add more: Unicode escapes, ...
  ;

fragment CH_ESC
  :  '\\' ('\\' | '\'' | 't' | 'n' | 'r') // add more: Unicode escapes, octal, ...
  ;
which can be tested with:
import org.antlr.runtime.*;

public class Main {
  public static void main(String[] args) throws Exception {
    String source =
        "String s = \" foo \\t /* bar */ baz\";\n" +
        "char c = '\"'; // comment /* here\n" +
        "/* multi \"no string\"\n" +
        " line */";
    System.out.println(source + "\n-------------------------");
    TLexer lexer = new TLexer(new ANTLRStringStream(source));
    TParser parser = new TParser(new CommonTokenStream(lexer));
    parser.parse();
  }
}
If you run the class above, the following is printed to the console:
String s = " foo \t /* bar */ baz";
char c = '"'; // comment /* here
/* multi "no string"
line */
-------------------------
SPACE > <
SPACE > <
SPACE > <
STRING >" foo \t /* bar */ baz"<
SPACE >
<
SPACE > <
SPACE > <
SPACE > <
CHAR >'"'<
SPACE > <
SL_COMMENT >// comment /* here<
SPACE >
<
ML_COMMENT >/* multi "no string"
line */<
I highly recommend correcting all instances of this warning in code of any importance.
This warning was created (by me actually) to alert you to situations like the following:
shiftExpr : ID (('<<' | '>>') ID)?;
Since ANTLR 4 encourages action code to be written in separate files in the target language instead of being embedded directly in the grammar, it's important to be able to distinguish between << and >>. If tokens are not explicitly created for these operators, they will be assigned arbitrary types and no named constants will be available for referencing them.
This warning also helps avoid the following problematic situations:
- A parser rule contains a misspelled token reference. Without the warning, this could lead to silent creation of an additional token that may never be matched.
- A parser rule contains an unintentional token reference, such as the following:
number : zero | INTEGER;
zero : '0'; // <-- this implicit definition causes 0 to get its own token
Best Solution
The problem is that you seem to want to perform both syntactic and semantic checking in your lexer and/or your parser. It's a common mistake, and something that is only possible in very simple languages.
What you really need to do is accept more broadly in the lexer and parser, and then perform semantic checks. How strict you are in your lexing is up to you, but you have two basic options, depending on whether or not you need to accept zeroes preceding your days of the month: (1) be very permissive with your INTs, or (2) define DATENUM to accept only those tokens that are valid days but not valid INTs. I recommend the second because fewer semantic checks will be necessary later in the code (since INTs will then be verifiable at the syntax level and you'll only need to perform semantic checks on your dates). The first approach:
The second approach:
After accepting using these rules in the lexer, your date field would be either:
or:
After that, you would perform a semantic run over your AST to make sure that your dates are valid.
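As a rough sketch of what such a semantic check could look like in Java, assume the walk over the AST has already pulled the year, month and day out of the date field as integers. The class name DateCheck and the method isValidDate are hypothetical and not part of ANTLR; the check itself simply relies on java.time.LocalDate rejecting impossible dates:

```java
import java.time.DateTimeException;
import java.time.LocalDate;

public class DateCheck {
    // Returns true when year/month/day name a real calendar date.
    // LocalDate.of throws DateTimeException for values such as month 13
    // or February 30, so the lexer and parser can stay permissive.
    public static boolean isValidDate(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidDate(2024, 2, 29)); // leap year
        System.out.println(isValidDate(2023, 2, 29)); // not a leap year
        System.out.println(isValidDate(2023, 13, 1)); // no 13th month
    }
}
```

Keeping this check outside the grammar means the grammar itself stays free of target-language code.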
If you're dead set on performing semantic checks in your grammar, however, ANTLR allows semantic predicates in the parser, so you could make a date field that checks the values like this:
When you do this, however, you are embedding language-specific code in your grammar, and it won't be portable across targets.