Has anyone been able to develop a syntax-free LR(k) (for finite k) grammar for parsing generics? The proposed grammar in spec8.pdf is hideous, as it has several ambiguities and even some typos. The two biggest problems I have in a context-free parser is deciding whether >> should be one or two tokens, and whether a<b starts a parameterized type name or is a relational expression. I know these two problems can be solved by writing a context-sensitive parser, but I would prefer not to do that if possible. I assume (without looking at the sources, because I'm trying to do a clean-room implementation) that the prototype compiler is using a context-sensitive parser.
I also am interested if any of the JSR folks would be willing to let the public know where the current syntax of the JSR or Java 1.5 differs from the ideas in the public draft. Remember the last minute change of the assert syntax (from "assert(a, b)" to "assert a:b"), 11 months after the public draft closed, made in order to acheive LR(1) parsing?
I've come close to writing an LR(2) grammar that gets everything except for casts. And I may even be able to trim that to LR(1) without too much headache. The problem with casts, however, is shown below:
Suppose an LR(2) parser is at "a" on this sequence:
class A { void m() { Object o = ( a < b
We could reduce "a" to ClassOrInterface, and ultimately have a cast expression, as in:
class A { void m() { Object o = ( a < b > ) null; } }
And since an infinite number of tokens can appear in both ReferenceTypeList and ShiftExpression, there is no finite way to determine which reduction to choose.
Has anyone been able to develop a syntax-free LR(k)
(for finite k) grammar for parsing generics? The
proposed grammar in spec8.pdf is hideous, as it has
several ambiguities and even some typos. The two
biggest problems I have in a context-free parser is
deciding whether >> should be one or two tokens,
This question is explicitly handled in the spec. It is always
one token.
See http://www.cs.princeton.edu/~appel/modern/java/CUP/
for how Scott Ananian handled things.
I am not interested in a parser generator targetting Java (I'm working with the Jikes compiler, where the parser generator targets C++). However, the ideas in CUP might be useful, if I ever look at it.
I concede that the grammar in the public draft, WHEN CORRECTED, will correctly parse a<b<c>> when the tokenizer always treats >> as one token (it makes the grammar messy, but it worked). Below, I list the LALR(1) grammar that I have so far, with comments on the rules I had to fix so far. This grammar still does not allow (a<b>)o (as I stated earlier, this is highly ambiguous), nor does it allow a<b>.class (but the public draft did not mention this as being legal, and I think it would return the erased type anyway, so the user should have just used a.class).
If anyone can produce an LR(k) way to allow parameterized type casts, I'd love to see it!
By the way, there was a BugParade entry I once saw that suggested an alternate syntax for casts. Of course, the original syntax would have to be preserved for casts to raw types, so that the new compiler can still parse old code. But since parameterized types are new syntax, casting to a parameterized type could safely require new syntax. The proposal was that casts be allowed with this syntax: myvar.(MyClass). The semantics are that the compile type of expression myvar is cast to type MyClass. Notice that you can then write CastExpression as:
Now parameterized type cases would have to use the new form, but you no longer have grammar ambiguity. And since casts to parameterized types should be rare (after all, the point of generics is to reduce the number of explicit casts), it is not too much to ask for this new syntax.
This grammar started from the LALR(1) version in JLS1 chapter 18. It
has been tweaked to correctly parse everything of JLS2, and to add
generics (JSR 14). Everything added or modified for generics is
preceded by the comment GENERIC. Some rules of JLS1 were simplified in the interest of a simpler parse tree with code reuse, which occaisionally requires more work during semantic processing.
// I used ClassOrInterface instead of Name, to allow things like A<B>.C
// GENERIC
ClassOrInterfaceType : ClassOrInterface
| ClassOrInterface '<'Identifier' instead of Name, to have a distinction between
// expression names and type names (since only type names can have type
// arguments).
// GENERIC
ClassOrInterface : 'Identifier'
| ClassOrInterfaceType '.' 'Identifier'
// The ArrayType rule makes more sense like this.
ArrayType : PrimitiveType Dims
| ClassOrInterfaceType Dims
// I deleted ClassType and InterfaceType for grammar simplicity.
Super : 'extends' ClassOrInterfaceType
Interfaces : 'implements' TypeList
// TypeList is the combination of InterfaceTypeList and ClassTypeList
TypeList : ClassOrInterfaceType
| TypeList ',' ClassOrInterfaceType
ClassBody : '{' ClassBodyDeclarationsopt '}'
ClassBodyDeclarations : ClassBodyDeclaration
| ClassBodyDeclarations ClassBodyDeclaration
// Since initializers are similar, I lumped them together into
// InitializerDeclaration. Likewise, since interface and class members
// are similar, I lumped them into MemberDeclaration. Of course, this
// requires more work on the semantic side of things, but makes
// parsing easier.
ClassBodyDeclaration : MemberDeclaration
| ConstructorDeclaration
| InitializerDeclaration
// Note that type declaration includes ';', the empty member declaration
MemberDeclaration : FieldDeclaration
| MethodDeclaration
| TypeDeclaration
FieldDeclaration : Modifiersopt Type VariableDeclarators ';'
VariableDeclarators : VariableDeclarator
| VariableDeclarators ',' VariableDeclarator
VariableDeclarator : VariableDeclaratorId
| VariableDeclaratorId '=' VariableInitializer
VariableDeclaratorId : 'Identifier' Dimsopt
VariableInitializer : Expression
| ArrayInitializer
// Similar to JLS2 chapter 18, I treat ExplicitConstructorInvocations
// as statements, then perform semantic filtering to make sure they
// only appear in the correct places. Then MethodBody can be shared in
// more places.
MethodDeclaration : MethodHeader MethodBody
| MethodHeader ';'
// To avoid ambiguity with FieldDeclarations, TypeParametersopt and
// ResultType must be expanded inline.
// GENERIC
MethodHeader : Modifiersopt TypeParameters Type MethodDeclarator Throwsopt
| Modifiersopt Type MethodDeclarator Throwsopt
| Modifiersopt TypeParameters 'void' MethodDeclarator Throwsopt
| Modifiersopt 'void' MethodDeclarator Throwsopt
MethodDeclarator : 'Identifier' '(' FormalParameterListopt ')' Dimsopt
FormalParameterList : FormalParameter
| FormalParameterList ',' FormalParameter
FormalParameter : 'final'opt Type VariableDeclaratorId
Throws : 'throws' TypeList
// See comment for MethodDeclaration
MethodBody : Block
InitializerDeclaration : 'static'opt MethodBody
I am not interested in a parser generator targetting Java (I'm working
with the Jikes compiler, where the parser generator targets C++).
I think you're missing the main point. CUP is an LALR(1) parser, and there's a CUP grammar for the Java language (including generics support) on that site. That means there's an LALR(1) grammar for generics there.
You missed my point. I looked at CUP, and it is a CONTEXT-SENSITIVE LALR grammar. I am interested in a CONTEXT-FREE LALR grammar.
In CUP, Scott invented the terminal PLT, a LT with lookahead for differentiating whether "(a<b" starts an expression (as in "(a<b)") or a cast to parameterized type (as in "(a<b>)o"). Yes, I could do that with my parser (and in the end I will probably have to, since no one else has come up with a better solution), but it is context sensitive - it requires the lexer to do lookahead!
I have FINALLY managed to write an LALR(1) grammar for generics, with NO LOOKAHEAD in the lexer. The solution for casts to parameterized types is rather ugly, but it works. If you are interested in seeing the grammar, it is in the CVS repository of jikes, under the generics-branch tag, in the file src/java.g: http://www-124.ibm.com/developerworks/oss/cvs/jikes/~checkout~/jikes/src/java.g?rev=1.31.2.17&content-type=text/plain
I hope to translate that grammar to CUP, and inform Scott Ananian of my success, in the next few days.
Just a quick note to this forum that I've updated my CUP grammar to fix the deficiencies Eric noted. It is still available from
http://www.cs.princeton.edu/~appel/modern/java/CUP/
and linked to from
http://cag.lcs.mit.edu/~cananian/Projects/GJ/
Thanks for the bug reports, Eric!
My grammar still uses the "smart lexer" look-ahead feature, but I've looked over Eric's work and it appears quite clever.
the rules I had to fix so far. This grammar still
does not allow (a<b>)o (as I stated earlier, this is
highly ambiguous), nor does it allow a<b>.class (but
the public draft did not mention this as being legal,
and I think it would return the erased type anyway, so
the user should have just used a.class).