« WRF does CF - Part... | Main | Chunking Data: Why... »

Integrating a SAX Parser with the Bison Parser Generator

11 November 2012

[Update: As of bison version 3, the Java push-pull mechanism is in the standard release].

A SAX (Simple API for XML) parser is a particular mechanism for parsing XML documents. Using a SAX parser has the advantage over the DOM-based parser in that it is not necessary to build the explicit DOM tree. On the other hand, it can be difficult to build a SAX parser because it requires management of complex state.

Combining SAX parsing with a GNU Bison generated parser is appealing because it allows the Bison parser to manage all of the state. Additionally, the .y file encapsulates the equivalent of a DTD but in a much more readable form. The combination makes using SAX parsing a lot simpler.

The SAX parser operates by generating events representing "tokens" from the XML document. Consider for example, this document.

<element1>
  <element2>
  </element2>
</element1>

The SAX parser would typically generate the following events:

startDocument
startElement for element1
startElement for element2
endElement for element2
endElement for element1
endDocument

In practice, the set of possible events is more extensive, although for any given class of XML document, many of these events can be ignored.

The controlling SAX parser invokes callback procedures in a user-supplied handler class. For each event type, a specific method is called in the handle class. For an example of such a handler class, see this example.

https://www.saxproject.org/apidoc/org/xml/sax/helpers/DefaultHandler.html

The Gnu Bison parser generator (https://www.gnu.org/software/bison/) has a notion called "push-parsing" where the parser is fed tokens one by one. This is essentially the same model as a SAX parser where the SAX events serve the role of tokens and the SAX parser is, in effect, a push-parser in Bison terms.

The current Bison parsers (versions 2.6.3 and earlier) do not support Java push parsing. However, the author has contributed a Java push parser skeleton to the Bison project that supports Java push parsing. This skeleton is not yet in any distributed release, but will appear at some point.

~~In the meantime, the required skeleton can be obtained at this URL.~~

~~https://www.unidata.ucar.edu/staff/dmh/lalr1.push~~

~~and can be used with Bison version 2.6.4 using the "bison -S" flag.~~

As of Bison version 3, Java push parsing is supported in the standard release.

The primary interface to the Bison generated push-parser uses this method call.

public int push_parse(int yylextoken, Object yylexval)

where the yylextoken is the integer representing some Bison parser token and yylexval is some state information about that token.

SAX Parser Operation

The notional architecture stack for the SAX plus Bison parser looks like the following.

     -------------------
           Main         
     -------------------
         SAX Parser    
     -------------------
        Event Handler   
     -------------------
      Bison push-parser
     -------------------

Using a SAX parser with a Bison Java push-parser operates by creating a SAX parser and giving it a event handler specific to the Bison generated parser. The SAX parser is invoked, and as the SAX parser processes the XML document, it creates events and the handler is called for each event.

In turn, the handler does two things.

The handler converts each event into a Bison token and an associated state object.
The Bison parser push_parser method is invoked with that Bison token and that object.

The push_parser method can return one of three values:

YYABORT – this signals that the parse failed. In turn, the handler should throw an exception to cause the SAX parser to stop.
YYPUSH_MORE – this signals that the parse needs more input to continue.
YYACCEPT – this signals that the parse is complete. If the SAX parser continues to send events after the parser has accepted, then this should be treated as a parse error.

Any exceptions generated by either the event handler or the Bison push parser should be rewrapped as a SAXException or an IOException so that they propagate up the stack correctly.

Mapping Events to Tokens plus State

Consider this Bison grammar.

%token A_ _A B_ _B ATTR1 ATTR2
%start a
%%
a: A_ b _A ;

b: B_ attrlist _B

attrlist: /*empty*/ | attrlist ATTR1 | attrlist ATTR2 ;

The token A_ is intended to come from an occurrence of the element <A>. Similarly, the token _A is intended to come from an occurrence of the element </A>.

An example of an associated XML document might be as follows.

<A>
  <B attr1="value">
  </B>
</A>

Now consider the startElement event handling method. It might look like this.

01 public void
02 startElement(String nsuri, String name, String qualname, Attributes attributes)
03     throws SAXException
04 {
05     if(parseaccepted)
06         throw new SAXException("Events occur after parse is complete");
07     int token = 0;
08     if(name.equals("A"))
09         token = A_;
10     else if(name.equals("B"))
11         token = B_;
12     if(token == 0)
13         throw new SAXException("Unexpected attribute");
14     Lval state = new Lval();
15     state.name = name;
16     state.qualname = qualname;
17     state.nsuri = nsuri;
18     switch (bisonparser.push_parse(token,state)) {
19     case YYPUSH_MORE: break;
20     case YYACCEPT: parseaccepted = true; break;
21     case YYABORT:
22         throw new SAXException("Parser aborted");
23     }
24     if(token = B_) {
25         // pass any attributes as tokens 
26        int nattr = attributes.getLength();
27        for(int i=0;i<nattr;i++) {
28            token = 0;
29            String attributename = attributes.getLocalName(i);
30            if(attributename.equals("attr1"))
31                token = ATTR1;
32            else if(attributename.equals("attr2"))
33                token = ATTR2;
34            if(token == 0)
35                throw new SAXException("Unexpected attribute");
36            String value = attributes.getValue(i);
37            state = new Lval();
38            state.attributename = attributename;
39            state.value = value;
40            switch (bisonparser.push_parse(token,state)) {
41                case YYPUSH_MORE: break;
42                case YYACCEPT: parseaccepted = true; break;
43                case YYABORT:
44                    throw new SAXException("Parser aborted");
45            }
46        }
47     }
48 }

The first action is to test a flag (parseaccepted). If that flag is set, then the Bison parser has indicated that it thinks the parse is complete. Since that appears to differ from the "opinion" of the SAX parser, it is necessary to signal an exception (lines 5-6).

The next action (lines 8-11) is to convert the element name to a specific Bison token. The linear search shown can be improved using, say, a hash table to map the element name to a specific Bison token integer. If there is no name match, then an exception is thrown (lines 12-13).

To go along with the token, a state object (of class Lval, not shown) is created and any arguments to startElement that are of interest are copied into that Lval object (lines 14-17).

Now there is a token and associated state, so the bison push_parser method can be called (line 18).

In lines 19-23, the return value from the push_parser method is tested. If YYABORT is returned, then an exception should be thrown. If YYACCEPT is returned, then the parseaccepted flag is set to prevent further calls to the event handler. If YYPUSH_MORE is returned, then all is well, and the parse will continue.

There is an additional complication: the element may have associated attributes passed in as an argument to startElement. For each such attribute, a token and state object must be created and passed into the Bison push parser.

As with the element name, the attribute name is used to determine the corresponding Bison token (lines 29-33). If no match is found, then an exception is thrown (lines 34-35) Again, similarly, the attribute state is constructed from information taken from the attribute object (lines 36-39). Finally, the push parser is invoked with the attribute token and state as arguments (line 40) and the result checked (lines 41-44).

Other event handler are similar in structure, although usually simpler. It is instructive to look at the endDocument event handling method.

01 public void endDocument()
02     throws SAXException
03 {
04     if(parseaccepted)
05         throw new SAXException("Events occur after parse is complete");
06     switch (bisonparser.push_parse(EOF,null)) {
07     case YYACCEPT: parseaccepted = true; break;
08     case YYPUSH_MORE:
09         throw new SAXException("End Document: parser is asking for more input");
10     case YYABORT:
11         throw new SAXException("Parser aborted");
12     }
13 }

Since there should be no more events, the push parser is invoked with the EOF token to tell the Bison parser that this is the end of the token stream (line 6). The state is null because there are no arguments to endDocument.

The return result from the push parser must be YYACCEPT. Any other value indicates an error and an exception is thrown (lines 8-11).

At this point the originally invokes SAX parser should return with an indication of success (or it throws and exception).

Grammar File Format

The grammar file (the .y file) will require some initial declarations to support the use of a Bison push parser.

01 %define api.push-pull push
02 %code lexer {
03 public Object getLVal() {return null;}
04 public int yylex() {return 0;}
05 public void yyerror(String s) {System.err.println(s);}
06 }

Line 1 tells Bison that it should produce a push parser. Specifically, this will cause the generation of this method.

public int push_parse(int yylextoken, Object yylexval)

Lines 2-6 require some explanation. Technically, the Bison parser has no need for access to a lexer because any output from a lexer is passed into the Bison parser through the push_parser method arguments. However, and because of a quirk in the way Bison constructs its parsers, the parser does need access to the lexer's yyerror procedure, so we do need at least a stub lexer to the parser. This is handled by defining the lexer using the Bison %code lexer {...} mechanism. The only method in the lexer interface that will be called is yyerror, so the others are stubs. The actual action of yyerror can, of course, be defined as desired.

Invoking bison

Given a .y grammar file as specified above, ~~the only special flag that is needed is -S. So~~ bison is invoked as follows.

    bison -Ljava -S lalr1.push <.y file name>
    bison -Ljava <.y file name>

Conclusion

Now that Bison can generate Java push parsers, SAX parsing becomes a lot simpler. Traditional SAX parsers require maintenance of complex state to track the state of the parser. Here, the Bison generated parse does all that automatically. The result is a much more compact definition and implementation of a SAX parser.

Posted by $entry.creator.screenName