RelaxNGCC Algorithm

This page describes how RelaxNGCC generates Java source code. It also assumes you are familiar with automaton.

This explanation is not a detailed algorithm but an outline. If you hava any question about internal behavior of RelaxNGCC, please analyse source code or send me via mail.

Step 1 Dividing into scopes

A scope is a block that consists of elements, attributes, and data defined by a start or a define element of a RelaxNG grammar. A grammar is devided into one or more scopes by processing include elements and combine attributes.

RelaxNGCC generates an automaton and a Java class per a scope.

Step 2 Constructing automaton

Next step is generating an automaton about each scope. In this automaton, there are following 5 kinds of alphabet.

The kinds "startElement", "endElement", and "attribute" have properties about namespace-URI and localname. We call the pair of the namespace-URI and the localname "name" simply. Additionally, we introduce an expression "startElement[x]" for a startElement type alphabet whose name is "x".

A "ref" type alphabet is identified by a scope it refers to. We introduce an expression ref[S] for a ref type alphabet that refers to a scope "S" and another expression scope(r) for a scope that is referred to by a ref type alphabet "r".

On the other hand, the value type alphabet has a datatype property. The data type is detailed in W3C XML Schema Part2, and RelaxNGCC uses datatype.

For example, following scope

<define name="foo">
  <element name="x">
    <oneOrMore>
      <element name="y"><text/></element>
    </oneOrMore>
    <attribute name="a"><text/></attribute>
  </element>
</define>

generates following automaton

  startE[x]   startE[y]   value[string]  endE[y]     attribute[a]   endE[x]
S --------> O --------> O -----------> O --------> O ------------> O ------> F
                        ∧                         |
                        |__________________________|
                               startE[y]

where S is the initial state and F is the final state.

Here is another example.

<define name="foo">
  <zeroOrMore>
    <element name="y"><data type="int"/></element>
  </zeroOrMore>
  <choice>
    <ref name="a"/><
    <ref name="b"/><
  </choice>
</define>
                     startE[y]
                ------------------------
                |                      |
      startE[y] v value[int]   endE[y] |
|-- S --------> O ---------> O ------> O --
|   |                                  |  |
|   |   ref[a]           ref[a]        |  |
|   ------------> F <-------------------  |
----------------> F <----------------------
     ref[b]              ref[b]

Note that there are no alphabets that represent the end of attribute.
We introduce a expression tr(x,a,y) for a transition from a state x to y with an alphabet a.

Step 3 Adding actions defined by user

The result source code generated by RelaxNGCC must execute the code fragment embedded by the user at appropriate moment. This moment is associated with a transition of the automaton constructed in Step 2.

<define name="foo">
  <element name="x">
    <oneOrMore>
      <element name="y"><text c:alias="t"/><c:java>System.out.println(t);</c:java></element>
    </oneOrMore>
    <attribute name="a"><text/></attribute>
  </element>
</define>
  start "x"   start "y"   value[string]  end "y"     attribute "a"   end "x"
S --------> O --------> O -----------> O --------> O ------------> O ------> F
                        ^                          |
                        |__________________________|
                               start "y"

In the example above, the code fragment System.out.println(t); is executed at the emphasized transition. Thus, RelaxNGCC associates the code fragment embedded by user with transitions of automaton.

Step 4 FIRST and FOLLOW

For each scope S, we define a set of alphabets FIRST(S) as ollows.

a ∈ FIRST(S) ⇔
  A Transition (initial(S),a,j) exists, where a is not "ref" type. Or,
  a ∈ FIRST(S'), and a transition (initial(S),r,j) exists where r is a ref type alphabet and scope(r)=S'.

Note that initial(S) is the initial state of the automaton constructed from S.

Next, we define a set of alphabet FOLLOW(S) as follows.

a ∈ FOLLOW(S) ⇔ If there is a ref type alphabet r in a scope S' where scope(r)=S, two transition tr(x,r,y) and tr(y,a,z) must exist in the scope S.

Frankly speaking, FIRST alphabet notifies start of an scope and FOLLOW alphabet notifies end of an scope. I borrowed the concept of FIRST and FOLLOW from the compiler theory about LR context free grammar. RelaxNGCC computes FIRST and FOLLOW for all scopes.

Step 5 Determinacy check

As for RelaxNGCC, the automaton constructed by each scope must be deterministic. It is possible theoretically to avoid this restriction by looking ahead more tokens, but it is hard to implement.

An automaton generated by a scope X is non-deterministic if a state O that satisfies any of following conditions exists.

  1. Multiple transition from O with an identical alphabet exist.
  2. A transition from O with a ref type alphabet r exists, and another transition from O with an alphabet in FIRST(scope(r)).
  3. Two ref type transition from O with ref type alphabet r and r' exist, and the intersection of FIRST(scope(r)) and FIRST(scope(r')) is not empty.
  4. O is an accept state and a transition from O with an alphabet in FOLLOW(X) exists. This condition is not irrelevant to the determinacy of the automaton, but violation of this condition exceeds the ability of RelaxNGCC since it causes ambifuousity of end of scope.

Here is an example of violating the second condition. At the head of choice a transition with startElement[x] is allowd, though startElement[x] is contained the FIRST of the scope with name a.

<define name="a">
  <element name="x">...</element>
</define>
<start>
  <choice>
    <element name="x">...</element>
    <ref name="a"/>
  </choice>
</start>

Here is an example of violating the fourth condition. A transition with startElement[x] is allowed from the accept state of the scope a, however the line (*) indicates the FOLLOW of the scope contains startElement[x].

<define name="a">
  <oneOrMore><element name="x">...</element></oneOrMore>
</define>
<start>
  <ref name="a"/>
  <element name="x">...</element> (*)
</start>

If the determinacy check is failed, RelaxNGCC outputs warning message and quits.

Step 6 Generating code

Then RelaxNGCC generates source code. The trigger of transition is SAX events such as startElement, endElement, and so on. But a special rule is needed since attributes are notified only at startElement.

Additionally, here is rules of the start and end of scope.

In terms of SAX, the ContentHandler is replaced at the start of a scope, and it is recovered at the end of the scope.


RelaxNGCC home