RelaxNGCC Algorithm

This page describes how RelaxNGCC generates Java source code. It also assumes you are familiar with automaton.

This explanation is not a detailed algorithm but an outline. If you hava any question about internal behavior of RelaxNGCC, please analyse source code or send me via mail.

Step 1　Dividing into scopes

A scope is a block that consists of elements, attributes, and data defined by a start or a define element of a RelaxNG grammar. A grammar is devided into one or more scopes by processing include elements and combine attributes.

RelaxNGCC generates an automaton and a Java class per a scope.

Step 2　Constructing automaton

Next step is generating an automaton about each scope. In this automaton, there are following 5 kinds of alphabet.

startElement
endElement
attribute
ref
value

The kinds "startElement", "endElement", and "attribute" have properties about namespace-URI and localname. We call the pair of the namespace-URI and the localname "name" simply. Additionally, we introduce an expression "startElement[x]" for a startElement type alphabet whose name is "x".

A "ref" type alphabet is identified by a scope it refers to. We introduce an expression ref[S] for a ref type alphabet that refers to a scope "S" and another expression scope(r) for a scope that is referred to by a ref type alphabet "r".

On the other hand, the value type alphabet has a datatype property. The data type is detailed in W3C XML Schema Part2, and RelaxNGCC uses datatype.

For example, following scope

<define name="foo">
  <element name="x">
    <oneOrMore>
      <element name="y"><text/></element>
    </oneOrMore>
    <attribute name="a"><text/></attribute>
  </element>
</define>

generates following automaton

  startE[x]   startE[y]   value[string]  endE[y]     attribute[a]   endE[x]
S --------> O --------> O -----------> O --------> O ------------> O ------> F
                        ∧                         |
                        |__________________________|
                               startE[y]

where S is the initial state and F is the final state.

Here is another example.

<define name="foo">
  <zeroOrMore>
    <element name="y"><data type="int"/></element>
  </zeroOrMore>
  <choice>
    <ref name="a"/><
    <ref name="b"/><
  </choice>
</define>

                     startE[y]
                ------------------------
                |                      |
      startE[y] v value[int]   endE[y] |
|-- S --------> O ---------> O ------> O --
|   |                                  |  |
|   |   ref[a]           ref[a]        |  |
|   ------------> F <-------------------  |
----------------> F <----------------------
     ref[b]              ref[b]

Note that there are no alphabets that represent the end of attribute.
We introduce a expression tr(x,a,y) for a transition from a state x to y with an alphabet a.

Step 3　Adding actions defined by user

The result source code generated by RelaxNGCC must execute the code fragment embedded by the user at appropriate moment. This moment is associated with a transition of the automaton constructed in Step 2.

<define name="foo">
  <element name="x">
    <oneOrMore>
      <element name="y"><text c:alias="t"/><c:java>System.out.println(t);</c:java></element>
    </oneOrMore>
    <attribute name="a"><text/></attribute>
  </element>
</define>

  start "x"   start "y"   value[string]  end "y"     attribute "a"   end "x"
S --------> O --------> O -----------> O --------> O ------------> O ------> F
                        ^                          |
                        |__________________________|
                               start "y"

In the example above, the code fragment System.out.println(t); is executed at the emphasized transition. Thus, RelaxNGCC associates the code fragment embedded by user with transitions of automaton.

Step 4　FIRST and FOLLOW

For each scope S, we define a set of alphabets FIRST(S) as ollows.

a ∈ FIRST(S) ⇔
A Transition (initial(S),a,j) exists, where a is not "ref" type. Or,
a ∈ FIRST(S'), and a transition (initial(S),r,j) exists where r is a ref type alphabet and scope(r)=S'.

Note that initial(S) is the initial state of the automaton constructed from S.

Next, we define a set of alphabet FOLLOW(S) as follows.

a ∈ FOLLOW(S) ⇔ If there is a ref type alphabet r in a scope S' where scope(r)=S, two transition tr(x,r,y) and tr(y,a,z) must exist in the scope S.

Frankly speaking, FIRST alphabet notifies start of an scope and FOLLOW alphabet notifies end of an scope. I borrowed the concept of FIRST and FOLLOW from the compiler theory about LR context free grammar. RelaxNGCC computes FIRST and FOLLOW for all scopes.

Step 5　Determinacy check

As for RelaxNGCC, the automaton constructed by each scope must be deterministic. It is possible theoretically to avoid this restriction by looking ahead more tokens, but it is hard to implement.

An automaton generated by a scope X is non-deterministic if a state O that satisfies any of following conditions exists.

Multiple transition from O with an identical alphabet exist.
A transition from O with a ref type alphabet r exists, and another transition from O with an alphabet in FIRST(scope(r)).
Two ref type transition from O with ref type alphabet r and r' exist, and the intersection of FIRST(scope(r)) and FIRST(scope(r')) is not empty.
O is an accept state and a transition from O with an alphabet in FOLLOW(X) exists. This condition is not irrelevant to the determinacy of the automaton, but violation of this condition exceeds the ability of RelaxNGCC since it causes ambifuousity of end of scope.

Here is an example of violating the second condition. At the head of choice a transition with startElement[x] is allowd, though startElement[x] is contained the FIRST of the scope with name a.

<define name="a">
  <element name="x">...</element>
</define>
<start>
  <choice>
    <element name="x">...</element>
    <ref name="a"/>
  </choice>
</start>

Here is an example of violating the fourth condition. A transition with startElement[x] is allowed from the accept state of the scope a, however the line (*) indicates the FOLLOW of the scope contains startElement[x].

<define name="a">
  <oneOrMore><element name="x">...</element></oneOrMore>
</define>
<start>
  <ref name="a"/>
  <element name="x">...</element> (*)
</start>

If the determinacy check is failed, RelaxNGCC outputs warning message and quits.

Step 6　Generating code

Then RelaxNGCC generates source code. The trigger of transition is SAX events such as startElement, endElement, and so on. But a special rule is needed since attributes are notified only at startElement.

When the code receives a set of attributes at startElement, the code pushes it into a stack.
At endElement the code pops the attribute set from the stack.
On every transition, the code checks if the destination state of the transition satisfies all of following conditions:
- A transition with attribute type alphabet x is allowd.
- The attribute set located on the top of the stack contains an attribute whose name is the name of x.
If so, the transitions with x and the text of the value of x are processed.

Additionally, here is rules of the start and end of scope.

If the current state O has a transition with ref type alphabet r and the code receives an alphabet x contained in FIRST(scope(r)),
- preserves the destination state P,
- sets the current state the initial state of the automaton of scope(r),
- and the transition with x is processed.
If the current state O is accept state and the code receives an alphabet x contained in FOLLOW(scope(O)),
- sets the current state the preserved state P,
- and the transition with x is processed.

In terms of SAX, the ContentHandler is replaced at the start of a scope, and it is recovered at the end of the scope.

RelaxNGCC home

RelaxNGCC Algorithm

Step 1 Dividing into scopes

Step 2 Constructing automaton

Step 3 Adding actions defined by user

Step 4 FIRST and FOLLOW