xparser - Lua XML Parser

Introduction

xparser is a streaming, callback-mode XML parser written for the Lua scripting language. It is a simple parser, not a fully W3C-compliant validating parser: it (currently) supports only 8-bit ASCII input, and simply skips over any DOCTYPE declaration. The parser is not itself namespace-aware; namespace-qualified names (qnames) are passed to the caller as-is. Namespace handling is demonstrated in the sample "xml2table" module.

The basic parser is a simple state machine with three core public functions: init(), parse() and reset(). init() initialises the parser. reset() resets the internal state of the parser. parse() processes a chunk of input data and returns either "OK", "ERROR", or a callback-supplied value. "OK" is a request for more input data, while "ERROR" signals a parsing error. Note: there is no "end of document" event or return code, as an XML document can have trailing comments and PIs. parse() does, however, return a boolean flag following "OK", which indicates whether the root element has been parsed. The parser is constructed (or initialised) with a table of functions to handle parser events. These functions are called from parse() and can interrupt it, supplying any return value(s). For example, the xml2table event handlers return "DONE" when the root element has been parsed, and "ERROR" when namespace errors are detected.

Implementation

An XML parser is constructed by calling xparser.create() with a table of functions to handle parsing events. create() returns a new parser instance. Each created parser instance has its own environment, allowing simultaneous parsing by multiple parsers. The parser's :parse() member function is then called repeatedly with chunks of xml text, and the event handlers will be called as xml elements are processed. parse() returns the string "OK" on successive calls until a complete xml document has been parsed, an event handler returns a non-nil value, or an error occurs.
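
The repeated-call pattern described above can be sketched as a small driver loop. This is only an illustration: feed_chunks and the stub object are hypothetical names, and the stub merely imitates the documented "OK" / "ERROR" return convention so that the loop can run without xparser itself.

```lua
-- Feed a list of text chunks to any object that has a :parse(str) method.
-- Returns "OK", root_seen once every chunk has been consumed, or the first
-- non-"OK" result (for example "ERROR", message) as soon as it occurs.
local function feed_chunks(parser, chunks)
  local root_seen = false
  for _, chunk in ipairs(chunks) do
    local status, a, b = parser:parse(chunk)
    if status ~= "OK" then
      return status, a, b        -- "ERROR", message or handler-supplied values
    end
    root_seen = a and true or false
  end
  return "OK", root_seen
end

-- Hypothetical stand-in for a real parser; it only mimics the return values.
local stub = { seen = "" }
function stub:parse(str)
  self.seen = self.seen .. str
  if str:find("<<", 1, true) then
    return "ERROR", "unexpected '<'"
  end
  return "OK", self.seen:find("</doc>", 1, true) ~= nil
end

local status, root_seen = feed_chunks(stub, { "<doc>he", "llo</d", "oc>" })
```

With a real parser, the object returned by xparser.create() would take the place of the stub.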

event list (parameters)

START
This is returned on the first call to parse after init() or reset().
XML (attributes)
This is returned on parsing an <?xml ... ?> declaration.
START_ELEMENT (name, attributes)
This is returned on parsing the opening tag of an element, i.e. <foo ... >
END_ELEMENT (name)
This is returned on parsing the closing tag of an element, i.e. </foo>
EMPTY_ELEMENT (name, attributes)
This is returned on parsing an empty element, i.e. <foo a='b' />
TEXT (string)
This is returned on parsing text content (for the current element).
CDATA (string)
This is returned on parsing a CDATA node.
COMMENT (string)
This is returned when we have parsed a comment (<!--...-->) node.
PI (name, body)
This is returned when we have parsed a PI (<?...?>) node.

xparser module functions

xparser.create(event handlers,context,...)
This creates a new parser object (table), with its own state.
Parameter Type Meaning
event handlers table table of event handler functions, indexed by event name.
context (any) optional context object. This will be passed to each event handler when it is called.
... (any) additional parameters for the init event handler.

Return values

Returns a new parser object, or nil, error message if an error occurs.

Details

If a table of event handlers is supplied, the table's init function will be called, if it exists, with the supplied context, followed by any additional parameter(s) passed to create(). If the first value returned from the init event handler is false, create() will return nil, followed by any second return value (ie. error message). Otherwise, if the first value returned from the init event handler is anything other than nil, it will replace the supplied context.
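
The three-way rule for the init handler's first return value can be restated in code. The sketch below is not xparser's source; resolve_context is a hypothetical helper, and passing the parser object as the handler's first argument is an assumption taken from the handler signatures documented later in this file.

```lua
-- Illustrative restatement of the documented rule, NOT xparser's own code.
-- init_handler is the optional init function from the handler table.
local function resolve_context(init_handler, parser, context, ...)
  if not init_handler then
    return true, context          -- no handler: keep the caller's context
  end
  local first, second = init_handler(parser, context, ...)
  if first == false then
    return nil, second            -- failure: nil followed by the error message
  elseif first ~= nil then
    return true, first            -- handler supplied a replacement context
  end
  return true, context            -- handler returned nil: keep the context
end

-- Three hypothetical init handlers, one per documented outcome.
local function failing()   return false, "bad configuration" end
local function replacing() return { fresh = true } end
local function silent()    end
```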

xparser object methods

All these methods take a parser object as their first parameter, so can be called as parser:func()

parser:init(callbacks,context,...)
Initialise a parser object's callbacks and context.
Parameter Type Meaning
event handlers table table of event handler functions, indexed by event name.
context (any) optional context object. This will be passed to each event handler when it is called.
... (any) additional parameters for the init event handler.

Return values

Returns true on success. If the initialisation failed, returns nil, error message.

Details

init() takes the same parameters as xparser.create(), and they have the same meaning. init() can be called at any time to re-configure a parser instance with new event handlers and context. The init event handler will be called, if it exists, with the supplied context and any additional parameter(s). If the first value returned from the init event handler is false, init() will return nil, followed by the second return value (error message). Otherwise, if the first value returned from the init event handler is anything other than nil, it will replace the supplied context.

parser:reset(...)
Reset a parser object's internal state and context.

Parameters

Parameter Type Meaning
... (any) optional parameters for the reset event handler.

Return values

All return values from the reset event handler, if present. Otherwise true.

Details

reset() can be called at any time to reset a parser instance and context. If there is a reset event handler, it will be called, with the current context, followed by the parameter(s) passed to parser:reset().

parser:parse(str)
Parses the given text, calling event handlers as required.

Parameters

Parameter Type Meaning
str string all or part of an xml document. parse() can be called repeatedly with successive blocks of text.

Return values

"OK" followed by a boolean flag, "ERROR" followed by an error message, or the first three values returned by an event handler.

Details

parse() consumes the input string, calling the appropriate event handler on XML parsing events, and returns the string "OK" unless it encounters a parsing error or an event handler returns a value. On error, parse() returns "ERROR" and an error message. Event handlers can cause parse() to return any value they choose by returning a value that evaluates to true. If any event handler returns a true value, parse() is interrupted and returns the first three values from that event handler. Otherwise, parse() returns "OK", followed by a boolean flag that indicates whether a document element has been parsed. Note: there is no "end of document" event or state, as an XML document can have unlimited trailing comments and PIs.
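
As an illustration of interrupting parse() from a handler, the sketch below builds a handler table that returns a true value when a chosen element closes. make_stop_handlers and the "FOUND" status string are hypothetical, and the (parser, context, qname) argument order is assumed from the END_ELEMENT signature documented in this file.

```lua
-- Hypothetical handler factory: stop parsing when the target element closes.
local function make_stop_handlers(target)
  return {
    END_ELEMENT = function(parser, context, qname)
      if qname == target then
        return "FOUND", qname     -- true value: parse() would return these
      end
      -- returning nothing lets parsing continue
    end,
  }
end

local handlers = make_stop_handlers("item")

-- Intended use with a real parser (not executed here):
--   local parser = xparser.create(handlers)
--   local status, name = parser:parse(xml_text)
```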

parser:get_state()
Returns a parser object's internal state.

Parameters

none

Return values

Returns a string indicating state; either "ERROR" or an internal code. A state of "START" indicates that the parser has not yet processed any data. A state of "ROOT" indicates that the parser is at the root of the document, i.e. has no open elements of any kind. Any other state indicates that the input XML document is unfinished. The state "ROOT" may be encountered more than once if the document is parsed in chunks; for example, the state is "ROOT" immediately after parsing the declaration.
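
Because "ROOT" can also be reported before the root element has appeared, a completeness check probably needs more than the state string alone. A minimal sketch, assuming the caller has kept the boolean flag that parse() returns after "OK"; document_complete, the stub objects and the "SOME_OTHER_STATE" string are made up for illustration:

```lua
-- True only when the root element has been seen AND nothing is left open.
local function document_complete(parser, root_seen)
  return root_seen == true and parser:get_state() == "ROOT"
end

-- Hypothetical stand-ins for a real parser object.
local at_root     = { get_state = function() return "ROOT" end }
local mid_element = { get_state = function() return "SOME_OTHER_STATE" end }
```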

parser:get_pos()
Returns the parser's position within the source document.

Parameters

none

Return values

Returns three numbers: character count, line count, and column in the current line.
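
One likely use of get_pos() is to attach a location to an error message returned by parse(). The helper below is a sketch; describe_error and the stub are hypothetical, and the stub simply returns three fixed numbers in the documented order.

```lua
-- Build a readable message from an error string and the parser position.
local function describe_error(parser, message)
  local chars, line, column = parser:get_pos()
  return string.format("%s at line %d, column %d (character %d)",
                       tostring(message), line, column, chars)
end

-- Hypothetical stand-in for a real parser object.
local stub = { get_pos = function() return 57, 3, 12 end }
local text = describe_error(stub, "unexpected '<'")
```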

parser:get_depth()
Returns the current element nesting depth.

Parameters

none

Return values

number : the current depth of nesting.

parser:get_context()
Returns the current parser context.

Parameters

none

Return values

context : the current parser context.

parser:get_events()
Returns the current event handler table.

Parameters

none

Return values

table : the current event handler table.

xparser object variables

_env
environment table. INTERNAL USE ONLY!

event handlers

Event handlers are declared as a table of functions, indexed by event name. They are all optional. There are two special events, init() and reset(). They are provided for context management, and are called by parser:init() and parser:reset(). The other events are generated during parsing, and are called from parser:parse(). Any return values from an XML event handler will be returned to the caller of parser:parse().

special event handlers

INIT(parser,context,...)
This is called by xparser.create() and parser:init(). It allows context initialisation and/or creation.

Parameters

Parameter Type Meaning
parser (xparser) the parser object.
context (any) context supplied to xparser.create() or parser:init()
... (any) additional parameters supplied to xparser.create() or parser:init()

Return values

INIT() should return the initialised context, if anything. A return value that is neither nil nor false becomes the new context. If INIT() returns false when called from xparser.create(), xparser.create() returns nil, followed by the second return value; i.e. no parser is constructed. Note: the value following a false return is used as the error message.

RESET(parser,context,...)
This is called by parser:reset(). It allows context reset.

Parameters

Parameter Type Meaning
parser (xparser) the parser object.
context (any) current context, as supplied to parser:init(), or returned by the INIT() callback.
... (any) additional parameters supplied to parser:reset()

Return values

Any values returned by the RESET callback will be returned to the caller of parser:reset().

TERM(parser,context,...)
This is called by parser:destroy(). It allows context destruction.

Parameters

Parameter Type Meaning
parser (xparser) the parser object.
context (any) current context, as supplied to parser:init(), or returned by the INIT() callback.
... (any) additional parameters supplied to parser:destroy()

Return values

Any values returned by the TERM callback will be returned to the caller of parser:destroy().

xml event handlers

START(parser, context)
This is called before parsing the first character.
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
XML(parser, context, attributes)
This is called on parsing an <?xml ... ?> declaration.
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
attributes table attributes, as qname=value pairs.
START_ELEMENT(parser, context, qname, attributes)
This is called on parsing the opening tag of an element, i.e. <foo ... >
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
qname string qualified name of element
attributes table attributes, as qname=value pairs.
END_ELEMENT(parser, context, qname)
This is called on parsing the closing tag of an element, i.e. </foo>
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
qname string qualified name of element
EMPTY_ELEMENT(parser, context, qname, attributes)
This is called on parsing an empty element, i.e. <foo a='b' />
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
qname string qualified name of element
attributes table attributes, as qname=value pairs.

Details

EMPTY_ELEMENT will be returned in place of START_ELEMENT and END_ELEMENT, ie. an empty element will not return a START_ELEMENT / END_ELEMENT pair.
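
A caller that prefers to treat every element uniformly could wrap its handler table so that EMPTY_ELEMENT is forwarded as a START_ELEMENT / END_ELEMENT pair. This is a sketch of one possible adapter, not part of xparser; expand_empty_elements is a hypothetical name and the handler argument order is assumed from the signatures documented above.

```lua
-- Return a copy of `handlers` whose EMPTY_ELEMENT calls START_ELEMENT and
-- then END_ELEMENT, stopping early if START_ELEMENT returns a true value.
local function expand_empty_elements(handlers)
  local wrapped = {}
  for name, fn in pairs(handlers) do wrapped[name] = fn end
  wrapped.EMPTY_ELEMENT = function(parser, context, qname, attributes)
    if handlers.START_ELEMENT then
      local r1, r2, r3 = handlers.START_ELEMENT(parser, context, qname, attributes)
      if r1 then return r1, r2, r3 end
    end
    if handlers.END_ELEMENT then
      return handlers.END_ELEMENT(parser, context, qname)
    end
  end
  return wrapped
end

-- Exercise the adapter directly, without a parser.
local log = {}
local h = expand_empty_elements({
  START_ELEMENT = function(_, _, qname) log[#log + 1] = "start:" .. qname end,
  END_ELEMENT   = function(_, _, qname) log[#log + 1] = "end:" .. qname end,
})
h.EMPTY_ELEMENT(nil, nil, "bar", {})
```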

TEXT(parser, context, text)
This is called on parsing text content (for the current element).
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
text string normalised text content

Details

TEXT may be returned multiple times within the same element, if text is interspersed with other nodes (eg comments). The text value will have been whitespace-normalised before being returned to the caller.
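
Since TEXT can fire several times for one element, a handler that wants the element's full text has to accumulate the pieces. The sketch below keeps one buffer per open element in the context; the collect table, the context layout and the choice to join pieces with a single space are all assumptions made for illustration.

```lua
-- Handlers that gather all TEXT pieces of each element into context.result.
local collect = {
  START_ELEMENT = function(parser, context, qname)
    context.stack[#context.stack + 1] = {}      -- new buffer for this element
  end,
  TEXT = function(parser, context, text)
    local buf = context.stack[#context.stack]
    if buf then buf[#buf + 1] = text end
  end,
  END_ELEMENT = function(parser, context, qname)
    local buf = table.remove(context.stack)     -- pop this element's buffer
    context.result[qname] = table.concat(buf, " ")
  end,
}

-- Simulate the events for <p>one<!-- note -->two</p> by calling the
-- handlers directly; a real parser would make these calls from parse().
local ctx = { stack = {}, result = {} }
collect.START_ELEMENT(nil, ctx, "p")
collect.TEXT(nil, ctx, "one")
collect.TEXT(nil, ctx, "two")
collect.END_ELEMENT(nil, ctx, "p")
```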

CDATA(parser, context, text)
This is called on parsing a CDATA node.
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
text string literal CDATA content

Details

CDATA may be returned multiple times within the same element, if CDATA is interspersed with other child nodes.

COMMENT(parser, context, text)
This is called when we have parsed a comment (<!--...-->) node.
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
text string literal comment content

Details

COMMENT may be returned multiple times within the same element, if comments are interspersed with other child nodes.

PI(parser, context, name, body)
This is called when we have parsed a PI (<?...?>) node.
Parameter Type Meaning
parser (xparser) the parser object.
context (any) current parser context.
name string PI name
body string PI body

Details

PI may be returned multiple times within the same element, if PI elements are interspersed with other child nodes.

sample event handlers

Example 1.


event_handler = {
  init=         function(parser,context,...)              print("init") end,
  reset=        function(parser,context,...)              print("reset") end,
  START=        function(parser,context)                  print("START") end,
  XML=          function(parser,context,attributes)       print("XML") end,
  START_ELEMENT=function(parser,context,name,attributes)  print("START_ELEMENT",name) end,
  END_ELEMENT=  function(parser,context,name)             print("END_ELEMENT",name) end,
  EMPTY_ELEMENT=function(parser,context,name,attributes)  print("EMPTY_ELEMENT",name) end,
  TEXT=         function(parser,context,text)             print("TEXT",text) end,
  COMMENT=      function(parser,context,text)             print("COMMENT",text) end,
  CDATA=        function(parser,context,text)             print("CDATA",text) end,
  PI=           function(parser,context,name,body)        print("PI",name,body) end,
}


parser=xparser.create(event_handler)

xml=[==[
<?xml version='1.0'?>
<doc >doctxt<!-- comment at doc level -->
  <text>doctxt tag    ttt </text>
  <foo >
    <bar /><!-- comment at foo level -->
    <bar max='1' />
    <bar />
    <bar max='2' />
  </foo>
  <foo id='xx1'>foo2a
    <attributes id='zz'>dummy attr tag</attributes>
    <bar id='xx2' />
    <bar id='xx3' max='3' />
    <bar ><![CDATA[cdata at bar level]]></bar>
    <bar max='4' ><?ProgInst blah blah blah ?></bar>
    <bar id='xx4' max='5' />
    foo2b
  </foo>
</doc>
<?ProgInst after doc element ?>
<!-- trailing comment -->
]==]

print(parser:parse(xml))


Example 2. xml2table.lua

xml2table.lua is a module that converts an XML stream into a referenceable Lua table. The table created by xml2table is used by the SOAP parser.

xml2table is a callback table for the simple XML parser. Construct a parser by calling xparser.create(xml2table), then parse XML by calling parser:parse(xml string), which returns a Lua table built from the supplied XML. An XML document is considered to be a nested tree of nodes, each node being one of element, text, CDATA, comment or PI. Only element nodes can contain child nodes. Each node parsed is converted to a Lua table with the following members:

  node type   table
  element     type="element",
              name=element qualified name (string),
              attributes=table of attributes of the form {name=value}

  text        type="text",
              value=text content (string)

  CDATA       type="cdata",
              value=CDATA contents (string)

  comment     type="comment",
              value=comment (string)

  PI          type="pi",
              name=PI qualified name (string),
              value=PI contents (string)

  nodes are inserted in their parent node in the
  order they are encountered (1..n).

  For convenience, "element" nodes have the following additional members

  [name] (table ref) : reference to the (first) child element node with
                      that name
  pi (table ref) : reference to the first child PI node (if present)
  comment (string) : content of the first child comment node (if present)
  text (string) : content of the first child text node (if present)
  cdata (string) : content of the first child CDATA node (if present)


  The table returned from :parse() has the following structure

  {
    xml = table of attributes from the "<?xml" element,
            as {name=value} pairs
    [1]  = root (element) node of xml document.
    [root element name] = [1]
        ...
  }

Each node contains (at least) these fields:

  -- sample code --
require("xml2table")

local xml="<foo attr1='attribute one'><bar>bar text</bar><baz/><baz/></foo>"
local parser=xparser.create(xml2table)
local st,doc,err = parser:parse(xml)

if st == "OK" then end -- incomplete document: more input is needed
if st ~= "DONE" or not doc then end -- error case

print(doc.foo.bar.text) -- content of the first text node in <bar>

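-- resulting table (schematic: a bare [n] value is a reference to the
-- node stored at that index of the same table)
doc = {
  xml = nil,
  [1] = {
    type="element",
    name="foo",
    attributes={attr1="attribute one"},
    [1] = {
      type="element",
      name="bar",
      text="bar text",
      [1] = { type="text", value="bar text" }
    },
    [2] = {
      type="element",
      name="baz",
    },
    [3] = {
      type="element",
      name="baz",
    },
    ["bar"] = [1],   -- first child element named "bar"
    ["baz"] = [2],   -- first child element named "baz"
  },
  ["foo"] = [1],     -- root element node
}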
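
The node layout described above lends itself to a simple recursive walk. The sketch below builds the same structure by hand (so it runs without xml2table) and lists element names in document order; element_names is a hypothetical helper, and the hand-built table is only assumed to match what xml2table would produce.

```lua
-- Collect the names of all element nodes, depth first, in document order.
local function element_names(node, out)
  out = out or {}
  for _, child in ipairs(node) do            -- children are stored at 1..n
    if child.type == "element" then
      out[#out + 1] = child.name
      element_names(child, out)
    end
  end
  return out
end

-- Hand-built equivalent of the documented table for
-- <foo attr1='attribute one'><bar>bar text</bar><baz/><baz/></foo>
local bar  = { type = "element", name = "bar", text = "bar text",
               { type = "text", value = "bar text" } }
local baz1 = { type = "element", name = "baz" }
local baz2 = { type = "element", name = "baz" }
local foo  = { type = "element", name = "foo",
               attributes = { attr1 = "attribute one" },
               bar, baz1, baz2,
               bar = bar, baz = baz1 }       -- name shortcuts to first match
local doc  = { foo, foo = foo }

local names = element_names(doc)
```

The name-keyed shortcuts (doc.foo, foo.bar) reach only the first child with a given name, so a walk over the numeric indices is needed to visit both baz elements.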

Lua libraries required