xparser is a streaming, callback-mode XML parser written for the Lua scripting language. It is a simple parser, not a fully W3C-compliant validating parser: it currently supports only 8-bit ASCII input and simply skips over any DOCTYPE declaration. The parser is not itself namespace-aware; namespace-qualified names (qnames) are passed to the caller as-is. Namespace handling is demonstrated in the sample "xml2table" module.
The basic parser is a simple state machine with only three public functions: init(), parse() and reset(). init() initialises the parser, and reset() resets its internal state. parse() processes a chunk of input data and returns either "OK", "ERROR", or a callback-supplied value. "OK" is a request for more input data, while "ERROR" signals a parsing error. Note: there is no "end of document" event or return code, as an XML document can have trailing comments and PIs. parse() does, however, return a boolean flag following "OK", which indicates whether the root element has been parsed. The parser is constructed (or initialised) with a table of functions to handle parser events. These functions are called from parse() and can interrupt it, supplying any return value(s). For example, the xml2table event handlers return "DONE" when the root element has been parsed, and "ERROR" when namespace errors are detected.
An XML parser is constructed by calling xparser.create() with a table of functions to handle parsing events. create() returns a new parser instance. Each created parser instance has its own environment, allowing simultaneous parsing by multiple parsers. The parser's :parse() member function is then called repeatedly with chunks of xml text, and the event handlers will be called as xml elements are processed. parse() returns the string "OK" on successive calls until a complete xml document has been parsed, an event handler returns a non-nil value, or an error occurs.
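The calling pattern described above can be sketched as a loop that feeds chunks until parse() reports something other than "OK". The names `make_stub_parser` and `feed` are hypothetical, and the stub is only a stand-in that mimics the documented return convention so that the sketch runs without xparser loaded. With the real module, the parser object would come from xparser.create().

```lua
-- Stand-in for a parser instance (hypothetical): it only imitates the
-- documented return values of parse() and does no real XML parsing.
local function make_stub_parser()
  local seen_root = false
  local stub = {}
  function stub:parse(chunk)
    if chunk:find("<<", 1, true) then
      return "ERROR", "unexpected '<'"
    end
    if chunk:find("</doc>", 1, true) then
      seen_root = true
    end
    return "OK", seen_root
  end
  return stub
end

-- Feed chunks one by one, following the documented convention:
--   "OK", flag    -> more input wanted; flag says whether the root was parsed
--   "ERROR", msg  -> parsing error
--   anything else -> value(s) supplied by an event handler
local function feed(parser, chunks)
  local root_done = false
  for _, chunk in ipairs(chunks) do
    local status, a, b = parser:parse(chunk)
    if status == "ERROR" then
      return nil, a
    elseif status ~= "OK" then
      return status, a, b
    end
    root_done = a
  end
  return "OK", root_done
end

local status, root_done = feed(make_stub_parser(),
                               { "<doc><a/>", "</doc>", "<!-- tail -->" })
print(status, root_done)
```

Because there is no "end of document" event, the caller decides when the input is exhausted; the boolean that follows "OK" only says whether the root element has been closed so far.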
Parameter | Type | Meaning |
---|---|---|
event handlers | table | table of event handler functions, indexed by event name. |
context | (any) | optional context object. This will be passed as the first parameter to each event handler. |
... | (any) | additional parameters for the init event handler. |
Returns a new parser object, or nil, error message if an error occurs.
If a table of event handlers is supplied, the table's init function will be called, if it exists, with the supplied context, followed by any additional parameter(s) passed to create(). If the first value returned from the init event handler is false, create() will return nil, followed by any second return value (ie. error message). Otherwise, if the first value returned from the init event handler is anything other than nil, it will replace the supplied context.
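The rule for the init handler's return values can be modelled in a few lines. This is only an illustration of the behaviour described above, not xparser's own code; the function name `apply_init_result` is hypothetical, and it takes the handler's first two return values as arguments so that no assumption about the handler's parameter list is needed.

```lua
-- Model of the documented rule only (hypothetical helper).
-- first, second: the first two values returned by the init event handler.
-- Returns ok, value: value is the context to use, or the error message.
local function apply_init_result(supplied_context, first, second)
  if first == false then
    return false, second           -- create() would return nil, <second>
  elseif first ~= nil then
    return true, first             -- the handler's value replaces the context
  end
  return true, supplied_context    -- nil: the supplied context is kept
end

local ok1, ctx1 = apply_init_result("original", nil)
local ok2, ctx2 = apply_init_result("original", { depth = 0 })
local ok3, err3 = apply_init_result("original", false, "missing option")
print(ok1, ctx1, ok2, ctx2.depth, ok3, err3)
```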
All these methods take a parser object as their first parameter, so can be called as parser:func()
Parameter | Type | Meaning |
---|---|---|
event handlers | table | table of event handler functions, indexed by event name. |
context | (any) | optional context object. This will be passed to each event handler when it is called. |
... | (any) | additional parameters for the init event handler. |
Returns true on success. If the initialisation failed, returns nil, error message.
init() takes the same parameters as xparser.create(), and they have the same meaning. init() can be called at any time to re-configure a parser instance with new event handlers and context. The init event handler will be called, if it exists, with the supplied context and any additional parameter(s). If the first value returned from the init event handler is false, init() will return nil, followed by the second return value (error message). Otherwise, if the first value returned from the init event handler is anything other than nil, it will replace the supplied context.
Parameter | Type | Meaning |
---|---|---|
... | (any) | optional parameters for the reset event handler. |
Returns all values returned by the reset event handler, if present; otherwise returns true.
reset() can be called at any time to reset a parser instance and context. If there is a reset event handler, it will be called, with the current context, followed by the parameter(s) passed to parser:reset().
Parameter | Type | Meaning |
---|---|---|
str | string | all or part of an xml document. parse() can be called repeatedly with successive blocks of text. |
parse() consumes the input string, calling the appropriate event handler on XML parsing events, and returns the string "OK" unless it encounters a parsing error or an event handler returns a value. On error, parse() returns "ERROR" and an error message. Event handlers can cause parse() to return any value they choose by returning a value that evaluates to true. If any event handler returns a true value, parse() is interrupted and returns the first three values from that event handler. Otherwise, parse() returns "OK", followed by a boolean flag indicating whether a document element has been parsed. Note: there is no "end of document" event or state, as an XML document can have unlimited trailing comments and PIs.
none
Returns a string indicating the parser state: either "ERROR" or an internal code. A state of "START" indicates that the parser has not yet processed any data. A state of "ROOT" indicates that the parser is at the root of the document, i.e. has no open elements of any kind. Any other state indicates that the input XML document is unfinished. The state "ROOT" may be encountered more than once if the document is parsed in chunks; for example, the state is "ROOT" immediately after parsing the declaration.
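Since "ROOT" can occur both before and after the root element, a caller may want to combine the state string with the boolean flag that parse() returns after "OK". The helper below is a sketch under that assumption; `describe_state` is a hypothetical name, and the state string and flag are passed in as plain arguments rather than read from a parser.

```lua
-- Hypothetical helper: turns a state string plus the "root parsed" flag
-- from parse() into a short description.
local function describe_state(state, root_parsed)
  if state == "ERROR" then
    return "error"
  elseif state == "START" then
    return "no input processed"
  elseif state == "ROOT" then
    if root_parsed then
      return "root element complete"
    end
    return "at document root, root element not yet parsed"
  end
  -- Any other internal code means a construct is still open.
  return "unfinished"
end

print(describe_state("START", false))
print(describe_state("ROOT", false))
print(describe_state("ROOT", true))
```

Even in the "root element complete" case, more input may still follow, because trailing comments and PIs are allowed.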
none
Returns three numbers: character count, line count, and column in the current line.
none
number : the current depth of nesting.
none
context : the current parser context.
none
context : the current parser context.
Event handlers are declared as a table of functions, indexed by event name. They are all optional. There are two special events, init() and reset(). They are provided for context management, and are called by parser:init() and parser:reset(). The other events are generated during parsing, and are called from parser:parse(). Any return values from an XML event handler will be returned to the caller of parser:parse().
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | context supplied to xparser.create() or parser:init() |
... | (any) | additional parameters supplied to xparser.create() or parser:init() |
INIT() should return the initialised context, if any. A return value that is not nil or false becomes the new context. If INIT() returns false when called from xparser.create(), xparser.create() returns nil followed by the second return value; i.e. no parser is constructed. Note: the value following a false return is used as an error message.
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current context, as supplied to parser:init(), or returned by the INIT() callback. |
... | (any) | additional parameters supplied to parser:reset() |
Any values returned by the RESET callback will be returned to the caller of parser:reset().
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current context, as supplied to parser:init(), or returned by the INIT() callback. |
... | (any) | additional parameters supplied to parser:destroy() |
Any values returned by the TERM callback will be returned to the caller of parser:destroy().
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
attributes | table | attributes, as qname=value pairs. |
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
attributes | table | attributes, as qname=value pairs. |
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
qname | string | qualified name of element |
attributes | table | attributes, as qname=value pairs. |
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
qname | string | qualified name of element |
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
qname | string | qualified name of element |
attributes | table | attributes, as qname=value pairs. |
EMPTY_ELEMENT is raised in place of START_ELEMENT and END_ELEMENT; i.e. an empty element does not produce a START_ELEMENT / END_ELEMENT pair.
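A handler table that wants to treat empty elements like ordinary ones can forward EMPTY_ELEMENT to its own START_ELEMENT and END_ELEMENT handlers. The sketch below assumes the (parser, context, qname, attributes) parameter order given in the tables above, and it calls the handler directly to simulate what the parser would do for `<bar max='1' />`; the `log` table is only there to make the result visible.

```lua
local log = {}
local handlers = {}

function handlers.START_ELEMENT(parser, context, qname, attributes)
  log[#log + 1] = "start:" .. qname
end

function handlers.END_ELEMENT(parser, context, qname)
  log[#log + 1] = "end:" .. qname
end

-- Forward an empty element to the start/end pair. Any return value of
-- START_ELEMENT is ignored in this sketch.
function handlers.EMPTY_ELEMENT(parser, context, qname, attributes)
  handlers.START_ELEMENT(parser, context, qname, attributes)
  return handlers.END_ELEMENT(parser, context, qname)
end

-- Simulated event for <bar max='1' /> (no parser object is involved here).
handlers.EMPTY_ELEMENT(nil, nil, "bar", { max = "1" })
print(table.concat(log, ","))
```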
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
text | string | normalised text content |
TEXT may be raised multiple times within the same element if text is interspersed with other nodes (e.g. comments). The text value is whitespace-normalised before being passed to the handler.
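Because TEXT can fire several times for one element, a handler that wants the whole text usually has to accumulate the pieces. The following is a minimal sketch for a single, non-nested element, assuming the (parser, context, ...) parameter order from the tables above and a table used as the context. The field names `buffer` and `result`, and the choice of joining pieces with a space, are assumptions made for illustration; the handlers are called directly to simulate the events for `<doc>doctxt<!-- c -->more text</doc>`.

```lua
local handlers = {}

function handlers.START_ELEMENT(parser, context, qname, attributes)
  context.buffer = {}                      -- start collecting text pieces
end

function handlers.TEXT(parser, context, text)
  context.buffer[#context.buffer + 1] = text
end

function handlers.END_ELEMENT(parser, context, qname)
  context.result = table.concat(context.buffer, " ")
end

-- Simulated event sequence; the comment between the two text runs is
-- what splits the text into two TEXT events.
local context = {}
handlers.START_ELEMENT(nil, context, "doc", {})
handlers.TEXT(nil, context, "doctxt")
handlers.TEXT(nil, context, "more text")
handlers.END_ELEMENT(nil, context, "doc")
print(context.result)
```

Nested elements would need a stack of buffers rather than a single one.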
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
text | string | literal CDATA content |
CDATA may be raised multiple times within the same element if CDATA sections are interspersed with other child nodes.
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
text | string | literal comment content |
COMMENT may be raised multiple times within the same element if comments are interspersed with other child nodes.
Parameter | Type | Meaning |
---|---|---|
parser | (xparser) | the parser object. |
context | (any) | current parser context. |
name | string | PI name |
body | string | PI body |
PI may be raised multiple times within the same element if PIs are interspersed with other child nodes.

    event_handler = {
        init          = function(context, ...) print("init") end,
        reset         = function(context, ...) print("reset") end,
        START         = function(context, attributes) print("START") end,
        XML           = function(context, attributes) print("XML") end,
        START_ELEMENT = function(context, name, attributes) print("START_ELEMENT", name) end,
        END_ELEMENT   = function(context, name) print("END_ELEMENT", name) end,
        EMPTY_ELEMENT = function(context, name, attributes) print("EMPTY_ELEMENT", name) end,
        TEXT          = function(context, text) print("TEXT", text) end,
        COMMENT       = function(context, text) print("COMMENT", text) end,
        CDATA         = function(context, text) print("CDATA", text) end,
        PI            = function(context, name, body) print("PI", name, body) end,
    }

    parser = xparser.create(event_handler)

    xml = [==[
    <?xml version='1.0'?>
    <doc >doctxt<!-- comment at doc level -->
      <text>doctxt tag ttt </text>
      <foo >
        <bar /><!-- comment at foo level -->
        <bar max='1' />
        <bar />
        <bar max='2' />
      </foo>
      <foo id='xx1'>foo2a
        <attributes id='zz'>dummy attr tag</attributes>
        <bar id='xx2' />
        <bar id='xx3' max='3' />
        <bar ><![CDATA[cdata at bar level]]></bar>
        <bar max='4' ><?ProgInst blah blah blah ?></bar>
        <bar id='xx4' max='5' />
        foo2b
      </foo>
    </doc>
    <?ProgInst after doc element ?>
    <!-- trailing comment -->
    ]==]

    print(parser:parse(xml))

xml2table.lua is a module that converts an XML stream into a referenceable Lua table. The table created by xml2table is used by the SOAP parser.
xml2table is a callback table for the xparser parser. Construct a parser by calling xparser.create(xml2table), then parse XML by calling parser:parse(xml string), which returns a Lua table built from the supplied XML. An XML document is considered to be made up of a nested tree of nodes, each node being one of element, text, CDATA, comment or PI. Only element nodes can contain child nodes. Each node parsed is converted to a Lua table, with the following members:
Node type | Table members |
---|---|
element | type="element", name=element qualified name (string), attributes=table of attributes of the form {name=value} |
text | type="text", value=text content (string) |
CDATA | type="cdata", value=CDATA contents (string) |
comment | type="comment", value=comment (string) |
PI | type="pi", name=PI qualified name (string), value=PI contents (string) |
Nodes are inserted in their parent node in the order they are encountered (1..n). For convenience, "element" nodes have the following additional members:
Member | Type | Meaning |
---|---|---|
[name] | table ref | reference to the (first) child element node with that name |
pi | table ref | reference to the first child PI node (if present) |
comment | string | content of the first child comment node (if present) |
text | string | content of the first child text node (if present) |
cdata | string | content of the first child CDATA node (if present) |
The table returned from :parse() has the following structure:

    {
        xml = table of attributes from the "<?xml" element, as {name=value} pairs
        [1] = root (element) node of xml document.
        [root element name] = [1]
        ...
    }

Each node contains (at least) these fields:

    -- sample code --
    require("xml2table")

    local xml = "<foo attr1='attribute one'><bar>bar text</bar><baz/><baz/></foo>"
    local parser = xparser.create(xml2table)
    local st, doc, err = parser:parse(xml)
    if st == "OK" then end                 -- incomplete document
    if st ~= "DONE" or not doc then end    -- error case
    print(doc.foo.bar.text)

    --[==[ resulting table; "= [n]" denotes a reference to child n
    doc = {
        xml = nil,
        [1] = {
            type="element", name="foo",
            attributes={attr1="attribute one"},
            [1] = {
                type="element", name="bar", text="bar text",
                [1] = { type="text", value="bar text" }
            },
            [2] = { type="element", name="baz" },
            [3] = { type="element", name="baz" },
            ["bar"] = [1],
            ["baz"] = [2],
        },
        ["foo"] = [1],
    }
    ]==]

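Because the by-name members refer only to the first child with a given name, visiting every child means iterating the numbered entries. The sketch below builds the sample structure by hand, so it runs without xparser or xml2table loaded, and counts element nodes by name; `count_elements` is a hypothetical helper, not part of the module.

```lua
-- Hand-built copy of the structure described above for
-- <foo attr1='attribute one'><bar>bar text</bar><baz/><baz/></foo>
local bar  = { type = "element", name = "bar", text = "bar text",
               { type = "text", value = "bar text" } }
local baz1 = { type = "element", name = "baz" }
local baz2 = { type = "element", name = "baz" }
local foo  = { type = "element", name = "foo",
               attributes = { attr1 = "attribute one" },
               bar, baz1, baz2,            -- children in document order
               bar = bar, baz = baz1 }     -- by-name refs to the first child
local doc  = { foo, foo = foo }

-- Walk the numbered children recursively and count elements by name.
local function count_elements(node, counts)
  counts = counts or {}
  for _, child in ipairs(node) do
    if child.type == "element" then
      counts[child.name] = (counts[child.name] or 0) + 1
      count_elements(child, counts)
    end
  end
  return counts
end

local counts = count_elements(doc)
print(counts.foo, counts.bar, counts.baz)
print(doc.foo.bar.text)
```

The second `baz` element is reachable only through `doc.foo[3]`, since `doc.foo.baz` points at the first one.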