lazy xml teaser

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

lazy xml teaser

Jay Carlson
Here's the start of the docs for something I've been working on. Lemme know what you think.

lazynode: lazily construct XML trees

Why is this interesting?  Trees are a natural data model to process
XML documents.  A simple tree implementation reads the entire document
into memory at once.  For large documents, this can be too expensive.
Although callback and event APIs are memory-efficient, they are
painful to program to.

Lazynode constructs an conventional-seeming XML tree as its contents
are referenced, parsing more source document as necessary to fill out
the contents on demand.

Given stylized iterators, memory consumption can be limited to
particular subtrees.  Consider:

for i,child in xpairs_consume(lz) do
  if child.attr.externalref then
    print(child.name)
    table.insert(references, child)
  end
end

where the _consume family of iterators nils out elements from their
parent before returning them.  If the body of the loop does not retain
a reference to the child elsewhere, it will become eligible for
garbage collection as soon as the next iteration begins.

Although not currently implemented, other consuming forms may interact
with the XML parser for greater savings:

<document>
  <firstname>Jay</firstname>
  <lastname>Carlson</lastname>
<bodytext>Spending too much time listening to <ref>In Utero</ref> can be [...]
  <title>I Think I'm DOM</title>
</document>

lastname, title = xmlextract.strings_consume(tree, "lastname", "title")

The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.

Implementation details:

Given the following XML:

<paragraph justify='centered'>first child<b>bold</b>second child</paragraph>

A lazynode will appear to have the following contents:

lz = {name="paragraph", attr={justify="centered"},
  "first child",
  {name="b", "bold", n=1},
  "second child",
  n=3
}

However, on the start of parsing, the actual underlying table will contain:

lz = {name="paragraph", attr={justify="centered"},
  _read_so_far=0
}

After a reference to lz[1], it will contain:

lz = {name="paragraph", attr={justify="centered"},
  "first child",
  _read_so_far=1
}

  And after a reference to lz[2]:

lz = {name="paragraph", attr={justify="centered"},
  "first child",
  {name="b", _read_so_far=0}
}

Note that the child is read lazily as well.  However, a reference to lz[3]
will force all of lz[2] to be completed:


lz = {name="paragraph", attr={justify="centered"},
  "first child",
  {name="b", "bold", n=1}
  "second child",
  _read_so_far=3
}

Reading either lz[4] (which is nil) or lz.n will force the completion
of the node.

Note that reading from lz.n will force the remainder of the node to be
read, as we don't know how long it's going to be until it closes.

In general, calling normal iterators on a lazy node is a bad idea, as
all kinds of goo may be present.

Under the surface, this is implemented by pulling events from a queue
generated from expat.


Reply | Threaded
Open this post in threaded view
|

RE: lazy xml teaser

Stewart
Title: RE: lazy xml teaser

Very nice, and would handle some jabber-xml stuff I'm already doing quite nicely.

Need any help with this?

-joe

-----Original Message-----
From: Jay Carlson [[hidden email]]
Sent: Sunday, February 22, 2004 10:41 PM
To: 'Lua list'
Subject: lazy xml teaser


Here's the start of the docs for something I've been working on.  Lemme
know what you think.

lazynode: lazily construct XML trees

Why is this interesting?  Trees are a natural data model to process
XML documents.  A simple tree implementation reads the entire document
into memory at once.  For large documents, this can be too expensive.
Although callback and event APIs are memory-efficient, they are
painful to program to.

Lazynode constructs an conventional-seeming XML tree as its contents
are referenced, parsing more source document as necessary to fill out
the contents on demand.

Given stylized iterators, memory consumption can be limited to
particular subtrees.  Consider:

for i,child in xpairs_consume(lz) do
   if child.attr.externalref then
     print(child.name)
     table.insert(references, child)
   end
end

where the _consume family of iterators nils out elements from their
parent before returning them.  If the body of the loop does not retain
a reference to the child elsewhere, it will become eligible for
garbage collection as soon as the next iteration begins.

Although not currently implemented, other consuming forms may interact
with the XML parser for greater savings:

<document>
   <firstname>Jay</firstname>
   <lastname>Carlson</lastname>
   <bodytext>Spending too much time listening to <ref>In Utero</ref> can
be [...]
   <title>I Think I'm DOM</title>
</document>

lastname, title = xmlextract.strings_consume(tree, "lastname", "title")

The strings_consume filter can potentially turn off character data and
other events inside any node it knows it doesn't need (like bodytext),
as references to them cannot possibly affect the rest of the program.

Implementation details:

Given the following XML:

   <paragraph justify='centered'>first child<b>bold</b>second
child</paragraph>

A lazynode will appear to have the following contents:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", "bold", n=1},
   "second child",
   n=3
}

However, on the start of parsing, the actual underlying table will contain:

lz = {name="paragraph", attr={justify="centered"},
   _read_so_far=0
}

After a reference to lz[1], it will contain:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   _read_so_far=1
}

   And after a reference to lz[2]:

lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", _read_so_far=0}
}

Note that the child is read lazily as well.  However, a reference to lz[3]
will force all of lz[2] to be completed:


lz = {name="paragraph", attr={justify="centered"},
   "first child",
   {name="b", "bold", n=1}
   "second child",
   _read_so_far=3
}

Reading either lz[4] (which is nil) or lz.n will force the completion
of the node.

Note that reading from lz.n will force the remainder of the node to be
read, as we don't know how long it's going to be until it closes.

In general, calling normal iterators on a lazy node is a bad idea, as
all kinds of goo may be present.

Under the surface, this is implemented by pulling events from a queue
generated from expat.



- - - - - - - Appended by Scientific-Atlanta, Inc. - - - - - - -
This e-mail and any attachments may contain information which is confidential, proprietary, privileged or otherwise protected by law. The information is solely intended for the named addressee (or a person responsible for delivering it to the addressee). If you are not the intended recipient of this message, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete it from your computer.
Reply | Threaded
Open this post in threaded view
|

Re: lazy xml teaser

John Belmonte
In reply to this post by Jay Carlson
Jay Carlson wrote:
Why is this interesting?  Trees are a natural data model to process
XML documents.  A simple tree implementation reads the entire document
into memory at once.  For large documents, this can be too expensive.
Although callback and event APIs are memory-efficient, they are
painful to program to.

Lazynode constructs an conventional-seeming XML tree as its contents
are referenced, parsing more source document as necessary to fill out
the contents on demand.

I understand how this could save memory in a Lua table represent ion of an XML document. The "parsing more source document as necessary" is arguable. I've found it best to run an XML document through a validator like relax-ng before using it, relieving my program from the task of ad-hoc validation. Considering this, the XML often has already been parsed before your program gets to it.

-John


--
http:// if  ile.org/

Reply | Threaded
Open this post in threaded view
|

RE: lazy xml teaser

Stewart
In reply to this post by Jay Carlson
Title: RE: lazy xml teaser

Sometimes you don't have a complete tree to parse. This is the case with a jabber-xmlrpc project I worked on. In this case I had to do all kinds of funky things to the xml tree to parse it a node at a time.

I am intrigued by Jay's proposal.

-joe

-----Original Message-----
From: John Belmonte [[hidden email]]
Sent: Monday, February 23, 2004 11:08 AM
To: Lua list
Subject: Re: lazy xml teaser


Jay Carlson wrote:
> Why is this interesting?  Trees are a natural data model to process
> XML documents.  A simple tree implementation reads the entire document
> into memory at once.  For large documents, this can be too expensive.
> Although callback and event APIs are memory-efficient, they are
> painful to program to.
>
> Lazynode constructs an conventional-seeming XML tree as its contents
> are referenced, parsing more source document as necessary to fill out
> the contents on demand.

I understand how this could save memory in a Lua table represent ion of
an XML document.  The "parsing more source document as necessary" is
arguable.  I've found it best to run an XML document through a validator
like relax-ng before using it, relieving my program from the task of
ad-hoc validation.  Considering this, the XML often has already been
parsed before your program gets to it.

-John


--
<A rel="nofollow" href="http://" TARGET="_blank">http:// if  ile.org/



- - - - - - - Appended by Scientific-Atlanta, Inc. - - - - - - -
This e-mail and any attachments may contain information which is confidential, proprietary, privileged or otherwise protected by law. The information is solely intended for the named addressee (or a person responsible for delivering it to the addressee). If you are not the intended recipient of this message, you are not authorized to read, print, retain, copy or disseminate this message or any part of it. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete it from your computer.
Reply | Threaded
Open this post in threaded view
|

Re: lazy xml teaser

Martin Spernau
In reply to this post by John Belmonte
John Belmonte wrote:

I understand how this could save memory in a Lua table represent ion of an XML document. The "parsing more source document as necessary" is arguable. I've found it best to run an XML document through a validator like relax-ng before using it, relieving my program from the task of ad-hoc validation. Considering this, the XML often has already been parsed before your program gets to it.


True if your XML is a 'document'. But often it is a 'stream' (like in the Jabber case), so one needs to use SAX like approaches. For this I think the lazynode approach would be very nice.
-Martin