Parsing big compressed XML files

Parsing big compressed XML files

Valerio
Hello,
I was wondering if you are aware of an efficient way to parse a big compressed XML file with Lua (or LuaJIT).
The files store the wikipedia dumps: http://dumps.wikimedia.org/enwiki/20140304/ (all those files with name pages-meta-history).
Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).
The parsing consists of extracting the timestamp of each version of a page in the dump.

Suggestions?

Thanks,
Valerio 

Re: Parsing big compressed XML files

Peng Zhicheng

Could you please give some more details or some examples of the XML schema you want to parse?

As a general suggestion, you should use `lzlib' to read the compressed data, together with a
SAX-style XML parser (such as lxp or XLAXML) to deal with _HUGE_ XML input files.

But if your schema is _simple enough_ and the XML file is guaranteed to be _well formed_,
you might not need a fully functional XML parser (even a SAX-style one) at all.
In such cases, some simple, carefully written pattern matching might be enough.
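
A minimal sketch of that SAX approach, assuming LuaExpat (lxp) is installed. Since these particular dumps are bzip2-compressed, it reads from a bzcat pipe rather than through lzlib; the file name and chunk size are only examples:

local lxp = require("lxp")

-- Collect the text of every <timestamp> element (one per revision).
local in_timestamp, buf = false, {}
local parser = lxp.new{
    StartElement = function(_, name)
        if name == "timestamp" then in_timestamp, buf = true, {} end
    end,
    CharacterData = function(_, text)
        if in_timestamp then buf[#buf + 1] = text end
    end,
    EndElement = function(_, name)
        if name == "timestamp" then
            in_timestamp = false
            print(table.concat(buf))
        end
    end,
}

local f = assert(io.popen("bzcat enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2"))
while true do
    local chunk = f:read(2^16)          -- feed the parser in 64 KB chunks
    if not chunk then break end
    assert(parser:parse(chunk))
end
parser:parse()                          -- signal end of document
parser:close()
f:close()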


Re: Parsing big compressed XML files

Luiz Henrique de Figueiredo
In reply to this post by Valerio

Since these XML files seem to be formatted with one field per line,
a simple loop like the one below should suffice.

for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
        if l:match("<timestamp>")
        or l:match("<title>")
        or l:match("<id>")
        then
                print(l)
        end
end

Of course, you probably need to do more sophisticated parsing.
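
If only the values are needed, the same line-oriented approach can capture the text between the tags instead of printing whole lines. A small sketch extending the loop above, assuming the one-tag-per-line layout holds throughout:

for l in io.lines("enwiki-20140304-pages-meta-current1.xml-p000000010p000010000") do
        local title = l:match("<title>(.-)</title>")
        if title then print("PAGE: " .. title) end
        local ts = l:match("<timestamp>(.-)</timestamp>")
        if ts then print(ts) end                -- one revision timestamp per line
end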


Re: Parsing big compressed XML files

Valerio
Hello Luiz
thanks for your reply. But your solution seems to require that I first decompress the (potentially huge) compressed file beforehand.
Is that the case? Do you think it is possible to do it directly on the compressed file, or by piping its content through zcat, for instance?

Best,
valerio



Re: Parsing big compressed XML files

Luiz Henrique de Figueiredo
> Do you think it is possible to do it directly on the compressed
> file, or by piping its content through zcat, for instance?

Sure. Try

        local f = io.popen("bzcat enwiki-20140304-pages-meta-current1.xml-p000000010p000010000.bz2")

and then

        for l in f:lines() do ... end
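
Putting the two pieces together, a full pass over one compressed file might look like this sketch (same file name and patterns as above, just fed through the pipe):

local f = assert(io.popen("bzcat enwiki-20140304-pages-meta-current1.xml-p000000010p000010000.bz2"))
for l in f:lines() do
        if l:match("<timestamp>")
        or l:match("<title>")
        or l:match("<id>")
        then
                print(l)
        end
end
f:close()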


Re: Parsing big compressed XML files

Oliver Kroth
In reply to this post by Valerio
You may also try one of the zlib libraries that provide an interface
between Lua and zlib.
This may save you from having to inflate (decompress) the complete file first.

-- Oliver


Re: Parsing big compressed XML files

Valerio
Hello Oliver,
do you suggest something like https://github.com/brimworks/lua-zlib ?
Saving the decompression time would indeed be awesome, but I thought bzcat basically does the same thing, no?



Re: Parsing big compressed XML files

Oliver Kroth
Hello Valerio,

yes, actually I am using exactly this library in my project.

You won't get rid of the overall time needed to decompress the data, but you may
-  decrease the delay between starting to read the compressed file and starting to parse the XML
-  reduce the memory and disk space used during the parsing

Is your XML data just plain deflated, or is there a PKZIP archive structure around it? If so, you would need something to navigate through it, which can be done in pure Lua.
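
For reference, a rough sketch of streaming decompression with brimworks lua-zlib. Note that it handles zlib/gzip streams, so the .bz2 dumps would still need bzcat or a bzip2 binding; the file name here is purely hypothetical:

local zlib = require("zlib")

local inflate = zlib.inflate()                  -- stateful inflate stream (see the lua-zlib docs for windowBits options)
local f = assert(io.open("dump.xml.gz", "rb"))  -- hypothetical compressed input
while true do
        local chunk = f:read(2^16)
        if not chunk then break end
        local text, eof = inflate(chunk)        -- returns whatever could be decompressed so far
        io.write(text)                          -- or hand it straight to the XML parser
        if eof then break end
end
f:close()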

-- Oliver

Re: Parsing big compressed XML files

Petite Abeille
In reply to this post by Valerio

On Apr 3, 2014, at 12:01 PM, Valerio Schiavoni <[hidden email]> wrote:

> Currently we do it in Java, but it takes ages to parse all those files (around 18 hours).

18 hours?! Including download? Or?

> The parsing consists in extracting the timestamp of the each of the versions for a page in the dump.

So, broadly speaking:

$ curl http://dumps.wikimedia.org/enwiki/20140203/enwiki-20140203-pages-articles-multistream.xml.bz2 | bzcat | xml2

The above file is around 10.6 GB compressed… the entire processing runs end-to-end in about 120 minutes on a rather pedestrian network connection, on a rather diminutive laptop… not quite apples to apples, but still, a far cry from 18 hours… my guess is there is room for improvement :)

(XML parsing courtesy of xml2: http://www.ofb.net/~egnor/xml2/ )

Re: Parsing big compressed XML files

Valerio
Hello,

On Thu, Apr 3, 2014 at 8:04 PM, Petite Abeille <[hidden email]> wrote:
> 18 hours?! Including download? Or?

18 hours is the cumulative time for _all_ the files, not 18 hours per file :-)
 

Re: Parsing big compressed XML files

Petite Abeille

On Apr 4, 2014, at 12:00 AM, Valerio Schiavoni <[hidden email]> wrote:

> 18 hours is the cumulative time for _all_ the files , not 18 hours per file :-)

Aha… makes more sense… OK, so, as of April 4th, there were 161 'pages-meta-history' files, ranging in size from 80 MB to 31 GB…

Looking at the largest compressed file, it takes a whopping 5 hours to inflate on my consumer-grade system:

$ time bzcat < enwiki-20140304-pages-meta-history16.xml-p005043453p005137507.bz2 > /dev/null

real 309m38.471s
user 305m0.095s
sys 1m55.005s

A bit overwhelming for my little setup. I hope you have a big hardware budget :D

Re: Parsing big compressed XML files

Valerio
We have a big cluster, but to exploit it for this task we might need some map-reduce setup, which we don't have.
OTOH, there are those heavily compressed .7z files, orders of magnitude smaller than the .bz2 ones. I wonder if I can inflate those files with bzcat as well...

I've only found this script which seems to provide a "7zcat" tool:

PATH=${GZIP_BINDIR-'/bin'}:$PATH
exec 7z e -so -bd "$@" 2>/dev/null | cat  


It'd be interesting to see if you get better results on your hardware... 
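
From Lua, that same 7z invocation can be opened as a pipe, mirroring the bzcat approach. A sketch, reusing one of the dump file names from this thread:

-- Read a .7z dump through a 7z pipe and pull out the revision timestamps.
local name = "enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z"
local f = assert(io.popen("7z e -so -bd " .. name .. " 2>/dev/null"))
for l in f:lines() do
        local ts = l:match("<timestamp>(.-)</timestamp>")
        if ts then print(ts) end
end
f:close()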


Re: Parsing big compressed XML files

Valerio
And for your curiosity, on one of the smaller files I get significant differences between 7z and bz2:

$ time 7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z 2>/dev/null > /dev/null
7z e -so -bd enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.7z  6.10s user 0.02s system 99% cpu 6.120 total

$ time bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null
bzcat < enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2 > /dev/null  61.26s user 0.14s system 99% cpu 1:01.41 total

Re: Parsing big compressed XML files

KHMan
It's strange that Wikipedia has not moved to xz but is using a mix of
7z and bzip2; even the Linux kernel has moved to tar.xz. Both 7z
and xz use the newer LZMA2.

Unfortunately, BWT+Huffman in bzip2 has roughly symmetrical times
for compression and decompression. 7z/xz decompression is not
symmetrical to compression, and will always be a lot faster than
bzip2 at big block sizes. For multiple runs, recompressing the whole
kaboodle to 7z/xz will probably greatly improve your runtimes.
Things like LZO are a lot faster but will likely compress to only
50% or so for text data.

--
Cheers,
Kein-Hong Man (esq.)
Kuala Lumpur, Malaysia



Re: Parsing big compressed XML files

Jeff Pohlmeyer
In reply to this post by Valerio
On Sun, Apr 6, 2014 at 5:50 PM, Valerio Schiavoni wrote:

> We have a big cluster, but to exploit it for this task we might need
> some map-reduce setup, which we don't have.


Maybe this could speed things up a bit:
http://lbzip2.org/
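
lbzip2 keeps bzip2's command-line interface but decompresses with multiple threads, so swapping it into the earlier pipe is a one-line change. A sketch, assuming lbzip2 is installed (-d decompresses, -c writes to stdout):

local name = "enwiki-20140304-pages-meta-history8.xml-p000662352p000665000.bz2"
local f = assert(io.popen("lbzip2 -d -c " .. name))
for l in f:lines() do
        local ts = l:match("<timestamp>(.-)</timestamp>")
        if ts then print(ts) end
end
f:close()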

 - Jeff