Good solution to parse HTML?

Good solution to parse HTML?

Nereus
Hello

I'd like to write a script to extract parts of an HTML page. Since Lua is so small, it looks like a good match to run on an appliance.

A bit of research shows that it's not a good idea to use a regex engine, and people recommend using an XML parser.

Is there a good tool I could use in Lua to parse HTML?

Thank you.

Re: Good solution to parse HTML?

Aapo Talvensaari
On 18 November 2015 at 15:55, Nereus <[hidden email]> wrote:
> A bit of research shows that it's not a good idea to use a regex engine, and
> people recommend using an XML parser.
>
> Is there a good tool I could use in Lua to parse HTML?

I would recommend using an HTML parser, such as this:
https://github.com/craigbarnes/lua-gumbo

Re: Good solution to parse HTML?

Luiz Henrique de Figueiredo
In reply to this post by Nereus
> Is there a good tool I could use in Lua to parse HTML?

The problem with HTML, of course, is that it is not XML: HTML in the
wild will most probably not conform to any standard, so a strict XML
parser may reject it.
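
To make this concrete, here is a minimal sketch in plain Lua. The helper name is_balanced is made up for this illustration, and the check is only a rough stand-in for the well-formedness rules an XML parser enforces (it ignores comments, CDATA and ">" inside attribute values). It shows that ordinary HTML with omitted end tags fails such a check:

```lua
-- Rough well-formedness check: every opening tag must be closed, in order,
-- as a strict XML parser would require. Illustration only, not a parser.
local function is_balanced(markup)
  local stack = {}
  for closing, name, rest in markup:gmatch("<(/?)(%a%w*)([^>]*)>") do
    name = name:lower()
    if closing == "/" then
      -- An end tag must match the most recently opened element.
      if stack[#stack] ~= name then return false end
      stack[#stack] = nil
    elseif rest:sub(-1) ~= "/" then
      -- Not self-closing (like <br/>), so remember it as open.
      stack[#stack + 1] = name
    end
  end
  return #stack == 0
end

print(is_balanced("<p>one</p><br/>"))   -- XHTML-style markup
print(is_balanced("<p>one<p>two<br>"))  -- legal HTML, but not well-formed XML
```

The second string is the kind of markup browsers accept without complaint, which is why a dedicated HTML parser is usually the safer choice.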

If you want a simple XML -> Lua table converter, try my lxml:

        http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lxml

Try also
        http://doc.lubyk.org/xml.html

Google found these:

        https://github.com/msva/lua-htmlparser
        https://github.com/luaforge/html/tree/master/html


Re: Good solution to parse HTML?

Eduardo Ochs
In reply to this post by Aapo Talvensaari
On Wed, Nov 18, 2015 at 3:06 PM, Aapo Talvensaari
<[hidden email]> wrote:
> On 18 November 2015 at 15:55, Nereus <[hidden email]> wrote:
>> (...)
>> Is there a good tool I could use in Lua to parse HTML?
>
> I would recommend using HTML parser, such as this:
> https://github.com/craigbarnes/lua-gumbo


By the way, does anyone here know how to _use_ lua-gumbo?
I just tried again my scripts for downloading, compiling and
installing gumbo-parser and lua-gumbo, which are:

  rm -Rfv ~/usrc/gumbo-parser/
  cd      ~/usrc/
  git clone --depth 1 https://github.com/google/gumbo-parser

  cd ~/usrc/gumbo-parser/
  sh ./autogen.sh    2>&1 | tee oa
  ./configure        2>&1 | tee oc
  make               2>&1 | tee om
  sudo make install  2>&1 | tee omi

  rm -Rfv ~/usrc/lua-gumbo/
  cd      ~/usrc/
  git clone --depth 1 https://github.com/craigbarnes/lua-gumbo
  cd      ~/usrc/lua-gumbo/
  make              2>&1 | tee om
  make check        2>&1 | tee omc
  sudo make install 2>&1 | tee omi

and now the "make check" in lua-gumbo passes only 2 of the tests and
fails the other 19. Anyway, I've never been able to use lua-gumbo
for even the simplest things, like extracting the title of an HTML
page...

  Cheers =),
    Eduardo Ochs
    http://angg.twu.net/


Re: Good solution to parse HTML?

Eduardo Ochs


Update (with thanks to Craig Barnes!):
all that was missing was an "ldconfig" to make the new library in
/usr/local/lib/ be recognized... this works:

 (eepitch-shell)
 (eepitch-kill)
 (eepitch-shell)
  rm -Rfv ~/usrc/gumbo-parser/
  cd      ~/usrc/
  git clone --depth 1 https://github.com/google/gumbo-parser
  cd      ~/usrc/gumbo-parser/
  sh ./autogen.sh
  ./configure
  make
  sudo make install
  sudo ldconfig

  rm -Rfv ~/usrc/lua-gumbo/
  cd      ~/usrc/
  git clone --depth 1 https://github.com/craigbarnes/lua-gumbo
  cd      ~/usrc/lua-gumbo/
  make
  make check
  sudo make install

  lua5.1
    parse = require("gumbo").parse
    print(parse("<title>Hello  world!</title>").title)

Cheers! =)
  Eduardo Ochs
  [hidden email]
  http://angg.twu.net/


Re: Good solution to parse HTML?

Ericson Carlos
In reply to this post by Nereus
On Wed, Nov 18, 2015 at 9:55 PM, Nereus <[hidden email]> wrote:
> (...)
> Is there a good tool I could use in Lua to parse HTML?
>
> --
> View this message in context: http://lua.2524044.n2.nabble.com/Good-solution-to-parse-HTML-tp7670415.html
> Sent from the Lua-l mailing list archive at Nabble.com.

One option is to run the page through htmltidy (HTML Tidy) to convert it to XHTML. If the conversion succeeds, you can then use an XML parser.
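
A minimal sketch of that approach from Lua, assuming the tidy command-line tool is installed and on the PATH. The helper names shell_quote and tidy_command and the file names page.html and page.xhtml are placeholders invented for this example. The flags used are tidy's -q (quiet), -asxhtml (emit XHTML), -numeric (numeric rather than named entities) and -o (output file):

```lua
-- Quote a string for a POSIX shell by wrapping it in single quotes.
local function shell_quote(s)
  return "'" .. s:gsub("'", "'\\''") .. "'"
end

-- Build a tidy command line that converts infile to XHTML in outfile.
local function tidy_command(infile, outfile)
  return table.concat({
    "tidy", "-q", "-asxhtml", "-numeric",
    "-o", shell_quote(outfile), shell_quote(infile),
  }, " ")
end

local cmd = tidy_command("page.html", "page.xhtml")
print(cmd)
-- To actually run it, pass the string to os.execute(cmd).
```

tidy exits with status 0 on success, 1 if there were warnings and 2 if there were errors, so the caller should check the result of os.execute before handing the output to an XML parser. Note that the value os.execute returns differs between Lua 5.1 and later versions.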


Re: Good solution to parse HTML?

Kenneth LO
In reply to this post by Nereus
Take a look at lua-gumbo.  Simple code to extract tables looks like this:

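local gumbo = require('gumbo')

-- Parse the HTML file named on the command line.
local infile = assert(arg[1], 'expected an HTML file name as the first argument')
local doc = assert(gumbo.parseFile(infile))

-- Print the markup of every <table> element in the document.
local tables = doc:getElementsByTagName('table')
for i = 1, #tables do
  print(tables[i].outerHTML)
end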
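
The same pattern works for other elements. As a hedged sketch, assuming lua-gumbo is installed and that its elements provide the DOM-style getAttribute method, the following collects the href of every link in an HTML string. The function name extract_hrefs and the sample markup are invented for this example:

```lua
-- Collect the href attribute of every <a> element in an HTML string.
local function extract_hrefs(html)
  local gumbo = require("gumbo")
  local document = assert(gumbo.parse(html))
  local hrefs = {}
  local anchors = document:getElementsByTagName("a")
  for i = 1, #anchors do
    local href = anchors[i]:getAttribute("href")
    if href then hrefs[#hrefs + 1] = href end  -- skip anchors without href
  end
  return hrefs
end

-- pcall keeps the script from aborting if lua-gumbo is not installed.
local sample = '<p><a href="/one">1</a> <a>no href</a> <a href="/two">2</a></p>'
local ok, result = pcall(extract_hrefs, sample)
if ok then
  for _, href in ipairs(result) do print(href) end
else
  io.stderr:write("could not run the example: ", tostring(result), "\n")
end
```

Wrapping the call in pcall is a design choice for unattended scripts: a missing library or a parse failure becomes a logged message rather than an uncaught error.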
