HTML Parser Recommendation

HTML Parser Recommendation

Chris Datfung
I'm looking for a recommendation for an HTML parser that is supported in both lua 5.1 and 5.2.

Thanks,
Chris

Re: HTML Parser Recommendation

Daniel Silverstone
On Thu, May 23, 2013 at 05:15:27PM +0300, Chris Datfung wrote:
> I'm looking for a recommendation for an HTML parser that is supported in
> both lua 5.1 and 5.2.

Parsing HTML is very hard to get right, so you're unlikely to find a pure Lua
version of a parser.

I think there's a hubbub binding: https://github.com/craigbarnes/lua-hubbub

That only seems to be for 5.1, but since it's pure C it'll probably be
reasonably easy to port for 5.2.

D.

--
Daniel Silverstone                         http://www.digital-scurf.org/
PGP mail accepted and encouraged.            Key Id: 3CCE BABE 206C 3B69


Re: HTML Parser Recommendation

Rob Kendrick-2
In reply to this post by Chris Datfung
On Thu, May 23, 2013 at 05:15:27PM +0300, Chris Datfung wrote:
> I'm looking for a recommendation for an HTML parser that is supported in
> both lua 5.1 and 5.2.

Do you need a tree builder, too?

Perhaps this: https://github.com/craigbarnes/lua-hubbub

B.


Re: HTML Parser Recommendation

Craig Barnes
On 23 May 2013 15:28, Rob Kendrick <[hidden email]> wrote:

> On Thu, May 23, 2013 at 05:15:27PM +0300, Chris Datfung wrote:
>> I'm looking for a recommendation for an HTML parser that is supported in
>> both lua 5.1 and 5.2.
>
> Do you need a tree builder, too?
>
> Perhaps this: https://github.com/craigbarnes/lua-hubbub
>
> B.
>

That binding is still incomplete at the moment but I will
get around to finishing it next week or so. I've been mindful
of 5.2 support, so it should be trivial to port. Both the
Hubbub internals and the API/binding use C99, so MSVC
is not supported, if that matters.


Re: HTML Parser Recommendation

Rob Kendrick-2
On Thu, May 23, 2013 at 04:35:22PM +0100, Craig Barnes wrote:

> On 23 May 2013 15:28, Rob Kendrick <[hidden email]> wrote:
> > On Thu, May 23, 2013 at 05:15:27PM +0300, Chris Datfung wrote:
> >> I'm looking for a recommendation for an HTML parser that is supported in
> >> both lua 5.1 and 5.2.
> >
> > Do you need a tree builder, too?
> >
> > Perhaps this: https://github.com/craigbarnes/lua-hubbub
>
> That binding is still incomplete at the moment but I will
> get around to finishing it next week or so. I've been mindful
> of 5.2 support, so it should be trivial to port. Both the
> Hubbub internals and the API/binding use C99, so MSVC
> is not supported, if that matters.

You are of course welcome on #netsurf or netsurf-dev if you need any
assistance :)

(I'm a member of the NetSurf project, from which hubbub comes.)

B.


Re: HTML Parser Recommendation

Wesley Smith
In reply to this post by Craig Barnes
> That binding is still incomplete at the moment but I will
> get around to finishing it next week or so. I've been mindful
> of 5.2 support, so it should be trivial to port. Both the
> Hubbub internals and the API/binding use C99, so MSVC
> is not supported, if that matters.

Couldn't you just compile it as C++ and add extern "C" where necessary?


Re: HTML Parser Recommendation

Rob Kendrick-2
On Thu, May 23, 2013 at 02:25:11PM -0700, Wesley Smith wrote:
> > That binding is still incomplete at the moment but I will
> > get around to finishing it next week or so. I've been mindful
> > of 5.2 support, so it should be trivial to port. Both the
> > Hubbub internals and the API/binding use C99, so MSVC
> > is not supported, if that matters.
>
> Couldn't you just compile it as C++ and add extern "C" where necessary?

We don't support that, or compilers that can't cope with 13 year old
standards.

B.


Re: HTML Parser Recommendation

Coda Highland
On Thu, May 23, 2013 at 2:41 PM, Rob Kendrick <[hidden email]> wrote:

> On Thu, May 23, 2013 at 02:25:11PM -0700, Wesley Smith wrote:
>> > That binding is still incomplete at the moment but I will
>> > get around to finishing it next week or so. I've been mindful
>> > of 5.2 support, so it should be trivial to port. Both the
>> > Hubbub internals and the API/binding use C99, so MSVC
>> > is not supported, if that matters.
>>
>> Couldn't you just compile it as C++ and add extern "C" where necessary?
>
> We don't support that, or compilers that can't cope with 13 year old
> standards.
>
> B.
>

MSVC doesn't support C99 because it's a C++ compiler. The languages
are divergent. (It DOES support C++11 pretty well.) Sort of silly to
be criticizing a compiler for not supporting something it never
claimed to support.

Not silly to say "well, then that compiler can't compile my code."
Because I wouldn't try to compile it with a FORTRAN compiler, either.

/s/ Adam


Re: HTML Parser Recommendation

Daniel Silverstone
On Thu, May 23, 2013 at 02:47:43PM -0700, Coda Highland wrote:
> MSVC doesn't support C99 because it's a C++ compiler. The languages
> are divergent. (It DOES support C++11 pretty well.) Sort of silly to
> be criticizing a compiler for not supporting something it never
> claimed to support.

If it's not a C compiler, it shouldn't compile C.  (Note C is not a strict
subset of C++ and you can compile C which is not valid C++ with MSVC).

Fortunately for those who need hubbub etc compiled for WIN32, there is MINGW32.

> Not silly to say "well, then that compiler can't compile my code."
> Because I wouldn't try to compile it with a FORTRAN compiler, either.

Sigh.

D.



Re: HTML Parser Recommendation

Roberto Ierusalimschy
In reply to this post by Coda Highland
> MSVC doesn't support C99 because it's a C++ compiler. The languages
> are divergent. (It DOES support C++11 pretty well.) Sort of silly to
> be criticizing a compiler for not supporting something it never
> claimed to support.

From http://msdn.microsoft.com/en-us/library/vstudio/bb384838.aspx
(a site from Microsoft):

  "Visual Studio 2012
  Visual Studio includes a C compiler that you can use to create
  everything from basic C programs to Windows API applications."

In my eyes, it seems they do claim to support C.

-- Roberto


Re: HTML Parser Recommendation

Coda Highland
On Thu, May 23, 2013 at 3:19 PM, Roberto Ierusalimschy
<[hidden email]> wrote:

>> MSVC doesn't support C99 because it's a C++ compiler. The languages
>> are divergent. (It DOES support C++11 pretty well.) Sort of silly to
>> be criticizing a compiler for not supporting something it never
>> claimed to support.
>
> From http://msdn.microsoft.com/en-us/library/vstudio/bb384838.aspx
> (a site from Microsoft):
>
>   "Visual Studio 2012
>   Visual Studio includes a C compiler that you can use to create
>   everything from basic C programs to Windows API applications."
>
> In my eyes, it seems they do claim to support C.
>
> -- Roberto
>

Oh my! *laugh* I hadn't seen that particular page. Good catch!

/s/ Adam


Re: HTML Parser Recommendation

Craig Barnes
In reply to this post by Coda Highland
On 23 May 2013 22:47, Coda Highland <[hidden email]> wrote:
> MSVC doesn't support C99 because it's a C++ compiler. The languages
> are divergent. (It DOES support C++11 pretty well.) Sort of silly to
> be criticizing a compiler for not supporting something it never
> claimed to support.

Sure, but some people may expect Lua libraries to stick to the part of C
that MSVC *does* support, regardless of what it claims to be. I was
simply making it clear in case it went against anyone's expectations.
I know very little about MSVC but it wasn't my intention to get into a
political debate.


Re: HTML Parser Recommendation

Craig Barnes
In reply to this post by Roberto Ierusalimschy
On 23 May 2013 23:19, Roberto Ierusalimschy <[hidden email]> wrote:

> From http://msdn.microsoft.com/en-us/library/vstudio/bb384838.aspx
> (a site from Microsoft):
>
>   "Visual Studio 2012
>   Visual Studio includes a C compiler that you can use to create
>   everything from basic C programs to Windows API applications."
>
> In my eyes, it seems they do claim to support C.
>
> -- Roberto
>

Ah, thanks for clearing that up. I was thinking the whole time that
MSVC did claim to support C, although it was stated so vehemently
that I started to question myself.


Re: HTML Parser Recommendation

William Ahern
On Thu, May 23, 2013 at 11:45:35PM +0100, Craig Barnes wrote:

> On 23 May 2013 23:19, Roberto Ierusalimschy <[hidden email]> wrote:
> > From http://msdn.microsoft.com/en-us/library/vstudio/bb384838.aspx
> > (a site from Microsoft):
> >
> >   "Visual Studio 2012
> >   Visual Studio includes a C compiler that you can use to create
> >   everything from basic C programs to Windows API applications."
> >
> > In my eyes, it seems they do claim to support C.
> >
> > -- Roberto
> >
>
> Ah, thanks for clearing that up. I was thinking the whole time that
> MSVC did claim to support C, although it was stated so vehemently
> that I started to question myself.

They support C95, but have publicly stated they have no intention to ever
support C99, C11, or any newer C standard. For all intents and purposes,
their C compiler is dead. Their C95 support is complete, so all they do is
make sure nothing breaks.

Occasionally you do see C99-like features, such as variable argument macros,
the __restrict qualifier, and most recently stdint.h, but these come via
their inclusion in the C++ standard. And sadly C++ has declined to adopt
most of what's new in C99 and C11, including named initializers or compound
literals. In fact, the specifications for <stdbool.h> clash, and C++ refused
to even support the _Bool type native to C.

Basically, C and C++ have diverged. Microsoft knows this, and has basically
chosen to abandon C, though maintaining backward compatibility for older
code.



Re: HTML Parser Recommendation

Craig Barnes
On 24 May 2013 00:36, William Ahern <[hidden email]> wrote:

> On Thu, May 23, 2013 at 11:45:35PM +0100, Craig Barnes wrote:
>> On 23 May 2013 23:19, Roberto Ierusalimschy <[hidden email]> wrote:
>> > From http://msdn.microsoft.com/en-us/library/vstudio/bb384838.aspx
>> > (a site from Microsoft):
>> >
>> >   "Visual Studio 2012
>> >   Visual Studio includes a C compiler that you can use to create
>> >   everything from basic C programs to Windows API applications."
>> >
>> > In my eyes, it seems they do claim to support C.
>> >
>> > -- Roberto
>> >
>>
>> Ah, thanks for clearing that up. I was thinking the whole time that
>> MSVC did claim to support C, although it was stated so vehemently
>> that I started to question myself.
>
> They support C95, but have publicly stated they have no intention to ever
> support C99, C11, or any newer C standard. For all intents and purposes,
> their C compiler is dead. Their C95 support is complete, so all they do is
> make sure nothing breaks.
>
> Occasionally you do see C99-like features, such as variable argument macros,
> the __restrict qualifier, and most recently stdint.h, but these come via
> their inclusion in the C++ standard. And sadly C++ has declined to adopt
> most of what's new in C99 and C11, including named initializers or compound
> literals. In fact, the specifications for <stdbool.h> clash, and C++ refused
> to even support the _Bool type native to C.
>
> Basically, C and C++ have diverged. Microsoft knows this, and has basically
> chosen to abandon C, though maintaining backward compatibility for older
> code.
>

Isn't this almost exactly the point that was being alluded to, before these
tangential points started?


Re: HTML Parser Recommendation

William Ahern
On Fri, May 24, 2013 at 12:58:20AM +0100, Craig Barnes wrote:
<snip>
> > Basically, C and C++ have diverged. Microsoft knows this, and has basically
> > chosen to abandon C, though maintaining backward compatibility for older
> > code.
> >
>
> Isn't this almost exactly the point that was being alluded to, before these
> tangential points started?

I dunno. Things got confusing because people were conflating terms and
standards. "C" is really ambiguous in practice--it could mean any of the C
standards or implementations, or it could mean the imaginary "C" language
grammar that some people erroneously believe C++ supports.



Re: HTML Parser Recommendation

steve donovan
In reply to this post by Daniel Silverstone
On Thu, May 23, 2013 at 4:20 PM, Daniel Silverstone <[hidden email]> wrote:
> Parsing HTML is very hard to get right, so you're unlikely to find a pure Lua
> version of a parser.

If it's well-formed HTML (and this is often a big 'if') then Penlight's xml module has an HTML mode:

    local xml = require 'pl.xml'
    xml.parsehtml = true
    -- second argument: true means 'file' is a file, not a string
    -- third argument: true selects the pure Lua parser
    local doc = xml.parse(file, true, true)

The result is then in luaexpat's LOM format; this can be processed using other functionality of pl.xml like pattern matching.

This mode knows about the HTML elements that can be empty, like <br> (<p> isn't empty!), and about case-insensitivity.

Possibly of marginal use in the crappy world of real-world HTML, but does manage to cope with real pages that keep to the standard.

(I still don't like the fact that 'parsehtml' is basically a global mode but everything is a work in progress)

steve d.




Re: HTML Parser Recommendation

Rob Kendrick-2
On Fri, May 24, 2013 at 09:53:27AM +0200, steve donovan wrote:

> On Thu, May 23, 2013 at 4:20 PM, Daniel Silverstone <
> [hidden email]> wrote:
>
> > Parsing HTML is very hard to get right, so you're unlikely to find a pure
> > Lua
> > version of a parser.
> >
>
> If it's well-formed HTML (and this is often a big 'if') then Penlight's xml
> module has an HTML mode

One of the strengths of libhubbub is that it parses good and bad HTML.
It follows the HTML5 spec, which essentially specifies how MSIE parses
broken HTML.

B.


Re: HTML Parser Recommendation

steve donovan
On Fri, May 24, 2013 at 10:42 AM, Rob Kendrick <[hidden email]> wrote:
> One of the strengths of libhubbub is that it parses good and bad HTML.
> It follows the HTML5 spec, which essentially specifies how MSIE parses
> broken HTML.

It must have lots of if statements ;)

Which is precisely why it sounds like a good industrial solution. I don't doubt it can be done well with LPeg, but there would need to be hundreds of tests before it would be a contender.


Re: HTML Parser Recommendation

Chris Datfung
In reply to this post by Craig Barnes
On Thu, May 23, 2013 at 6:35 PM, Craig Barnes <[hidden email]> wrote:
> On 23 May 2013 15:28, Rob Kendrick <[hidden email]> wrote:
> > On Thu, May 23, 2013 at 05:15:27PM +0300, Chris Datfung wrote:
> >> I'm looking for a recommendation for an HTML parser that is supported in
> >> both lua 5.1 and 5.2.
> >
> > Do you need a tree builder, too?
> >
> > Perhaps this: https://github.com/craigbarnes/lua-hubbub

Hi Craig,

I'm trying to install lua-hubbub on a Debian system and there seems to be an endless number of missing dependencies. Any suggestions on installing libhubbub on Debian?

Thanks
Chris

> >
> > B.
> >
>
> That binding is still incomplete at the moment but I will
> get around to finishing it next week or so. I've been mindful
> of 5.2 support, so it should be trivial to port. Both the
> Hubbub internals and the API/binding use C99, so MSVC
> is not supported, if that matters.

