LPeg support for utf-8

21 messages

LPeg support for utf-8

Roberto Ierusalimschy
A quick survey, for those who care:
- should LPeg support utf-8?
- If so, what would that mean?

-- Roberto

Re: LPeg support for utf-8

Marc Balmer
On 01.04.11 19:29, Roberto Ierusalimschy wrote:
> A quick survey, for those who care:
> - should LPeg support utf-8?

Yes, please.  In fact, all of Lua should... (at least that's what _I_
would like to see, purely egoistic standpoint, I know...)

> - If so, what would that mean?

On the C level, quite a lot.  strlen() and friends can no longer be
used, printf format strings like "%20s" don't work anymore etc.  Not to
speak about string comparison, collation etc.  Since I am not familiar
with LPeg's implementation, that is about all I can say.

Re: LPeg support for utf-8

Peter Odding-3
In reply to this post by Roberto Ierusalimschy
Hi Roberto,

> A quick survey, for those who care:
> - should LPeg support utf-8?
> - If so, what would that mean?

I love LPeg but don't see anything useful it could do with UTF-8 that it
doesn't already do. LPeg already handles parsing UTF-8 fine (for those
who don't know: UTF-8 is a superset of ASCII). Any built-in "magic"
would only reduce the flexibility for users of LPeg, unless you're
considering an add-on module like "re". That would be fine of course but
I don't really see the need for it, given that modules like slnunicode
are available.

 - Peter Odding

Re: LPeg support for utf-8

Peter Cawley
In reply to this post by Marc Balmer
On Fri, Apr 1, 2011 at 6:37 PM, Marc Balmer <[hidden email]> wrote:

> On 01.04.11 19:29, Roberto Ierusalimschy wrote:
>> A quick survey, for those who care:
>> - should LPeg support utf-8?
>
> Yes, please.  In fact, all of Lua should... (at least that's what _I_
> would like to see, purely egoistic standpoint, I know...)
>
>> - If so, what would that mean?
>
> On the C level, quite a lot.  strlen() and friends can no longer be
> used, printf format strings like "%20s" don't work anymore etc.  Not to
> speak about string comparison, collation etc.  Since I am not familiar
> with LPeg's implementation, that is about all I can say.

I interpreted the question to be more about semantics than
implementation, e.g. what should lpeg.P(n) mean for integer n? What
should lpeg.R and lpeg.S do? And so on.

My first thought is that if you want utf8 support, you can write a
thin wrapper around lpeg in Lua.
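Such a wrapper is indeed short at the byte level. A minimal sketch (the names utf8char and upoint are mine, not part of lpeg, and the pattern is deliberately loose: it does not reject overlong or otherwise ill-formed sequences):

```lua
local lpeg = require "lpeg"
local P, R = lpeg.P, lpeg.R

-- One UTF-8 encoded code point, expressed as byte ranges
-- (0xC0/0xC1 and 0xF5-0xFF never appear in valid UTF-8).
local cont = R("\128\191")                           -- continuation byte
local utf8char = R("\0\127")                         -- 1-byte (ASCII)
               + R("\194\223") * cont                -- 2-byte sequence
               + R("\224\239") * cont * cont         -- 3-byte sequence
               + R("\240\244") * cont * cont * cont  -- 4-byte sequence

-- A code-point-counting analogue of lpeg.P(n).
local function upoint(n)
  local p = P(true)
  for _ = 1, n do p = p * utf8char end
  return p
end
```

For example, upoint(2) matches two encoded code points regardless of how many bytes each occupies.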

Re: LPeg support for utf-8

David Given
In reply to this post by Marc Balmer
On 01/04/11 17:37, Marc Balmer wrote:
[...]
> On the C level, quite a lot.  strlen() and friends can no longer be
> used, printf format strings like "%20s" don't work anymore etc.  Not to
> speak about string comparison, collation etc.  Since I am not familiar
> with LPeg's implementation, that is about all I can say.

Determining the length of a Unicode string is a pretty fuzzy concept
anyway --- AFAIK the only way to do it is to break it up into grapheme
clusters and determine the size of each grapheme cluster individually
(which may vary according to font).

I tend to use a cheap and nasty mechanism for console applications that
assumes each code point is a grapheme cluster, and then uses a set of
rules to decide whether it is of width 1 or 2. This works most of the
time, but not all of the time. See:

http://wordgrinder.hg.sourceforge.net/hgweb/wordgrinder/wordgrinder/file/f658d1e8f1f3/src/c/emu/wcwidth.c

What I'd like from LPeg is a set of primitives for matching a single
code point and a single grapheme cluster (treating them as Lua strings,
i.e. sequences of bytes). This would allow easier parsing of UTF-8
strings. The collation stuff might be useful, but not only is it
hideously complicated and involves massive tables, I've never actually
found a need for it, so I'd be willing to live without it.

--
┌─── dg@cowlark.com ───── http://www.cowlark.com ─────
│ "I have always wished for my computer to be as easy to use as my
│ telephone; my wish has come true because I can no longer figure out
│ how to use my telephone." --- Bjarne Stroustrup


Re: LPeg support for utf-8

Tony Finch
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy <[hidden email]> wrote:

> A quick survey, for those who care:
> - should LPeg support utf-8?
> - If so, what would that mean?

An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
Perhaps there should be a .uB as well. I would prefer this to a "unicode
mode" which changes the behaviour of the existing functions.

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
Bailey: Southwest 5 to 7, veering west 4 or 5. Rough or very rough. Showers.
Moderate or good.

Re: LPeg support for utf-8

Roberto Ierusalimschy
In reply to this post by Peter Odding-3
> Hi Roberto,
>
> > A quick survey, for those who care:
> > - should LPeg support utf-8?
> > - If so, what would that mean?
>
> I love LPeg but don't see anything useful it could do with UTF-8 that it
> doesn't already do. LPeg already handles parsing UTF-8 fine (for those
> who don't know: UTF-8 is a superset of ASCII). Any built-in "magic"
> would only reduce the flexibility for users of LPeg, unless you're
> considering an add-on module like "re". That would be fine of course but
> I don't really see the need for it, given that modules like slnunicode
> are available.

The support for UTF-8 would not change the current "byte-oriented"
behavior.  I am thinking more in terms of extra built-in patterns.
So, for instance, lpeg.utf8.point(n) would match n UTF-8 code points,
lpeg.utf8.set("...") would match any point present in the given string,
and lpeg.utf8.range(v1,v2) would match any point with a code between v1
and v2.

-- Roberto

Re: LPeg support for utf-8

Roberto Ierusalimschy
In reply to this post by Tony Finch
> Roberto Ierusalimschy <[hidden email]> wrote:
>
> > A quick survey, for those who care:
> > - should LPeg support utf-8?
> > - If so, what would that mean?
>
> An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
> instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
> with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
> Perhaps there should be a .uB as well. I would prefer this to a "unicode
> mode" which changes the behaviour of the existing functions.

This is more or less what I had in mind (specific names
notwithstanding). But the question still remains whether each of these
constructions (uS, uR, etc.) is really useful, and whether there should
be others. For instance, would it be worth supporting something like
properties (using wctype)? Or a capture that matches one code point
and captures its value?

-- Roberto
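Both of those could be prototyped today on top of plain LPeg with captures. A rough sketch (all names are mine, and a real built-in would presumably compile ranges down to byte sets rather than running Lua code per character):

```lua
local lpeg = require "lpeg"
local R, C, Cmt = lpeg.R, lpeg.C, lpeg.Cmt

-- One UTF-8 encoded code point (loose: overlong forms are not rejected).
local cont = R("\128\191")
local utf8char = R("\0\127")
               + R("\194\223") * cont
               + R("\224\239") * cont * cont
               + R("\240\244") * cont * cont * cont

-- Decode one UTF-8 sequence (as matched by utf8char) to its code point.
local function decode(u)
  local b1 = u:byte(1)
  local cp
  if     b1 < 0x80 then cp = b1          -- ASCII
  elseif b1 < 0xE0 then cp = b1 % 0x20   -- 2-byte lead
  elseif b1 < 0xF0 then cp = b1 % 0x10   -- 3-byte lead
  else                  cp = b1 % 0x08   -- 4-byte lead
  end
  for k = 2, #u do cp = cp * 0x40 + u:byte(k) % 0x40 end
  return cp
end

-- A capture that matches one code point and produces its value.
local upointvalue = C(utf8char) / decode

-- An emulation of the proposed lpeg.utf8.range(v1, v2).
local function urange(v1, v2)
  return Cmt(C(utf8char), function(_, pos, u)
    local cp = decode(u)
    if cp >= v1 and cp <= v2 then return pos end
  end)
end
```

For example, urange(0x2000, 0x200B) matches any of the space characters U+2000 through U+200B, one code point at a time.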

Re: LPeg support for utf-8

Chris Babcock
On Fri, Apr 1, 2011 at 11:28 AM, Roberto Ierusalimschy
<[hidden email]> wrote:

>> Roberto Ierusalimschy <[hidden email]> wrote:
>>
>> > A quick survey, for those who care:
>> > - should LPeg support utf-8?
>> > - If so, what would that mean?
>>
>> An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
>> instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
>> with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
>> Perhaps there should be a .uB as well. I would prefer this to a "unicode
>> mode" which changes the behaviour of the existing functions.
>
> This is more or less what I had in mind (specific names
> notwithstanding). But the question still remains whether each of these
> constructions (uS, uR, etc.) is really useful, and whether there should
> be others. For instance, would it be worth supporting something like
> properties (using wctype)? Or a capture that matches one code point
> and captures its value?
>
> -- Roberto
>


Right. Lua and LPeG do what is expected with UTF-8, which is to say
that they don't break the naive support that is the defining feature
of UTF-8. So UTF-8 support isn't as much a goal as doing Unicode
support in a platform independent way that includes those platforms
whose system encoding happens to be UTF-8. Since that will break
applications expecting the naive byte stream approach, Unicode
equivalents of the current functions are the right approach.

Since LPeG is concerned with semantics rather than presentation, the
code point is the right unit for captures and counting. Given the Lua
implementation of numbers as floats, using UCS-4 for the internal
representation is probably "the Lua thing to do"... and it would save
Lua-l from the "should we migrate from UCS-2 to UCS-4" discussion at
some point in the future. ;)

The best approach for the community would probably be a Unicode
library that provides a string API, imports basic UTF character sets,
and provides classes that store and manipulate Unicode text as either
UCS-2 or UCS-4. Java, Python and .NET manage to get along just fine
with UCS-2, which uses half the memory of UCS-4 for most applications
but involves a bit of ugliness for character sets that go outside the
Basic Multilingual Plane. That's an implementation detail, though. This
year memory is cheap and we aren't concerned with Unicode on too many
16 bit processors so my preference leans towards UCS-4, but I think it
mostly comes down to how the implementor wants to handle characters
outside the BMP. Either way, if LPeG Unicode functions relied on a
Unicode string API then that API would likely be robust enough for
most other Unicode string applications. By the same token,
community-supplied implementations that use the same API should work
with LPeG when someone gets the itch for an implementation that is
better optimized for their application.

And why I say "string API":

http://unicode.org/faq/utf_bom.html#utf32-4

Unless you want to get bogged down in formal details that are unlikely
to be relevant to a working grammar, "c" is a string rather than a
character, and it generates a match on a word containing "ch" even in a
Czech or Spanish locale.

Chris
--
Yippee-ki-yay, coffee maker.

Re: LPeg support for utf-8

Miles Bader-2
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy <[hidden email]> writes:
> A quick survey, for those who care:
> - should LPeg support utf-8?
> - If so, what would that mean?

The user can easily support UTF-8 input in a parser already; what else
would be needed?

Maybe one thing to do would be to look at a typical "utf-8 string"
grammar, see if there are any obvious inefficiencies in the resulting
parser, and try to come up with ways to address them?

-Miles

--
gravity a demanding master ... soft soft snow

Re: LPeg support for utf-8

Tony Finch
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy <[hidden email]> wrote:
>
> This is more or less what I had in mind (specific names
> notwithstanding). But the question still remains whether each of these
> constructions (uS, uR, etc.) is really useful, and whether there should
> be others.

I think the main thing that is difficult for an lpeg user to do and which
lpeg itself should do reasonably well is union and intersection on sets of
unicode code points encoded in UTF-8. Hence uS and uR being my first
suggestions.

At the moment lpeg optimizes certain pattern combinations into character
set operations. I wonder if the pattern optimizer is powerful enough to do
this efficiently for patterns that match UTF-8 encoded unicode character
sets, or if it needs some extra logic to optimize matches across sets of
the next 1-4 octets.

> For instance, would it be worth to support something like properties
> (using wctype)?

I didn't suggest this because I don't think standard APIs give you
convenient access to a unicode character properties table, which would
make it difficult to compile character property matches efficiently.

Perhaps lpeg could come with a separate sub-module containing patterns for
unicode properties - separated out for size reasons. PCRE's ucptable.c is
over 440 KiBytes source and compiles to 88.5 KiBytes data.

> Or a capture that matches one code point and catures its value?

Could be useful, yes.

Brief rant follows: One thing I really hate about the standards for UTF-8
is that they require applications to break (e.g. lose data or abort) when
they encounter an invalid code point. This is very very bad. It would be
much better if there were some code points reserved for invalid octets in
UTF-8 data, so that if an app were mistakenly given a binary data stream
or some slightly corrupt data or some ISO 8859 text, it could at least
pass it through unscathed. Bah.

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
South-east Iceland: Variable 4, becoming easterly or southeasterly, then
cyclonic 6 to gale 8, occasionally severe gale 9 later. Very rough or high.
Rain or squally showers. Moderate or good.

Re: LPeg support for utf-8

Tony Finch
In reply to this post by Chris Babcock
Chris Babcock <[hidden email]> wrote:
>
> Since LPeG is concerned with semantics rather than presentation, the
> code point is the right unit for captures and counting. Given the Lua
> implementation of numbers as floats, using UCS-4 for the internal
> representation is probably "the Lua thing to do"...

No, keep strings as UTF-8 blobs. Erlang has found that this is the way to
keep things efficient, since most data you are dealing with does not need
serious per-codepoint analysis. lpeg can parse at the byte level and
identify the parts of the string that need more intensive processing.

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
Humber, Thames, Dover, Wight, Portland, Plymouth, North Biscay: Westerly or
southwesterly 3 or 4, increasing 5 to 7 later. Slight or moderate,
occasionally rough in Plymouth and north Biscay. Mainly fair. Moderate or
good.

Re: LPeg support for utf-8

Thomas Harning Jr.
In reply to this post by Roberto Ierusalimschy
On Fri, Apr 1, 2011 at 2:28 PM, Roberto Ierusalimschy
<[hidden email]> wrote:

>> Roberto Ierusalimschy <[hidden email]> wrote:
>>
>> > A quick survey, for those who care:
>> > - should LPeg support utf-8?
>> > - If so, what would that mean?
>>
>> An alternative to lpeg.P(N) which matches N UTF-8 encoded code points
>> instead of octets. Similarly, alternatives to lpeg.R and lpeg.S that deal
>> with code points instead of octets. Maybe lpeg.uP and .uS and .uR ?
>> Perhaps there should be a .uB as well. I would prefer this to a "unicode
>> mode" which changes the behaviour of the existing functions.
>
> This is more or less what I had in mind (specific names
> notwithstanding). But the question still remains whether each of these
> constructions (uS, uR, etc.) is really useful, and whether there should
> be others. For instance, would it be worth supporting something like
> properties (using wctype)? Or a capture that matches one code point
> and captures its value?
I think uP, uS and uR would be very valuable. Right now in LuaJSON, I
have a set of hand-prepared UTF-8 character sequences that attempt to
optimize a pattern matching a series of whitespace characters scattered
across the Unicode table.  uS and uR, while complicated to implement,
could produce an output pattern optimized for the desired codepoints.

For example, with chr = string.char and P, S = lpeg.P, lpeg.S, a match
for both U+0085 and U+00A0 (encoded as C2 85 and C2 A0) could
effectively be:

  P(chr(0xC2)) * S(chr(0x85) .. chr(0xA0))

... or whatever is the most efficient means of matching a common prefix.
A match for common prefixes and intermediate characters could also be done.

As for cost of implementation, I haven't entirely worked out a mental
model of how one would take a set of N codepoints or a range and find
the optimal pattern... unless there is an easier mechanism to
implement directly in C.

As for suggestions on implementation, I would recommend the following:
 - uP(codepoint) - receives a codepoint as an integer and returns the
   appropriate literal match
 - uR(start_codepoint1, end_codepoint1, (start_codepointx,
   end_codepointx)*) - receives pairs of codepoints as integers and
   creates an optimal range match
 - uR(table of codepoint pairs) - effectively the same as uR with the
   table unpacked... not sure if this would be important enough
 - uS(codepoint_1, (codepoint_x)*) - returns a set match on each of the
   listed codepoints
 - uS(table of codepoints) - returns a set match on the list of
   codepoints in the table

A utility to return the utf-8 representation of a given codepoint
would also be useful in testing (both LPeg itself and relying
applications) to avoid having to roll your own tool to encode utf-8.
--
Thomas Harning Jr.
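Such an encoder is only a few lines of plain Lua. A sketch (the name toutf8 is illustrative, not an existing API, and no validation of the input range is done):

```lua
-- Encode one or more integer code points as a single UTF-8 string.
local function toutf8(...)
  local out = {}
  for _, cp in ipairs({...}) do
    if cp < 0x80 then                          -- 1 byte: 0xxxxxxx
      out[#out + 1] = string.char(cp)
    elseif cp < 0x800 then                     -- 2 bytes: 110xxxxx 10xxxxxx
      out[#out + 1] = string.char(0xC0 + math.floor(cp / 0x40),
                                  0x80 + cp % 0x40)
    elseif cp < 0x10000 then                   -- 3 bytes
      out[#out + 1] = string.char(0xE0 + math.floor(cp / 0x1000),
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    else                                       -- 4 bytes
      out[#out + 1] = string.char(0xF0 + math.floor(cp / 0x40000),
                                  0x80 + math.floor(cp / 0x1000) % 0x40,
                                  0x80 + math.floor(cp / 0x40) % 0x40,
                                  0x80 + cp % 0x40)
    end
  end
  return table.concat(out)
end
```

For instance, toutf8(0x85, 0xA0) yields the byte sequence C2 85 C2 A0 discussed above.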

Re: LPeg support for utf-8

Tony Finch
Thomas Harning Jr. <[hidden email]> wrote:
>
> As for suggestions on implementation... I would recommend the following:
>  uP (codepoint) - receive codepoint as an integer and return the
> appropriate literal match

This is a bit confusing since it clashes with the existing lpeg.P
function. Also it has exactly the same semantics as your suggested
lpeg.uS.

>  uR (start_codepoint1, end_codepoint1, (start_codepointx,
> end_codepointx)*) - receive pairs of codepoints as integers and create
> an optimal range match
>  uR (table of codepoint pairs) - effectively the same as uR with the
> table unpacked... not sure if this would be important enough
>  uS (codepoint_1, (codepoint_x)*) - returns a set match on each of the
> listed codepoints
>  uS (table of codepoints) - returns a set match on the list of
> codepoints in the table

All of these should take UTF-8 strings as arguments from which lpeg should
extract a list of codepoints or codepoint pairs, as in the existing lpeg.R
and lpeg.S functions.

> A utility to return the utf-8 representation of a given codepoint
> would also be useful in testing (both LPeg itself and relying
> applications) to avoid having to roll your own tool to encode utf-8.

Actually, if you have this function then the expanded versions of the
lpeg functions that you suggested would be redundant. You could just write
lpeg.uS(toutf8(8,10,13,32,0xA0,0x202F,0x205F,0x3000,0xFEFF)) + lpeg.uR(toutf8(0x2000,0x200b))

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
German Bight, Humber, Thames, Dover: Southwest 5 or 6, increasing 7 for a time
in German Bight, veering west 4 or 5 later. Slight or moderate, occasionally
rough in German Bight. Fog patches. Moderate or good, occasionally very poor.

Re: LPeg support for utf-8

Edgar Toernig
In reply to this post by Roberto Ierusalimschy
> A quick survey, for those who care:
> - should LPeg support utf-8?
> - If so, what would that mean?

I think it would help (not only LPeg) to add unicode
escape sequences (\uXXXX) to Lua's strings.

Ciao, ET.

Re: LPeg support for utf-8

Matthew Frazier
On 04/07/2011 03:15 PM, E. Toernig wrote:
>> A quick survey, for those who care:
>> - should LPeg support utf-8?
>> - If so, what would that mean?
>
> I think it would help (not only LPeg) to add unicode
> escape sequences (\uXXXX) to Lua's strings.
>
> Ciao, ET.
>

Lua itself does not have Unicode support. So, if you're suggesting that
this would merely generate the proper UTF-8 bytes for the
character...hmm, that's actually a pretty good idea.
--
Regards, Matthew Frazier

Re: LPeg support for utf-8

Dimiter 'malkia' Stanev
In reply to this post by Tony Finch
> No, keep strings as UTF-8 blobs. Erlang has found that this is the way to
> keep things efficient, since most data you are dealing with does not need
> serious per-codepoint analysis. lpeg can parse at the byte level and
> identify the parts of the string that need more intensive processing.
>
> Tony.

Off topic, but I believe Erlang stores each character as a 32-bit value,
according to this:

http://schemecookbook.org/Erlang/StringBasics

"To understand why Erlang string handling is less efficient than a
language like Perl, you need to know that each character uses 8 bytes of
memory. That's right -- 8 bytes, not 8 bits! Erlang stores each
character as a 32-bit integer, with a 32-bit pointer for the next item
in the list (remember, strings are lists of characters.)"

Re: LPeg support for utf-8

Tony Finch
Dimiter "malkia" Stanev <[hidden email]> wrote:

> > No, keep strings as UTF-8 blobs. Erlang has found that this is the way to
> > keep things efficient, since most data you are dealing with does not need
> > serious per-codepoint analysis. lpeg can parse at the byte level and
> > identify the parts of the string that need more intensive processing.
>
> Off topic, but I believe erlang stores each character as 32-bit value,
> according to this:
>
> http://schemecookbook.org/Erlang/StringBasics

Right, hence efficient string handling for bulk data is done with binaries
rather than the usual string type.

http://www.erlang.org/doc/efficiency_guide/binaryhandling.html

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
Forties, Cromarty, Forth: West 5 to 7, occasionally gale 8 in Cromarty,
becoming variable 3 or 4 later. Slight or moderate, but rough or very rough in
Forties at first. Fair. Moderate or good, occasionally poor later.

Re: LPeg support for utf-8

Patrick Rapin
In reply to this post by Matthew Frazier
> Lua itself does not have Unicode support. So, if you're suggesting that this
> would merely generate the proper UTF-8 bytes for the character...hmm, that's
> actually a pretty good idea.

I second this. It should not be complicated to implement, and it is in
the same vein as the "\xBB" introduction in Lua 5.2.  And, while
possible, it is not obvious how to write that feature in pure Lua.
But of course, we will have to make clear that only a UTF-8 sequence
is generated with that syntax and that's all (no UTF-16 support, nor
other Unicode handling).

Re: LPeg support for utf-8

Tony Finch
In reply to this post by Matthew Frazier
Matthew Frazier <[hidden email]> wrote:
> On 04/07/2011 03:15 PM, E. Toernig wrote:
> >
> > I think it would help (not only LPeg) to add unicode
> > escape sequences (\uXXXX) to Lua's strings.
>
> Lua itself does not have Unicode support. So, if you're suggesting that this
> would merely generate the proper UTF-8 bytes for the character...hmm, that's
> actually a pretty good idea.

Try the attached patch (with versions for 5.1 and 5.2).

Tony.
--
f.anthony.n.finch  <[hidden email]>  http://dotat.at/
Humber, Thames: Variable 3 or 4. Slight or moderate. Fair. Moderate or good,
occasionally poor later.

lua-5.1-utf8esc.patch (3K)
lua-5.2-utf8esc.patch (3K)