Patterns: Why are anchors not character classes?

Patterns: Why are anchors not character classes?

John Hind
I've just had another bout of wrestling with patterns, probably the most
obscurantist part of Lua and the only bit I've never been able to learn; I
always have to get "Programming in Lua" off the shelf. Take the simple case
of parsing a comma-separated list: the kludge I keep coming back to is
concatenating a terminating comma onto the source string before applying the
pattern:

    ps = "1,22,33,4444"
    ps = ps .. ","
    for n in ps:gmatch("(%d+),") do
        print(n)
    end

If the start and end anchors behaved like character classes we could do
better. For regularity they probably should be %^ and %$ rather than the
existing special characters (and yes, I did have to look those up - as far
as I can see the symbols have absolutely no mnemonic value). Then we could
write:

    ps = "1,22,33,4444"
    for n in ps:gmatch("(%d+)[,%$]") do
        print(n)
    end

The other thing I'd like to see is some "standard" patterns as constants in
the string library for the common functions that a C programmer for example
will miss ("trimright", "trimleft" and "trim", for example).




Re: Patterns: Why are anchors not character classes?

Parke
On Tue, Jul 14, 2015 at 9:21 AM, John Hind <[hidden email]> wrote:

> I've just had another bout of wrestling with patterns, probably the most
> obscurantist part of Lua and the only bit I've never been able to learn and
> always have to get "Programming in Lua" off the shelf. Take the simple case
> of parsing a comma-separated list, the kludge I keep coming back to is
> concatenating a terminating comma onto the source string before applying the
> pattern:
>
>     ps = "1,22,33,4444"
>     ps = ps .. ","
>     for n in ps:gmatch("(%d+),") do
>         print(n)
>     end

Why not:

for s in ('1,22,33,4444') : gmatch '%d+' do  print ( s )  end

or

for s, c in ('1,22,33,4444') : gmatch '(%d+)(,?)' do  print ( s, c )  end

If you need more than either of the above, have you tried LPeg?

http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html

-Parke


Re: Patterns: Why are anchors not character classes?

Paul K-2
In reply to this post by John Hind
Hi John,

> Take the simple case of parsing a comma-separated list, the kludge I keep coming back to is concatenating a terminating comma onto the source string before applying the pattern:
...
> for n in ps:gmatch("(%d+)[,%$]") do

I ran into this a couple of times and found that the frontier pattern
can do this job:

ps:gmatch("(%d+)%f[^%d]")

Paul.


Re: Patterns: Why are anchors not character classes?

Parke
On Tue, Jul 14, 2015 at 9:39 AM, Paul K <[hidden email]> wrote:

> ps:gmatch("(%d+)%f[^%d]")

When will the above be different from ps:gmatch("%d+") ?

-Parke


Re: Patterns: Why are anchors not character classes?

Javier Guerra Giraldez
On Tue, Jul 14, 2015 at 12:08 PM, Parke <[hidden email]> wrote:
> On Tue, Jul 14, 2015 at 9:39 AM, Paul K <[hidden email]> wrote:
>
>> ps:gmatch("(%d+)%f[^%d]")
>
> When will the above be different from ps:gmatch("%d+") ?


when a more ethical Lua gets developed, and we get "moderately greedy"
patterns that consume only a fair share of characters.  :-)

--
Javier


Re: Patterns: Why are anchors not character classes?

Paul K-2
>>> ps:gmatch("(%d+)%f[^%d]")
>> When will the above be different from ps:gmatch("%d+") ?
> when a more ethical Lua gets developed, and we get "moderately greedy"
> patterns that consume only a fair share of characters.  :-)

Indeed ;). I was thinking about a more complex pattern for extracting
numbers, which in this case wasn't needed:

ps = "1,22,33,4444,abc123def"
for n in ps:gmatch("%f[%w](%d+)%f[%W]") do
  print(n)
end

This would only extract "standalone" numbers (and not those that are
part of words).
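
To make the difference concrete, here is a small sketch that runs both the plain `%d+` pattern and the frontier-guarded one over the sample string. The helper name `collect` is made up for this illustration; it is not part of the string library.

```lua
-- Collect every match of `pattern` in `s` into a list.
-- `collect` is a hypothetical helper, not a standard function.
local function collect(s, pattern)
  local out = {}
  for m in s:gmatch(pattern) do
    out[#out + 1] = m
  end
  return out
end

local ps = "1,22,33,4444,abc123def"

-- Plain greedy match: also picks up the digits embedded in "abc123def".
local plain = collect(ps, "%d+")

-- Frontier-guarded match: the digits must begin and end on a boundary
-- between alphanumeric and non-alphanumeric characters.
local standalone = collect(ps, "%f[%w](%d+)%f[%W]")

print(table.concat(plain, " "))       -- 1 22 33 4444 123
print(table.concat(standalone, " "))  -- 1 22 33 4444
```

The final "4444" would still match if it were the last thing in the string, because the end of the subject is treated as '\0', which belongs to %W.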

Paul.


Re: Patterns: Why are anchors not character classes?

Jonathan Goble
On Tue, Jul 14, 2015 at 1:42 PM, Paul K <[hidden email]> wrote:

> Indeed ;). I was thinking about a more complex pattern for extracting
> numbers, which in this case wasn't needed:
>
> ps = "1,22,33,4444,abc123def"
> for n in ps:gmatch("%f[%w](%d+)%f[%W]") do
>   print(n)
> end
>
> This would only extract "standalone" numbers (and not those that are
> part of words).

Conversely, if you just simply want everything separated by commas no
matter what, and there's no chance of an escaped comma that is
actually part of what you want to capture, then this simple pattern
will do the trick:

ps = "1,22,33,4444,abc123def"
for item in ps:gmatch("[^,]+") do
    print(item)
end

Output:
1
22
33
4444
abc123def

This essentially just captures each consecutive run of stuff other than commas.

I would argue that the string library ought to have a proper split
function, but it's not too difficult to write one in Lua. I'll leave
it as an exercise for the reader, and I think there's a whole page on
the wiki about it also.
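
For readers who would rather not do the exercise, one possible sketch of such a split function follows. The name `split` and its signature are assumptions for this example, not an existing library function. It treats the separator as plain text (no pattern magic) and keeps empty fields.

```lua
-- Split `s` on the plain-text separator `sep`, keeping empty fields.
-- `split` is a hypothetical helper; Lua's string library has no such function.
local function split(s, sep)
  assert(sep ~= "", "separator must not be empty")
  local fields = {}
  local start = 1
  while true do
    -- The fourth argument `true` turns off pattern matching for `sep`.
    local i, j = s:find(sep, start, true)
    if not i then
      fields[#fields + 1] = s:sub(start)   -- last (possibly empty) field
      break
    end
    fields[#fields + 1] = s:sub(start, i - 1)
    start = j + 1
  end
  return fields
end

local items = split("1,22,,4444,abc123def", ",")
print(#items)                    -- 5
print(table.concat(items, "|"))  -- 1|22||4444|abc123def
```

Using string.find with the plain flag avoids surprises when the separator contains characters such as "." or "%" that are magic in patterns.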


Re: Patterns: Why are anchors not character classes?

Peter Aronoff
Jonathan Goble <[hidden email]> wrote:
> I would argue that the string library ought to have a proper split function,
> but it's not too difficult to write one in Lua. I'll leave it as an exercise
> for the reader, and I think there's a whole page on the wiki about it also.

Related: https://bitbucket.org/telemachus/string-split. And, yes, the wiki was
very helpful. (I should be releasing this to luarocks in the next week.)

--
We have not been faced with the need to satisfy someone else's
requirements, and for this freedom we are grateful.
    Dennis Ritchie and Ken Thompson, The UNIX Time-Sharing System


Re: Patterns: Why are anchors not character classes?

Geoff Leyland
In reply to this post by Paul K-2
> I ran into this a couple of times and found that the frontier pattern
> can do this job:
>
> ps:gmatch("(%d+)%f[^%d]")

I’m sure you know this Paul, but if you are reading CSV and you wish to know which field you’re reading, the frontier pattern breaks down when there are empty fields like in "1,2,,4,5"
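
A short sketch of the problem, and of the trailing-comma workaround from the start of the thread applied to it. The variable names are only illustrative.

```lua
local line = "1,2,,4,5"

-- "%d+" silently skips the empty field, so "4" is reported as field 3.
local digits_only = {}
for n in line:gmatch("%d+") do
  digits_only[#digits_only + 1] = n
end

-- Appending a comma and consuming it in the pattern makes every field,
-- including the empty one, produce exactly one match.
local fields = {}
for field in (line .. ","):gmatch("([^,]*),") do
  fields[#fields + 1] = field
end

print(#digits_only, digits_only[3])  -- 4   4
print(#fields, fields[3] == "")      -- 5   true
```

Because each match consumes its comma, no match is ever empty, which sidesteps the differences between Lua versions in how gmatch handles empty matches.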



Re: Patterns: Why are anchors not character classes?

Paul K-2
>> ps:gmatch("(%d+)%f[^%d]")

> I’m sure you know this Paul, but if you are reading CSV and you wish to know which field you’re reading, the frontier pattern breaks down when there are empty fields like in "1,2,,4,5"

Noted. In this case the frontier pattern wasn't doing anything useful,
as we just discussed earlier in the thread.

Paul.



Re: Patterns: Why are anchors not character classes?

Dirk Laurie-2
In reply to this post by John Hind
2015-07-14 18:21 GMT+02:00 John Hind <[hidden email]>:
> I've just had another bout of wrestling with patterns, probably the most
> obscurantist part of Lua and the only bit I've never been able to learn and
> always have to get "Programming in Lua" off the shelf.

Oh, I just use the online manual and pick the last string function in
the index. The patterns come immediately after. I can't remember
all of them either. :-)

> Take the simple case of parsing a comma-separated list

That's [^,]+ or [^,]* depending on whether empties are allowed.

> If the start and end anchors behaved like character classes we could do
> better. For regularity they probably should be %^ and %$

An interesting notion. I'll probably use it.

> The other thing I'd like to see is some "standard" patterns as constants in
> the string library for the common functions that a C programmer for example
> will miss ("trimright", "trimleft" and "trim", for example).

Well, now. A little module that monkeypatches those into the string library
is surely not too hard?
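
As a rough idea of what such a module could look like, here is a minimal sketch. The names `trimleft`, `trimright` and `trim` are taken from the request above; none of them exist in the standard string library, and other implementations are certainly possible.

```lua
-- Hypothetical additions to the string table; not standard functions.
function string.trimleft(s)
  -- The extra parentheses drop gsub's second result (the match count).
  return (s:gsub("^%s+", ""))
end

function string.trimright(s)
  return (s:gsub("%s+$", ""))
end

function string.trim(s)
  -- ".-" is the shortest match, so trailing whitespace is left to "%s*$".
  return s:match("^%s*(.-)%s*$")
end

-- Strings use the string table as their method table, so the new
-- functions are immediately available with method-call syntax.
print("[" .. ("  hi there  "):trim() .. "]")       -- [hi there]
print("[" .. ("  hi there  "):trimleft() .. "]")   -- [hi there  ]
print("[" .. ("  hi there  "):trimright() .. "]")  -- [  hi there]
```

Monkeypatching the shared string table affects every chunk running in the same Lua state, so a module that returns plain functions may be a safer choice in larger programs.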


Re: Patterns: Why are anchors not character classes?

John Hind
In reply to this post by John Hind
Thanks everyone, this has advanced my understanding of patterns.

The [^,]+ pattern is good and I'll probably use that.

I'd missed the frontier pattern completely, probably because it postdates
my edition of "Programming in Lua"; as Dirk says, I should use the
electronic documentation. However, the documentation itself is interesting:

"%f[set], a frontier pattern; such item matches an empty string at any
position such that the next character belongs to set and the previous
character does not belong to set. The set set is interpreted as previously
described. The beginning and the end of the subject are handled as if they
were the character '\0'."

Here the beginning and end are not just character classes but are
(unnecessarily) given an explicit byte encoding breaking the "8-bit clean"
rule for strings. There would be no need for them to have byte encodings if
beginning and end were separate character classes.

Having explicit and distinct character classes for "beginning of subject"
and "end of subject" would regularise and formalise the conceptual framework
as well as adding practical expressiveness.
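
The overlap between "end of subject" and a real '\0' byte can be seen in a small experiment. This is only a sketch; the function name is invented for the example.

```lua
-- Position at which the frontier "%f[^%a]" first matches in `s`,
-- i.e. the first place where a letter is followed by a non-letter.
-- `first_non_letter_frontier` is a hypothetical helper.
local function first_non_letter_frontier(s)
  return (s:find("%f[^%a]"))
end

-- "b" followed by the end of the subject.
local at_end = first_non_letter_frontier("ab")

-- "b" followed by a real '\0' byte, with more text after it.
local at_nul = first_non_letter_frontier("ab\0cd")

print(at_end, at_nul)  -- 3   3
```

Both calls report position 3: the frontier sees the end of the subject and an embedded '\0' in the same way, so the pattern alone cannot tell them apart.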




Re: Patterns: Why are anchors not character classes?

Dirk Laurie-2
2015-07-15 11:26 GMT+02:00 John Hind <[hidden email]>:

>
> "%f[set], a frontier pattern; such item matches an empty string at any
> position such that the next character belongs to set and the previous
> character does not belong to set. The set set is interpreted as previously
> described. The beginning and the end of the subject are handled as if they
> were the character '\0'."
>
> Here the beginning and end are not just character classes but are
> (unnecessarily) given an explicit byte encoding breaking the "8-bit clean"
> rule for strings. There would be no need for them to have byte encodings if
> beginning and end were separate character classes.
>
> Having explicit and distinct character classes for "beginning of subject"
> and "end of subject" would regularise and formalise the conceptual framework
> as well as adding practical expressiveness.

Something like this: "%F[bos][set][eos]"?

At some stage, LPEG becomes the proper tool to use, rather than
duplicating its advanced features in the string library.


Re: Patterns: Why are anchors not character classes?

Brigham Toskin
Slightly off topic, but since it's been brought up a few times in this thread... Has anyone out there written a "human-readable" patch/wrapper for LPEG? Every time I try to look at a non-trivial example, my brain starts to melt.

On Wed, Jul 15, 2015 at 5:14 AM, Dirk Laurie <[hidden email]> wrote:
> 2015-07-15 11:26 GMT+02:00 John Hind <[hidden email]>:
>>
>> "%f[set], a frontier pattern; such item matches an empty string at any
>> position such that the next character belongs to set and the previous
>> character does not belong to set. The set set is interpreted as previously
>> described. The beginning and the end of the subject are handled as if they
>> were the character '\0'."
>>
>> Here the beginning and end are not just character classes but are
>> (unnecessarily) given an explicit byte encoding breaking the "8-bit clean"
>> rule for strings. There would be no need for them to have byte encodings if
>> beginning and end were separate character classes.
>>
>> Having explicit and distinct character classes for "beginning of subject"
>> and "end of subject" would regularise and formalise the conceptual framework
>> as well as adding practical expressiveness.
>
> Something like this: "%F[bos][set][eos]"?
>
> At some stage, LPEG becomes the proper tool to use, rather than
> duplicating its advanced features in the string library.




--
Brigham Toskin

Re: Patterns: Why are anchors not character classes?

Paul K-2
> Slightly off topic, but since it's been brought up a few times in this thread... Has anyone out there written a "human-readable" patch/wrapper for LPEG? Every time I try to look at a non-trivial example, my brain starts to melt.

There is "re", which gives you access to most (all?) functionality of
LPEG while following a more familiar syntax. Even though I'm very
familiar with regexps, I now find LPEG expressions easier to read than
corresponding "re" scripts, but it's probably a matter of experience.

I haven't seen any human-readable wrapper, but if you are interested
in what happens during the execution of LPEG patterns, you may find
PegDebug useful (https://github.com/pkulchenko/PegDebug).

Paul.


Re: Patterns: Why are anchors not character classes?

Parke
On Wed, Jul 15, 2015 at 5:20 PM, Paul K <[hidden email]> wrote:
>> Slightly off topic, but since it's been brought up a few times in this thread... Has anyone out there written a "human-readable" patch/wrapper for LPEG? Every time I try to look at a non-trivial example, my brain starts to melt.
>
> There is "re", which gives you access to most (all?) functionality of
> LPEG while following a more familiar syntax. Even though I'm very
> familiar with regexps, I now find LPEG expressions easier to read than
> corresponding "re" scripts, but it's probably a matter of experience.

I enjoy using LPeg's re module extensively.

http://www.inf.puc-rio.br/~roberto/lpeg/re.html

There are some things you cannot do directly in re.  For example, back
captures.  (As opposed to back references, which can be done in re.)

http://www.inf.puc-rio.br/~roberto/lpeg/#cap-b

Indirectly, you can use LPeg to create a back capture, and then use
that pre-created back capture via the %name syntax in re.

-Parke


Re: Patterns: Why are anchors not character classes?

John Hind
In reply to this post by John Hind
Wed, 15 Jul 2015 14:14:03 +0200 Dirk Laurie <[hidden email]>:


>> Having explicit and distinct character classes for "beginning of subject"
>> and "end of subject" would regularise and formalise the conceptual
>> framework as well as adding practical expressiveness.

> Something like this: "%F[bos][set][eos]"?
> At some stage, LPEG becomes the proper tool to use, rather than duplicating
> its advanced features in the string library.

I suggest adding '%^' and '%$' as character classes. Deprecation of the
existing '^' and '$' as anchors is a policy decision, they would be
redundant, but might be retained for backward compatibility. New character
classes are unlikely to break existing code.

Normally the frontier pattern will work as before since we just want
"beginning of subject" and (more usually) "end of subject" to be treated as
'characters' which are not in the set. This way we can still use '%z' in a
frontier set to represent an actual character '\0' without the current risk
of confusing it with the "as if" character '\0' at the end of subject. But
we can also use '%^' and '%$' explicitly in any set definition (including
but not limited to the frontier pattern). So, for example, a parameter list
can be patterned with:

"%s*(%w+)%s*[,%$]"

I think this is both conceptually and in implementation simpler and cleaner
than the current definition, not an "advanced feature" at all.
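
Until something like this exists, the effect of the proposed "[,%$]" can be approximated in current Lua by trying two anchored patterns in turn, one ending in a comma and one ending at the end of the subject. The following is a sketch under that assumption; the function name `params` is invented here, and it simply stops at the first item it cannot parse.

```lua
-- Emulate the proposed "%s*(%w+)%s*[,%$]" with two anchored patterns.
-- `params` is a hypothetical helper, not part of any library.
local function params(s)
  local out, pos = {}, 1
  while pos <= #s do
    -- With an init argument, "^" anchors the match at position `pos`.
    -- The "()" capture returns the position just after the comma.
    local word, nextpos = s:match("^%s*(%w+)%s*,()", pos)
    if not word then
      -- No comma follows: accept the item only if it ends the subject.
      word = s:match("^%s*(%w+)%s*$", pos)
      nextpos = #s + 1
    end
    if not word then break end  -- malformed input: stop here
    out[#out + 1] = word
    pos = nextpos
  end
  return out
end

print(table.concat(params("a , b,c "), "|"))  -- a|b|c
```

Unlike the trailing-comma kludge, this leaves the source string untouched, at the cost of a loop instead of a single gmatch call.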




Re: Patterns: Why are anchors not character classes?

Dirk Laurie-2
In reply to this post by John Hind
2015-07-14 18:21 GMT+02:00 John Hind <[hidden email]>:

> If the start and end anchors behaved like character classes we could do
> better. For regularity they probably should be %^ and %$ rather than the
> existing special characters

We have had quite a pleasant little discussion in this thread, and not
much has been left unsaid, but I would like to return to the point that
after all deals with the actual subject.

We are dealing with two different concepts here.

1. Anchors ^ and $, implemented. These tell the find/match/substitute
routines that the pattern should only be tested for at respectively the
beginning or the end of the subject.

2. Character classes %^ and %$, proposed. These represent an empty
string at respectively the beginning or the end of the subject. This breaks
the rule that %x means literally x when x is non-alphanumeric, which
IMHO overrules the regularity argument, so I would prefer to denote
these, too, by just ^ and $.

It seems adequate to allow this usage of ^ and $ as set components only,
not as character classes that may appear anywhere. There is the difficulty
that ^ is also the set complement character.
[As I type this, a new post by John Hind has arrived, and readers will have
read it before getting here. I shall finish this post without having done
more than glance at that, even though some points overlap.]

The proposal, whether with or without, does not seem terribly hard
to implement. We could get some hands-on experience on them
instead of making the capital mistake of theorizing without data.
Power patch, anyone?


Re: Patterns: Why are anchors not character classes?

Dirk Laurie-2
2015-07-16 12:16 GMT+02:00 Dirk Laurie <[hidden email]>:
> 2015-07-14 18:21 GMT+02:00 John Hind <[hidden email]>:
>> If the start and end anchors behaved like character classes we could do
>> better.
> These represent an empty string at respectively the beginning or the end
> of the subject. ... The proposal ... does not seem terribly hard to implement.
> We could get some hands-on experience on them instead of making the
> capital mistake of theorizing without data. Power patch, anyone?

Taking up my own challenge, I discovered that it is much trickier than
I thought. `[set]` is hardcoded to match one character of the subject or
to report no match. So matching an empty string at the beginning or
end of the subject requires a redesign of more than just the function
`matchbracketclass`. That's out of my league.


Re: Patterns: Why are anchors not character classes?

Coda Highland
In reply to this post by Dirk Laurie-2
On Thu, Jul 16, 2015 at 3:16 AM, Dirk Laurie <[hidden email]> wrote:
> It seems adequate to allow this usage of ^ and $ as set components only,
> not as character classes that may appear anywhere. There is the difficulty
> that ^ is also the set complement character.

There's also the difficulty that existing code may well be looking for
something like [!@#$%^&*()] and therefore change behavior, with no
warning possible and potentially a long time of remaining hidden
before causing problems.

/s/ Adam
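
A quick check of how that set behaves in current Lua, as a sketch with an arbitrary test string:

```lua
local set = "[!@#$%^&*()]"

-- Inside a set, "$" is an ordinary character and "%^" is an escaped "^",
-- so both are matched literally.
local replaced, count = ("cost: $5 ^ 2"):gsub(set, "_")
print(replaced, count)  -- cost: _5 _ 2   2

-- A string containing none of the listed characters does not match at all,
-- not even at its end.
local found = ("abc"):find(set)
print(found)  -- nil
```

If a bare "$" inside a set came to mean "end of subject", the second call would start matching at the end of "abc", which is the silent change in behavior described above.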
