A bug in string.gmatch and string.gsub?


Dirk Laurie-2
Nothing in the description of `gsub` prepares the user for this:

> print((string.gsub(";a;", "a*", "ITEM")))
ITEM;ITEMITEM;ITEM

A similar thing happens with `gmatch`:

> I={}; for s in (";a;"):gmatch"a*" do I[#I+1]=s end
> print(table.concat(I,'|'))
|a||

This behaviour has been reported on the Wiki[1]. Unfortunately the fix
given there, namely "[^,]+", is not a fix. Extensive experimentation
has failed to reveal any way of fiddling with the pattern that produces
`ITEM;ITEM;ITEM` and `|a|` respectively.

Is it conceivable that this behaviour is actually a bug in `gmatch`
and `gsub`?

Let's compare it to the well-known Unix utility `sed`.

…/src$ echo ";a;" | sed -e "s/a*/ITEM/g"
ITEM;ITEM;ITEM

Well, well!

`sed` seems to follow the rule "An empty match is rejected when it is
adjacent to the previous match." Whereas `lstrlib.c` follows the rule
"After an empty match, the following match must start at a later
position", which is equivalent to "An empty match is rejected when it
is adjacent to the previous match and that match was also empty."
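
The two rules can be made concrete with a hand-written substitution loop. This is only a sketch under stated assumptions: `gsub_rule` and its `rule` argument are hypothetical names, not anything in `lstrlib.c`; the replacement is inserted as a plain string; and the pattern is assumed not to begin with `^`. It uses `string.find` anchored at the current position, so the result does not depend on how the installed `gsub` treats empty matches.

```lua
-- Hypothetical helper (not part of lstrlib.c): global substitution with the
-- empty-match rule made explicit. `repl` is inserted literally (no %1 etc.)
-- and `pat` is assumed not to start with "^".
--   rule == "lua": reject an empty match only when it sits where the
--                  previous match ended AND that match was empty too
--   rule == "sed": reject an empty match whenever it sits where the
--                  previous match ended
local function gsub_rule(s, pat, repl, rule)
  local out, count = {}, 0
  local src = 1                 -- index of the next unconsumed character
  local last_end                -- index just past the previous match (nil at start)
  local last_was_empty = false
  while true do
    -- A leading "^" anchors the attempt at position src.
    local b, e = string.find(s, "^" .. pat, src)
    local accepted = false
    local stop, is_empty
    if b then
      stop = e + 1                          -- index just past this match
      is_empty = (stop == src)
      local adjacent = (stop == last_end)
      if rule == "sed" then
        accepted = not (is_empty and adjacent)
      else
        accepted = not (is_empty and adjacent and last_was_empty)
      end
    end
    if accepted then
      count = count + 1
      out[#out + 1] = repl
      src, last_end, last_was_empty = stop, stop, is_empty
    elseif src <= #s then
      out[#out + 1] = string.sub(s, src, src)   -- copy one character unchanged
      src = src + 1
    else
      break
    end
  end
  return table.concat(out), count
end

print(gsub_rule(";a;", "a*", "ITEM", "lua"))  --> ITEM;ITEMITEM;ITEM	4
print(gsub_rule(";a;", "a*", "ITEM", "sed"))  --> ITEM;ITEM;ITEM	3
```

The only difference between the two branches is whether the previous match has to have been empty for the new empty match to be rejected.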

Let's say there is no bug in `gmatch` and `gsub`; the choice of rule
is an implementation detail.

Still, if it is an implementation detail, it can be changed without
changing the language. That possibility is demonstrated in the attached
version of `lstrlib.c`.

[1] <http://lua-users.org/wiki/SplitJoin> Method: Using only string.gsub


Re: A bug in string.gmatch and string.gsub?

Alexey Melnichuk
Dirk Laurie <dirk.laurie <at> gmail.com> writes:

>
> Nothing in the description of `gsub` prepares the user for this:
>
> > print((string.gsub(";a;", "a*", "ITEM")))
> ITEM;ITEMITEM;ITEM
>
The pattern "a*" matches the empty string after each character and at the
beginning of the string.
Think about this:
print((string.gsub(";a;", "", "ITEM")))
Maybe you want string.gsub(";a;", "a+", "ITEM")?
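
For reference, a small sketch of what those two calls return. Neither result should depend on which empty-match rule is in force, since the empty pattern never produces a non-empty match and "a+" never produces an empty one.

```lua
-- The empty pattern matches once at every position, including the very end.
print(string.gsub(";a;", "", "ITEM"))    --> ITEM;ITEMaITEM;ITEM	4

-- "a+" needs at least one 'a', so only the single 'a' is replaced.
print(string.gsub(";a;", "a+", "ITEM"))  --> ;ITEM;	1
```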



Re: A bug in string.gmatch and string.gsub?

Dirk Laurie-2
2013/4/29 Alexey Melnichuk <[hidden email]>:

> May be you want : string.gsub(";a;", "a+", "ITEM") ?

No, I don't. That is exactly what the Wiki page suggests. It gives
";ITEM;". It's more reasonable than "ITEM;ITEMITEM;ITEM", but I want
"ITEM;ITEM;ITEM". It should not be necessary, given Lua's elegant
patterns, to devote a 300-line webpage discussing how hard it is
to implement `split`.

I agree that there is a way to explain the current behaviour of "a*"
logically, BUT:

  (a) That way is based on a rule not stated in the documentation,
  (b) which is not the only possible rule consistent with it.
  (c) Another rule can be proposed,
  (d) which is consistent with the way `sed` behaves,
  (e) appears to be more intuitive,
  (f) and can be implemented quite efficiently.


Re: A bug in string.gmatch and string.gsub?

Peter Hickman-3
The reading of "a*" is to match zero or more 'a's

So
1) There are zero 'a's before the first ';' so print ITEM
2) The ';' is not an 'a' so just print ;
3) The 'a' matches 'a*' and so we print ITEM
4) After the last match, but before the second ';' there are zero 'a's so we print ITEM
5) The ';' is not an 'a' so just print ;
6) After the last ';' but before the end of the string there are zero 'a's so print ITEM

The only oddity for me is step 4 if my understanding of how this works is correct.

Of course without step 4 you would have what you wanted. To be honest I am mystified by step 4, it looks like a bug to me


Re: A bug in string.gmatch and string.gsub?

Mark Gabby

Yea, I'd agree that 4 is a bug. * is supposed to be greedy.

On Apr 29, 2013 9:04 AM, "Peter Hickman" <[hidden email]> wrote:
> The reading of "a*" is to match zero or more 'a's
>
> So
> 1) There are zero 'a's before the first ';' so print ITEM
> 2) The ';' is not an 'a' so just print ;
> 3) The 'a' matches 'a*' and so we print ITEM
> 4) After the last match, but before the second ';' there are zero 'a's so we print ITEM
> 5) The ';' is not an 'a' so just print ;
> 6) After the last ';' but before the end of the string there are zero 'a's so print ITEM
>
> The only oddity for me is step 4 if my understanding of how this works is correct.
>
> Of course without step 4 you would have what you wanted. To be honest I am mystified by step 4, it looks like a bug to me


Re: A bug in string.gmatch and string.gsub?

Roberto Ierusalimschy
In reply to this post by Dirk Laurie-2
> I agree that there is a way to explain the current behaviour of "a*"
> logically, BUT:
>
>   (a) That way is based on a rule not stated in the documentation,
>   (b) which is not the only possible rule consistent with it.
>   (c) Another rule can be proposed,
>   (d) which is consistent with the way `sed` behaves,
>   (e) appears to be more intuitive,
>   (f) and can be implemented quite efficiently.

I may be wrong, but it seems that the two rules can be stated like this:

1) Do not match two empty strings in the same position. (current Lua rule)

2) Do not match an empty string in the same position as another match
(not necessarily empty). (sed rule)

Is rule 2 really more intuitive in general, or does it just happen to do what
you want in this particular case?

-- Roberto


Re: A bug in string.gmatch and string.gsub?

Scott Morgan
In reply to this post by Peter Hickman-3
On 29/04/13 17:03, Peter Hickman wrote:

> The reading of "a*" is to match zero or more 'a's
>
> So
> 1) There are zero 'a's before the first ';' so print ITEM
> 2) The ';' is not an 'a' so just print ;
> 3) The 'a' matches 'a*' and so we print ITEM
> 4) After the last match, but before the second ';' there are zero 'a's
> so we print ITEM
> 5) The ';' is not an 'a' so just print ;
> 6) After the last ';' but before the end of the string there are zero
> 'a's so print ITEM
>
> The only oddity for me is step 4 if my understanding of how this works
> is correct.
>
> Of course without step 4 you would have what you wanted. To be honest
> I am mystified by step 4, it looks like a bug to me
>

Even odder....

Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
  > =(";a;"):gsub("a-", "ITEM")
  ITEM;ITEMaITEM;ITEM    4

Why is that 'a' still there?

Scott



Re: A bug in string.gmatch and string.gsub?

Dirk Laurie-2
2013/4/29 Scott Morgan <[hidden email]>:

> Even odder....
>
> Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>  > =(";a;"):gsub("a-", "ITEM")
>  ITEM;ITEMaITEM;ITEM    4
>
> Why is that 'a' still there?

That's an independent bug. My patched version (which corrects
the other one) still has this one.


Re: A bug in string.gmatch and string.gsub?

Finn Wilcox-2
In reply to this post by Scott Morgan
On 29/04/2013 17:17, Scott Morgan wrote:
> Even odder....
>
> Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>  > =(";a;"):gsub("a-", "ITEM")
>  ITEM;ITEMaITEM;ITEM    4
>
> Why is that 'a' still there?
>
I don't think this one is a bug.  "a-" prefers to match the empty string
if possible, and that is all it does here (i.e. it matches the 4 distinct
empty substrings of ";a;", which is why the count is 4).

Finn


Re: A bug in string.gmatch and string.gsub?

Coda Highland
In reply to this post by Roberto Ierusalimschy
On Mon, Apr 29, 2013 at 9:11 AM, Roberto Ierusalimschy
<[hidden email]> wrote:

>> I agree that there is a way to explain the current behaviour of "a*"
>> logically, BUT:
>>
>>   (a) That way is based on a rule not stated in the documentation,
>>   (b) which is not the only possible rule consistent with it.
>>   (c) Another rule can be proposed,
>>   (d) which is consistent with the way `sed` behaves,
>>   (e) appears to be more intuitive,
>>   (f) and can be implemented quite efficiently.
>
> I may be wrong, but it seems that the two rules can be stated like that:
>
> 1) Do not match two empty strings in the same position. (current Lua rule)
>
> 2) Do not match an empty string in the same position of another match
> (not necessarily empty). (sed rule)
>
> Is rule 2 really more intuitive in general or it just happen to do what
> you want in this particular case?
>
> -- Roberto
>

I find rule 2 more intuitive, myself; it makes greedy quantifiers feel
more greedy, which is good.

/s/ Adam


Re: A bug in string.gmatch and string.gsub?

Dirk Laurie-2
In reply to this post by Roberto Ierusalimschy
2013/4/29 Roberto Ierusalimschy <[hidden email]>:

> I may be wrong, but it seems that the two rules can be stated like that:
>
> 1) Do not match two empty strings in the same position. (current Lua rule)
>
> 2) Do not match an empty string in the same position of another match
> (not necessarily empty). (sed rule)
>
> Is rule 2 really more intuitive in general or it just happen to do what
> you want in this particular case?

It has the advantage of making `split` trivial instead of requiring the
sort of thing that takes the Lua Wiki 300 lines to explain.
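
For comparison, a `split` that sidesteps the empty-match question altogether can be sketched with a plain `string.find` loop. The function name `split` and its signature are assumptions made for illustration; it is not a standard library function, and it treats the separator as a literal string rather than a pattern.

```lua
-- Hypothetical helper: split `s` on the literal separator `sep`,
-- keeping empty fields at the start, in the middle, and at the end.
local function split(s, sep)
  assert(sep ~= "", "separator must not be empty")
  local fields = {}
  local pos = 1
  while true do
    -- Fourth argument `true` requests a plain (non-pattern) search.
    local b, e = string.find(s, sep, pos, true)
    if not b then
      fields[#fields + 1] = string.sub(s, pos)    -- last field
      break
    end
    fields[#fields + 1] = string.sub(s, pos, b - 1)
    pos = e + 1
  end
  return fields
end

print(table.concat(split(";a;", ";"), "|"))  --> |a|
```

Because it never asks the pattern matcher for an empty match, the result is the same under either rule discussed in this thread.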


Re: A bug in string.gmatch and string.gsub?

Roberto Ierusalimschy
> 2013/4/29 Roberto Ierusalimschy <[hidden email]>:
>
> > I may be wrong, but it seems that the two rules can be stated like that:
> >
> > 1) Do not match two empty strings in the same position. (current Lua rule)
> >
> > 2) Do not match an empty string in the same position of another match
> > (not necessarily empty). (sed rule)
> >
> > Is rule 2 really more intuitive in general or it just happen to do what
> > you want in this particular case?
>
> It has the advantage of making `split` trivial instead of requiring the
> sort of thing that takes the Lua Wiki 300 lines to explain.

True, but that does not make it more intuitive; it makes it more useful
in one particular case. Are there other scenarios where it is more (or
less) useful?

-- Roberto


Re: A bug in string.gmatch and string.gsub?

Roberto Ierusalimschy
In reply to this post by Dirk Laurie-2
> > Even odder....
> >
> > Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
> >  > =(";a;"):gsub("a-", "ITEM")
> >  ITEM;ITEMaITEM;ITEM    4
> >
> > Why is that 'a' still there?
>
> That's an independent bug. [...]

Please stop calling "bug" something that does not behave as you wanted
or imagined.

-- Roberto


Re: A bug in string.gmatch and string.gsub?

Alexey Melnichuk
In reply to this post by Dirk Laurie-2
>
> The reading of "a*" is to match zero or more 'a's
> So
> 1) There are zero 'a's before the first ';' so print ITEM
> 2) The ';' is not an 'a' so just print ;

I think here we match the empty string and write ITEM.
I recently read Hopcroft's Introduction to Automata Theory :). And I
think this is like an ε-NFA: before reading each char we "read" the empty string.
For recognition it is not important when and how many times we read that
empty string, but for substitution it is important. I think it has to
be a documented implementation detail.

> 3) The 'a' matches 'a*' and so we print ITEM
> 4) After the last match, but before the second ';' there are zero 'a's so we print ITEM
> 5) The ';' is not an 'a' so just print ;
> 6) After the last ';' but before the end of the string there are zero 'a's so print ITEM
>
> The only oddity for me is step 4 if my understanding of how this works is correct.
>
> Of course without step 4 you would have what you wanted. To be honest I am mystified by step 4, it looks like a bug to me
>

>  > =(";a;"):gsub("a-", "ITEM")
>  ITEM;ITEMaITEM;ITEM    4
>
> Why is that 'a' still there?
I think "a-" always matches only the empty string (as short as possible).

For me, all of this behavior is predictable. Maybe it is a bug, but o



Re: A bug in string.gmatch and string.gsub?

Matt Eisan
In reply to this post by Roberto Ierusalimschy
Perhaps a new rule can be made? A way to distinguish between which you want to use?


On Mon, Apr 29, 2013 at 1:05 PM, Roberto Ierusalimschy <[hidden email]> wrote:
>> 2013/4/29 Roberto Ierusalimschy <[hidden email]>:
>>
>> > I may be wrong, but it seems that the two rules can be stated like that:
>> >
>> > 1) Do not match two empty strings in the same position. (current Lua rule)
>> >
>> > 2) Do not match an empty string in the same position of another match
>> > (not necessarily empty). (sed rule)
>> >
>> > Is rule 2 really more intuitive in general or it just happen to do what
>> > you want in this particular case?
>>
>> It has the advantage of making `split` trivial instead of requiring the
>> sort of thing that takes the Lua Wiki 300 lines to explain.
>
> True, but that does not make it more intuitive; it makes it more useful
> in one particular case. Are there other scenarios where it is more (or
> less) useful?
>
> -- Roberto



Re: A bug in string.gmatch and string.gsub?

Dirk Laurie-2
In reply to this post by Roberto Ierusalimschy
2013/4/29 Roberto Ierusalimschy <[hidden email]>:

>> > Even odder....
>> >
>> > Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>> >  > =(";a;"):gsub("a-", "ITEM")
>> >  ITEM;ITEMaITEM;ITEM    4
>> >
>> > Why is that 'a' still there?
>>
>> That's an independent bug. [...]
>
> Please stop calling "bug" something that does not behave as you wanted
> or imagined.

Sorry about that one: it was sent too hastily.


Re: A bug in string.gmatch and string.gsub?

Mark Gabby
In reply to this post by Roberto Ierusalimschy

Why does this behave in this way? I would expect the result to be:

ITEM;ITEMITEMITEM;ITEM 5

On Apr 29, 2013 10:09 AM, "Roberto Ierusalimschy" <[hidden email]> wrote:
>> > Even odder....
>> >
>> > Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>> >  > =(";a;"):gsub("a-", "ITEM")
>> >  ITEM;ITEMaITEM;ITEM    4
>> >
>> > Why is that 'a' still there?
>>
>> That's an independent bug. [...]
>
> Please stop calling "bug" something that does not behave as you wanted
> or imagined.
>
> -- Roberto


Re: A bug in string.gmatch and string.gsub?

David Heiko Kolf-2
Am 29.04.2013 20:20, schrieb Mark Gabby:
> Why does this behave in this way? I would expect the result to be:
>
> ITEM;ITEMITEMITEM;ITEM 5
>
>     > > Lua 5.2.2  Copyright (C) 1994-2013 Lua.org, PUC-Rio
>     > >  > =(";a;"):gsub("a-", "ITEM")
>     > >  ITEM;ITEMaITEM;ITEM    4
>     > >
>     > > Why is that 'a' still there?

"a-" matches the shortest string possible so that the rest of the
pattern still matches. An "a-" at the end of the pattern always matches
exactly 0 characters.
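
A short sketch of that point using `string.find`, which reports the start and end index of the first match (an empty match shows up as an end index one less than the start). The sample strings are made up for illustration.

```lua
-- At the end of a pattern, "a-" is satisfied by zero characters,
-- so the match at position 1 is empty.
print(string.find("aaa", "a-"))     --> 1	0

-- With something after it, "a-" has to expand until the rest matches.
print(string.find("aaa;", "a-;"))   --> 1	4

-- The greedy "a*" takes every 'a' it can straight away.
print(string.find("aaa", "a*"))     --> 1	3
```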

Best regards,

David Kolf


Re: A bug in string.gmatch and string.gsub?

Dirk Laurie-2
In reply to this post by Roberto Ierusalimschy
>> > I may be wrong, but it seems that the two rules can be stated like that:
>> >
>> > 1) Do not match two empty strings in the same position. (current Lua rule)
>> >
>> > 2) Do not match an empty string in the same position of another match
>> > (not necessarily empty). (sed rule)
>> >
>> > Is rule 2 really more intuitive in general or it just happen to do what
>> > you want in this particular case?
>>
>> It has the advantage of making `split` trivial instead of requiring the
>> sort of thing that takes the Lua Wiki 300 lines to explain.
>
> True, but that does not make it more intuitive; it makes it more useful
> in one particular case. Are there other scenarios where it is more (or
> less) useful?

Besides `sed`, there's also Python:

>>> import re
>>> re.subn(re.compile("a*"),"ITEM",";a;")
('ITEM;ITEM;ITEM', 3)

Many Lua users come from a background in which they know sed and Python.
That tends to influence what things they find intuitive.


Re: A bug in string.gmatch and string.gsub?

Tom N Harris
On Monday, April 29, 2013 03:10:53 PM Dirk Laurie wrote:
> Besides `sed`, there's also Python:
> >>> import re
> >>> re.subn(re.compile("a*"),"ITEM",";a;")
>
> ('ITEM;ITEM;ITEM', 3)
>
> Many Lua users come from a background in which they know sed and Python.
> That tends to influence what things they find intuitive.

However, Perl and PHP do the same thing as Lua.

$ perl -e '$s=";a;";$s=~s/a*/ITEM/g;print "$s\n";'
ITEM;ITEMITEM;ITEM
$ php -r '$s=preg_replace("/a*/","ITEM",";a;");print("$s\n");'
ITEM;ITEMITEM;ITEM
$ php -r '$s=ereg_replace("a*","ITEM",";a;");print("$s\n");'
ITEM;ITEMITEM;ITEM

And LPEG...
> re = require"re"
> print(re.gsub(";a;","'a'*","ITEM"))
/home/tnharris/share/lua/5.2/re.lua:230: loop body may accept empty string

--
tom <[hidden email]>
