Comprehending string.gsub

Robert Virding
I am having a bit of trouble comprehending a call to string.gsub in the pm.lua file in the lua-5.2.0 tests. There is a line:

assert(string.gsub("abc", "%w", "%1%0") == "aabbcc")

which works. My problem is that I don't know why the substring %1 works or where it gets its value from. As far as I can see there is no captured substring at all. Using %2 generates an error (as I think it should).

Any help would be appreciated.

Robert

Re: Comprehending string.gsub

Peter Cawley
On Fri, Apr 13, 2012 at 11:54 AM, Robert Virding
<[hidden email]> wrote:
> I am having a bit of trouble comprehending a call to string.gsub in the pm.lua file in the lua-5.2.0 tests. There is a line:
>
> assert(string.gsub("abc", "%w", "%1%0") == "aabbcc")
>
> which works. My problem is that I don't know why the substring %1 works and from where it gets its value. As far as I can see there is no captured substring at all. Using %2 generates an error (as I think it should).

While I don't particularly like the behaviour, I can form a
pseudo-explanation as to what is going on. When repl is a table or
function, and the pattern specifies no captures, then the whole match
is used as the sole capture. One way of implementing this is to say
that if a pattern P specifies no captures, then it is treated as (P).
This effect then spills over to when repl is a string, causing %1 to
be the whole match.
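
The explanation above can be checked directly. The following is a minimal sketch, assuming a stock PUC-Rio Lua interpreter (5.1 or later); other implementations may behave differently, since the manual does not spell this out. The `args` table is a helper introduced only for this illustration.

```lua
-- With no captures in the pattern, %1 behaves as if the pattern were (P).
local plain = string.gsub("abc", "%w", "%1%0")
local wrapped = string.gsub("abc", "(%w)", "%1%0")
assert(plain == "aabbcc" and plain == wrapped)

-- A function repl sees the same thing: the whole match as its sole argument.
local args = {}
string.gsub("abc", "%w", function(...)
  -- record "<argument count>:<first argument>" for every match
  args[#args + 1] = select("#", ...) .. ":" .. (...)
end)
assert(table.concat(args, ",") == "1:a,1:b,1:c")
```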

Re: Comprehending string.gsub

Robert Virding

----- Original Message -----

> On Fri, Apr 13, 2012 at 11:54 AM, Robert Virding
> <[hidden email]> wrote:
> > I am having a bit of trouble comprehending a call to string.gsub in
> > the pm.lua file in the lua-5.2.0 tests. There is a line:
> >
> > assert(string.gsub("abc", "%w", "%1%0") == "aabbcc")
> >
> > which works. My problem is that I don't know why the substring %1 works
> > and from where it gets its value. As far as I can see there is no
> > captured substring at all. Using %2 generates an error (as I think
> > it should).
>
> While I don't particularly like the behaviour, I can form a
> pseudo-explanation as to what is going on. When repl is a table or
> function, and the pattern specifies no captures, then the whole match
> is used as the sole capture. One way of implementing this is to say
> that if a pattern P specifies no captures, then it is treated as (P).
> This effect then spills over to when repl is a string, causing %1 to
> be the whole match.

But isn't that what %0 is for? I have done some more testing and found that if the pattern contains no capture then %1 becomes the whole match, while if the pattern contains a capture then that capture is used. %2 seems to behave properly.

assert(string.gsub("abcd", "%w%w", "%1%0") == "ababcdcd")
assert(string.gsub("abcd", "(%w)%w", "%1%0") == "aabccd")

While I could go and check the C code, I get the feeling that since it is in the tests it is meant to behave like this, and I can't understand why. There is no mention or example of it in the manual.
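
One way to probe the %2 case without aborting a test script is to wrap each call in pcall. This is a sketch assuming the reference implementation, where an out-of-range capture index raises an error; the exact error message differs between versions, so it is deliberately not checked here.

```lua
-- %1 falls back to the whole match when the pattern has no captures...
local ok1, res1 = pcall(string.gsub, "abcd", "%w%w", "%1")
assert(ok1 and res1 == "abcd")

-- ...but %2 has nothing to fall back to and raises an error.
local ok2 = pcall(string.gsub, "abcd", "%w%w", "%2")
assert(not ok2)

-- With one explicit capture, %1 is that capture and %2 still fails.
local ok3, res3 = pcall(string.gsub, "abcd", "(%w)%w", "%1")
assert(ok3 and res3 == "ac")
local ok4 = pcall(string.gsub, "abcd", "(%w)%w", "%2")
assert(not ok4)
```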

Re: Comprehending string.gsub

Peter Cawley
On Fri, Apr 13, 2012 at 4:07 PM, Robert Virding
<[hidden email]> wrote:
> But isn't that what %0 is for.

Yes, but %0 is only available for string replacements. %1 is
synthesized to be the entire match for all types of replacements, as
this is convenient for all non-string replacements.

Re: Comprehending string.gsub

Robert Virding


----- Original Message -----
> On Fri, Apr 13, 2012 at 4:07 PM, Robert Virding
> <[hidden email]> wrote:
> > But isn't that what %0 is for.
>
> Yes, but %0 is only available for string replacements. %1 is
> synthesized to be the entire match for all types of replacements, as
> this is convenient for all non-string replacements.

I don't doubt you but I can't read that out of what it says in the manual:

"If repl is a string, then its value is used for replacement. The character % works as an escape character: any sequence in repl of the form %d, with d between 1 and 9, stands for the value of the d-th captured substring (see below). The sequence %0 stands for the whole match. The sequence %% stands for a single %.

If repl is a table, then the table is queried for every match, using the first capture as the key; if the pattern specifies no captures, then the whole match is used as the key.

If repl is a function, then this function is called every time a match occurs, with all captured substrings passed as arguments, in order; if the pattern specifies no captures, then the whole match is passed as a sole argument."

There is no problem implementing it this way. I will have to do some more checking with using tables and functions.
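
The table and function cases from the manual text quoted above can be exercised with a short sketch. It assumes the reference implementation; `upper` and `swapped` are names introduced only for this example.

```lua
-- Table repl: the key is the first capture, or the whole match if the
-- pattern has no captures. A missing key (nil) leaves the match unchanged.
local upper = { a = "A", c = "C" }
assert(string.gsub("abc", "%w", upper) == "AbC")        -- whole match as key
assert(string.gsub("abc", "(%w)", upper) == "AbC")      -- first capture as key
assert(string.gsub("abcd", "(%w)(%w)", upper) == "AC")  -- only the first capture is the key

-- Function repl: all captures are passed as arguments, in order.
local swapped = string.gsub("abcd", "(%w)(%w)", function(x, y)
  return y .. x
end)
assert(swapped == "badc")
```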

Re: Comprehending string.gsub

Roberto Ierusalimschy
> > Yes, but %0 is only available for string replacements. %1 is
> > synthesized to be the entire match for all types of replacements, as
> > this is convenient for all non-string replacements.
>
> I don't doubt you but I can't read that out of what it says in the manual:

If you want to understand what is going on, Peter gave a good explanation.

If you want to be picky, the manual does not say that it is an error for
'gsub' to access a non-existent capture.

There is a myriad of cases where the manual does not explain
what will happen. For all of them, assume a clause: "If
<something-not-explained-before-or-after> happens, the behavior is
undefined".

-- Roberto

Re: Comprehending string.gsub

Alexander Gladysh
On Fri, Apr 13, 2012 at 20:02, Roberto Ierusalimschy
<[hidden email]> wrote:

>> > Yes, but %0 is only available for string replacements. %1 is
>> > synthesized to be the entire match for all types of replacements, as
>> > this is convenient for all non-string replacements.
>>
>> I don't doubt you but I can't read that out of what it says in the manual:
>
> If you want to understand what is going on, Peter gave a good explanation.
>
> If you want to be picky, the manual does not say that it is an error for
> 'gsub' to access a non-existent capture.
>
> There is a myriad of cases where the manual does not explain
> what will happen. For all of them, assume a clause: "If
> <something-not-explained-before-or-after> happens, the behavior is
> undefined".

Unspecified? Implementation-specific? Undefined sounds scary to me. :-)

Alexander.

Re: Comprehending string.gsub

Matthew Wild
In reply to this post by Roberto Ierusalimschy
On 13 April 2012 17:02, Roberto Ierusalimschy <[hidden email]> wrote:

>> > Yes, but %0 is only available for string replacements. %1 is
>> > synthesized to be the entire match for all types of replacements, as
>> > this is convenient for all non-string replacements.
>>
>> I don't doubt you but I can't read that out of what it says in the manual:
>
> If you want to understand what is going on, Peter gave a good explanation.
>
> If you want to be picky, the manual does not say that it is an error for
> 'gsub' to access a non-existent capture.

Good enough for me - but then if the behaviour is undefined, why is it
in the tests? :)

Regards,
Matthew

Re: Comprehending string.gsub

Patrick Donnelly
On Fri, Apr 13, 2012 at 12:47 PM, Matthew Wild <[hidden email]> wrote:

> On 13 April 2012 17:02, Roberto Ierusalimschy <[hidden email]> wrote:
>>> > Yes, but %0 is only available for string replacements. %1 is
>>> > synthesized to be the entire match for all types of replacements, as
>>> > this is convenient for all non-string replacements.
>>>
>>> I don't doubt you but I can't read that out of what it says in the manual:
>>
>> If you want to understand what is going on, Peter gave a good explanation.
>>
>> If you want to be picky, the manual does not say that it is an error for
>> 'gsub' to access a non-existent capture.
>
> Good enough for me - but then if the behaviour is undefined, why is it
> in the tests? :)

Questions like these are probably why the Lua test suite was "hidden"
away for as long as it was and why the Lua source code repository is
still hidden. :)

--
- Patrick Donnelly

Re: Comprehending string.gsub

Roberto Ierusalimschy
In reply to this post by Matthew Wild
> Good enough for me - but then if the behaviour is undefined, why is it
> in the tests? :)

Another non-written rule in Lua is that, even for undefined behavior,
the interpreter should not crash. So, the tests must cover undefined
cases too.

Moreover, Lua uses whitebox testing. Many tests in Lua assume specific
behaviors that we know are true for that specific implementation. (For
instance, several tests assume specific wordings for error messages.)
This test style allows much stricter tests, which are not possible
when testing only the specification.

-- Roberto
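
The difference between the two test styles can be sketched as follows. `checkerror` is a hypothetical helper written for this illustration, and the message fragment is what the reference implementation is assumed to produce; a reimplementation could legitimately word the message differently.

```lua
-- Hypothetical helper: run f and require an error whose message contains msg.
local function checkerror(msg, f, ...)
  local ok, err = pcall(f, ...)
  -- plain-text search (4th argument true), so msg is not treated as a pattern
  return not ok and string.find(tostring(err), msg, 1, true) ~= nil
end

-- A function that always fails: arithmetic on nil.
local function bad()
  local x = nil
  return x + 1
end

-- Specification-level test: only requires that an error is raised.
assert(not pcall(bad))

-- Whitebox test: also pins down part of the wording of the message.
assert(checkerror("attempt to perform arithmetic", bad))
```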

Re: Comprehending string.gsub

David Manura
On Fri, Apr 13, 2012 at 2:33 PM, Roberto Ierusalimschy
<[hidden email]> wrote:

>>[...] if the behaviour is undefined, why is it in the tests? :)
> Another non-written rule in Lua is that, even for undefined behavior,
> the interpreter should not crash. So, the tests must cover undefined
> cases too.
>
> Moreover, Lua uses whitebox testing. Many tests in Lua assume specific
> behaviors that we know are true for that specific implementation. (For
> instance, several tests assume specific wordings for error messages.)
> This test style allows much stricter tests, which are not possible
> when testing only the specification.

Those re-implementing Lua may want to take the initiative to rewrite
the test suite in a form like this:

  function unspecified(f)
    if WARN_UNSPECIFIED == 'fail' then
      f() -- strict mode: let any failure propagate
    else
      local ok, err = pcall(f) -- check that it doesn't crash
      -- warn only when the call actually failed (err is nil on success)
      if not ok and WARN_UNSPECIFIED then
        io.stderr:write('WARN: failure in unspecified: ', tostring(err), '\n')
      end
    end
  end
  . . .
  unspecified(function()
    assert(string.gsub("abc", "%w", "%1%0") == "aabbcc")
  end)

(But I'm not sure if the test suite has a license, so I suggest
checking if it's ok to release a modified test suite.)

This also reminds me of the idea [1] of writing a wrapper around the
"_G" library so that it warns or fails upon hitting unspecified
behavior.  Incidentally, that would be one way to test the test suite
concerning unspecified behavior.

[1] http://lua-users.org/lists/lua-l/2012-03/msg00936.html
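
As a rough illustration of the wrapper idea for this particular case, the sketch below records a warning whenever a string repl refers to a capture index that the pattern does not define. `checked_gsub` and `warnings` are hypothetical names, not part of any library, and the capture count is inferred from the return values of string.find rather than by parsing the pattern.

```lua
local raw_gsub = string.gsub
local warnings = {}  -- collected here instead of printed, for easy inspection

-- Hypothetical checked variant of string.gsub.
local function checked_gsub(s, pattern, repl, n)
  if type(repl) == "string" then
    -- string.find returns two positions plus one value per capture,
    -- or a single nil when nothing matches (ncap is then negative).
    local ncap = select("#", string.find(s, pattern)) - 2
    local maxref = 0
    -- "%%(.)" consumes escape pairs left to right, so "%%1" is not
    -- mistaken for a reference to capture 1.
    raw_gsub(repl, "%%(.)", function(c)
      local d = tonumber(c)
      if d and d > maxref then maxref = d end
    end)
    if ncap >= 0 and maxref > ncap then
      warnings[#warnings + 1] =
        "%" .. maxref .. " used with " .. ncap .. " capture(s)"
    end
  end
  return raw_gsub(s, pattern, repl, n)
end

-- The call from pm.lua is flagged; the explicit-capture form is not.
assert(checked_gsub("abc", "%w", "%1%0") == "aabbcc")
assert(#warnings == 1)
assert(checked_gsub("abc", "(%w)", "%1%0") == "aabbcc")
assert(#warnings == 1)
```

The check costs one extra string.find per call, so it is meant for running a test suite, not for production use.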


On Fri, Apr 13, 2012 at 12:16 PM, Alexander Gladysh <[hidden email]> wrote:
> Unspecified? Implementation-specific? Undefined sounds scary to me. :-)

Background: http://stackoverflow.com/questions/2397984/undefined-unspecified-and-implementation-defined-behavior

The Lua Reference Manual briefly uses basically all three terms, but I
don't think it makes a distinction between unspecified and undefined.
As Roberto says, the Lua interpreter should not crash (though there
may be an issue of out-of-memory [2] without a special allocator and
in 5.1 at least certain functions could crash [3]).

Is it a true statement that Lua 5.1 allows hitting undefined behavior
in standard C but 5.2 does not?

[2] http://lua-users.org/lists/lua-l/2008-10/msg00415.html
[3] http://lua-users.org/lists/lua-l/2008-09/msg00157.html

Re: Comprehending string.gsub

Robert Virding
Yes, I am re-implementing Lua, which is why I am asking nit-picky questions like this. Even if something is unspecified it would still be good to behave in the same way as the standard implementation. I know that in some cases things are unspecified, but people still use them, "know" how they behave, and expect them to behave in a certain manner.

One of my favourite peeves is #table when the array part of the table doesn't form a sequence. It still works and returns a value which is indistinguishable from a valid value. In many cases the value is sensible, but not always. I know people shouldn't use it like this but they do, and they expect it to behave in the "standard" way.
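
A reimplementation or test harness that wants to flag this case could check whether a table is a sequence before trusting `#`. The function below is a hypothetical sketch following the Lua 5.2 definition (the positive numeric keys are exactly 1..n for some n); it walks the whole table, so it suits diagnostics more than hot paths.

```lua
-- Hypothetical helper: true if the positive numeric keys of t are exactly 1..n.
local function is_sequence(t)
  local n = 0
  for k in pairs(t) do
    if type(k) == "number" and k > 0 then
      if k ~= math.floor(k) then return false end  -- fractional key, e.g. 1.5
      n = n + 1
    end
  end
  -- n positive integer keys form a sequence only if 1..n are all present
  for i = 1, n do
    if t[i] == nil then return false end
  end
  return true
end

assert(is_sequence({ 1, 2, 3 }))
assert(is_sequence({ 1, 2, x = "non-numeric keys are ignored" }))
assert(not is_sequence({ 1, nil, 3 }))  -- hole at index 2
```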

I come from an environment where not much is unspecified and doing illegal things always results in an error.

Robert
