Inconsistency with the manual

Inconsistency with the manual

Soni "They/Them" L.
The manual states:

%f[set], a frontier pattern; such item matches an empty string at any
position such that the next character belongs to set and the previous
character does not belong to set. The set set is interpreted as
previously described. The beginning and the end of the subject are
handled as if they were the character '\0'.

A pattern is a sequence of pattern items. A caret '^' at the beginning
of a pattern anchors the match at the beginning of the subject string.

However, I noticed something kinda weird:

string.match("''", "%f[']'", 2) --> nil
string.match("''", "^'", 2) --> '

You'd think the first ' in the first string.match would be handled as
the character '\0', consistent with '^'. Or, you'd think the latter
would never match.

So uh, '^' -> runs the match only once, at the specified position?

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.


Re: Inconsistency with the manual

Martin
On 10/10/2017 10:43 PM, Soni L. wrote:
> However, I noticed something kinda weird:
>
> string.match("''", "%f[']'", 2) --> nil
> string.match("''", "^'", 2) --> '

string.match("''", "%f[']'", 2). From position 2 of string [['']],
locate an index <i> such that s[i] ~= [[']] and s[i + 1] == [[']]
(the frontier), with the literal [[']] then matching s[i + 1].

There is no such index, so nil is returned. For the string [['' ']]
there is a match.

string.match("''", "^'", 2). From position 2 of string [['']], find
the string [[']]. Do not skip characters (due to the "^" anchor). This
matches [[']], the second apostrophe.
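
A quick way to check both cases, as a minimal sketch assuming a stock Lua 5.2 or later interpreter (where %f is documented). string.find is used instead of string.match so the matched positions are visible:

```lua
-- Frontier for the set ['] followed by a literal apostrophe,
-- searched from position 2 in both strings.
local pat = "%f[']'"

-- "''": the only candidate is position 2, but the character before it
-- is also an apostrophe, so the frontier does not hold.
print(string.find("''", pat, 2))    --> nil

-- "'' '": at position 4 the previous character is a space, which is
-- not in the set, so the frontier holds and the apostrophe matches.
print(string.find("'' '", pat, 2))  --> 4   4
```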

-- Martin

Re: Inconsistency with the manual

Soni "They/Them" L.


On 2017-10-10 10:54 PM, Martin wrote:

> string.match("''", "%f[']'", 2). From position 2 of string [['']]
> locate to index <i> such that s[i] ~= [[']] and s[i + 1] == [[']] and
> s[i + 1] == [[']].
>
> There is no such index so nil is returned. In case of string [['' ']]
> there is match.
>
> string.match("''", "^'", 2). From position 2 of string [['']] find
> string [[']]. Do not skip characters (due "^" anchor). This matches
> [[']], second apostrophe.
>
> -- Martin
>

But from position 1, the first pattern matches. From reading the manual,
both operate on the start of the subject string, so either the first
should match or the second should fail.

Re: Inconsistency with the manual

Jonathan Goble
On Tue, Oct 10, 2017 at 10:06 PM Soni L. <[hidden email]> wrote:


> But from position 1, the first pattern matches. From reading the manual,
> both operate on the start of the subject string, so either the first
> should match or the second should fail.

Regarding the frontier pattern: it does not match ON a character. It matches the boundary BETWEEN two characters, specifically the boundary immediately before the next character to be matched. So when matching from position 1, the frontier pattern attempts to match the boundary immediately before the first character in the string. Since there is no previous character at this boundary, the frontier pattern matches against "\0" in lieu of the previous character.

"\0" ~= [[']] and [[']] (the first character of the string) == [[']], so the frontier pattern successfully matches. The literal [[']] then matches the first character of the string. If you passed this pattern to string.find instead, it would return a start and end point both equal to 1, since only the first character was matched. The frontier pattern does not consume the source string or add anything to the overall match; it's essentially just an assertion.

Now when you match from position 2, Lua first tries to match the frontier pattern at the boundary immediately before the character next to be matched. In this case, that is the second apostrophe, so the boundary to be checked is between the first and second characters of the source string. The character previous to that boundary is the first apostrophe, which is equal to an apostrophe, so the frontier assertion fails and the function returns nil.

The takeaways here are that frontier patterns check a boundary, not a character, and specifying an explicit start does not prevent Lua from checking the character previous to that if the pattern starts with a frontier assertion.
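
The frontier behavior described above can be observed directly. This is a small sketch, assuming Lua 5.2 or later; the variable name s is only for illustration:

```lua
local s = "''"

-- From position 1: the character before the boundary is treated as "\0",
-- which is not in the set, so the frontier holds and one apostrophe matches.
print(string.find(s, "%f[']'"))     --> 1   1

-- The frontier on its own matches the empty string at that boundary,
-- so the reported end position is one less than the start.
print(string.find(s, "%f[']"))      --> 1   0

-- From position 2: the character before the boundary is the first
-- apostrophe, which is in the set, so the assertion fails.
print(string.find(s, "%f[']'", 2))  --> nil
```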

Now to the second pattern. The "^" character can be a bit surprising when an explicit start position is specified, because the description in the manual does not match the actual behavior. The manual states that the caret "anchors the match at the beginning of the subject string", but in fact the caret anchors the match at the starting position (i.e. it prevents the matching engine from skipping characters).

Thus your second pattern attempts to match [[']] beginning at position 2, without skipping characters. Position 2 is in fact [[']], so the match succeeds. As a further example, consider string.match("'!'", "^'", 2); this will fail to match and return nil because the [[']] cannot match at position 2 and the caret prevents it from skipping characters. The fact that the first character is [[']] is irrelevant, since you've overridden that by telling Lua to start at position 2 instead.
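
The anchored and unanchored cases side by side, as a short sketch assuming the standard string library:

```lua
-- Anchored at the start position: position 2 of "''" is an apostrophe.
print(string.match("''", "^'", 2))   --> '

-- Anchored at the start position: position 2 of "'!'" is "!", and the
-- caret forbids moving on to position 3.
print(string.match("'!'", "^'", 2))  --> nil

-- Without the caret the search may skip ahead and finds the apostrophe
-- at position 3.
print(string.find("'!'", "'", 2))    --> 3   3
```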

The bug in this second issue appears to be in the documentation, not the code, since there's no good reason why Lua should ignore an explicit start argument when the caret is given, and valid reasons why it shouldn't ignore the start argument. Consider a simple tokenizer that repeatedly matches consecutive tokens without skipping anything else; the pattern could be prefixed with a ^ and suffixed with a position capture, and on each iteration, the result from the previous position capture is fed to the next call as the start argument. In this case, you would rely on the ^ meaning "anchor to start point", and a nil result would mean a syntax error in whatever you're tokenizing.
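
A rough sketch of such a tokenizer follows. The token kinds, the patterns, and the function name tokenize are all made up for illustration; the only thing taken from the paragraph above is the technique: each pattern starts with ^ and ends with a position capture, and the captured position becomes the start argument of the next call.

```lua
-- Hypothetical token patterns. Each is anchored with ^ and ends with
-- the position capture (), which yields the index just past the token.
local token_patterns = {
  { "number", "^(%d+)()" },
  { "name",   "^([%a_][%w_]*)()" },
  { "space",  "^(%s+)()" },
  { "op",     "^([%+%-%*/=])()" },
}

local function tokenize(src)
  local tokens, pos = {}, 1
  while pos <= #src do
    local matched = false
    for _, entry in ipairs(token_patterns) do
      local kind, pat = entry[1], entry[2]
      -- ^ anchors at pos, so a token is accepted only if it starts here.
      local text, next_pos = string.match(src, pat, pos)
      if text then
        if kind ~= "space" then
          tokens[#tokens + 1] = kind .. ":" .. text
        end
        pos = next_pos
        matched = true
        break
      end
    end
    if not matched then
      -- No pattern matched at pos: treat it as a syntax error.
      return nil, "unexpected character at position " .. pos
    end
  end
  return tokens
end

print(table.concat(tokenize("x = 42 + y1"), " "))
--> name:x op:= number:42 op:+ name:y1
print(tokenize("x = ?"))
--> nil     unexpected character at position 5
```

If ^ anchored at the true beginning of the string instead, every call after the first would fail, and without ^ the tokenizer would silently skip over characters it does not recognize.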

Re: Inconsistency with the manual

Soni "They/Them" L.


On 2017-10-10 11:48 PM, Jonathan Goble wrote:

> The bug in this second issue appears to be in the documentation, not
> the code, since there's no good reason why Lua should ignore an
> explicit start argument when the caret is given, and valid reasons why
> it shouldn't ignore the start argument. Consider a simple tokenizer
> that repeatedly matches consecutive tokens without skipping anything
> else; the pattern could be prefixed with a ^ and suffixed with a
> position capture, and on each iteration, the result from the previous
> position capture is fed to the next call as the start argument. In
> this case, you would rely on the ^ meaning "anchor to start point",
> and a nil result would mean a syntax error in whatever you're tokenizing.

Or you could interpret the caret definition to define the beginning of
the subject string as starting at the explicit start position. In which
case frontier is wrong.

It's one of:

- caret is wrong
- frontier is wrong
- manual is wrong

Re: Inconsistency with the manual

Dirk Laurie-2
2017-10-11 4:51 GMT+02:00 Soni L. <[hidden email]>:
> It's one of:
>
> - caret is wrong
> - frontier is wrong
> - manual is wrong

A fourth possibility: behaviour of a bare "^" is undefined.

Lua is replete with silently undefined and implementation-dependent
behaviour. It is impossible for manual writers to anticipate to what
lengths fractious but resourceful users might go.

Re: Inconsistency with the manual

Soni "They/Them" L.


On 2017-10-11 05:27 AM, Dirk Laurie wrote:

> 2017-10-11 4:51 GMT+02:00 Soni L. <[hidden email]>:
>> It's one of:
>>
>> - caret is wrong
>> - frontier is wrong
>> - manual is wrong
> A fourth possibility: behaviour of a bare "^" is undefined.
>
> Lua is replete with silently undefined and implementation-dependent
> behaviour. It is impossible for manual writers to anticipate to what
> lengths fractious but resourceful users might go.
>

I'd still like for %f to match at the start of a match. Is there a good
reason why it doesn't?

Re: Inconsistency with the manual

Dirk Laurie-2
2017-10-11 12:06 GMT+02:00 Soni L. <[hidden email]>:

> I'd still like for %f to match at the start of a match. Is there a good
> reason why it doesn't?

Specifying a starting position means that one is not interested in frontiers
before a certain point, not that a non-frontier should be treated as a frontier.

It is easy to get that behaviour if you want it: just code str:sub(n):match(pat)
instead of str:match(pat,n). Whereas if that behaviour were the default,
there would be no easy way of not getting it if you don't want it.
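
The difference between the two spellings, as a small sketch using the strings from the start of the thread (the variable names are only for illustration):

```lua
local str, pat, n = "''", "%f[']'", 2

-- With an init argument the frontier still sees the character before
-- position n, so the assertion fails here.
print(str:match(pat, n))      --> nil

-- Cutting the string first makes position n the real beginning of the
-- subject, so the frontier sees "\0" before it and the match succeeds.
print(str:sub(n):match(pat))  --> '

-- Positions reported on the substring are relative to it; add n - 1
-- to map them back to the original string.
print(str:sub(n):find(pat))   --> 1   1
```

Note that str:sub(n) creates a new string, which may matter when this is done repeatedly on large inputs.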

Re: Inconsistency with the manual

Tim Hill
In reply to this post by Jonathan Goble


On Oct 10, 2017, at 7:48 PM, Jonathan Goble <[hidden email]> wrote:

> The bug in this second issue appears to be in the documentation, not
> the code, since there's no good reason why Lua should ignore an
> explicit start argument when the caret is given, and valid reasons why
> it shouldn't ignore the start argument. Consider a simple tokenizer
> that repeatedly matches consecutive tokens without skipping anything
> else; the pattern could be prefixed with a ^ and suffixed with a
> position capture, and on each iteration, the result from the previous
> position capture is fed to the next call as the start argument. In
> this case, you would rely on the ^ meaning "anchor to start point",
> and a nil result would mean a syntax error in whatever you're tokenizing.

I think the Lua docs are ok.

I would expect Lua to start at the ‘init’ position regardless of the presence of a caret anchor, and my reading of the term “subject string” in the docs would mean the string starting at the specified ‘init’ position (since this term occurs in a general discussion of pattern matching, not string.match). So the docs match the expected (by me) behavior. And you seem to say that in the paragraph above (and I agree with your reasoning).

—Tim