Safe user-facing pattern matching.

classic Classic list List threaded Threaded
32 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Safe user-facing pattern matching.

Jorge Visca
As we all know, patterns can be crafted to hog the CPU. Suppose you want
to offer the possibility to run user provided patterns on a biggish
text, let's say trough webpage. What would be the simplest alternative,
keeping as much expressive power as possible, while being safe, to
standard Lua patterns?

Jorge


Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Rena
On Mar 30, 2017 1:06 AM, "Jorge" <[hidden email]> wrote:
As we all know, patterns can be crafted to hog the CPU. Suppose you want to offer the possibility to run user provided patterns on a biggish text, let's say trough webpage. What would be the simplest alternative, keeping as much expressive power as possible, while being safe, to standard Lua patterns?

Jorge


Probably ulimit.

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Jorge Visca

Well, that does not make the program safe, it just kills it, no? :)

I was thinking something Lua only, like pattern subset/sanitization or such

Jorge


On 30/03/17 10:07, Rena wrote:
On Mar 30, 2017 1:06 AM, "Jorge" <[hidden email]> wrote:
As we all know, patterns can be crafted to hog the CPU. Suppose you want to offer the possibility to run user provided patterns on a biggish text, let's say trough webpage. What would be the simplest alternative, keeping as much expressive power as possible, while being safe, to standard Lua patterns?

Jorge


Probably ulimit.


Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Viacheslav Usov
I think we need a general solution for this problem.

While a Lua script executes Lua bytecode, it can be stopped by injecting a debug hook. But once it is inside a C function, standard or third-party, nothing at all can be done (short of killing the process). Why cannot we have an atomic flag in the Lua state structure that can be set whenever we want to abort execution, which can then be checked by potentially long running C functions? Not saying this is a panacea, but this is something which would simplify building a sandbox.

Cheers,
V.
Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Coda Highland
In reply to this post by Jorge Visca
On Thu, Mar 30, 2017 at 8:38 AM, Jorge <[hidden email]> wrote:
> Well, that does not make the program safe, it just kills it, no? :)
>
> I was thinking something Lua only, like pattern subset/sanitization or such
>
> Jorge

In terms of computational theory, there's not a whole lot you can do
without sacrificing something significant somewhere. The biggest time
sink in a pattern-matching system is the need to perform backtracking
when an option fails.

Every regular expression (in the true computational sense, not in the
Perl sense) can be converted to a non-backtracking parser. A
non-backtracking parser is guaranteed to complete in linear time.
However, that transformation can itself be expensive, and for a
one-off pattern on a small subject this transformation is probably
more expensive than just running the backtracking version.

You could, hypothetically, create a pattern syntax that makes it
easier to construct a non-backtracking parser out of it, but you might
be surprised what you lose in convenience. You don't lose anything in
true expressive power, but the equivalent patterns might end up being
substantially bulkier to write. Basically, every form of repetition
has to encode its termination condition somehow, and alternatives have
to go with the first available choice because once the decision is
made there's no backtracking.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Tomás Guisasola-2
In reply to this post by Viacheslav Usov
Hi Viacheslav

> While a Lua script executes Lua bytecode, it can be stopped by
> injecting a debug hook. But once it is inside a C function, standard
> or third-party, nothing at all can be done (short of killing the
> process). Why cannot we have an atomic flag in the Lua state structure
> that can be set whenever we want to abort execution, which can then be
> checked by potentially long running C functions? Not saying this is a
> panacea, but this is something which would simplify building a
> sandbox.
But as you said, this "long running C function" could be a third-party
one, thus there is no guarantee the flag will be checked, isn't it?

Regards,
Tomás

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Rena
In reply to this post by Jorge Visca
On Thu, Mar 30, 2017 at 11:38 AM, Jorge <[hidden email]> wrote:

Well, that does not make the program safe, it just kills it, no? :)


Yes, so you do the work in a child process and present the result (whether it's success or killed-due-to-resource-limits) to the user. 

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Jorge Visca
In reply to this post by Coda Highland
On 30/03/17 13:10, Coda Highland wrote:

> On Thu, Mar 30, 2017 at 8:38 AM, Jorge <[hidden email]> wrote:
>> Well, that does not make the program safe, it just kills it, no? :)
>>
>> I was thinking something Lua only, like pattern subset/sanitization or such
>>
>> Jorge
> In terms of computational theory, there's not a whole lot you can do
> without sacrificing something significant somewhere. The biggest time
> sink in a pattern-matching system is the need to perform backtracking
> when an option fails.
So, alternatively, what would be the most expressive non-backtracking
pattern? For example, is there something more powerful than the * and +
wildcard matching?
Jorge



Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Coda Highland
On Thu, Mar 30, 2017 at 12:55 PM, Jorge <[hidden email]> wrote:

> On 30/03/17 13:10, Coda Highland wrote:
>>
>> On Thu, Mar 30, 2017 at 8:38 AM, Jorge <[hidden email]> wrote:
>>>
>>> Well, that does not make the program safe, it just kills it, no? :)
>>>
>>> I was thinking something Lua only, like pattern subset/sanitization or
>>> such
>>>
>>> Jorge
>>
>> In terms of computational theory, there's not a whole lot you can do
>> without sacrificing something significant somewhere. The biggest time
>> sink in a pattern-matching system is the need to perform backtracking
>> when an option fails.
>
> So, alternatively, what would be the most expressive non-backtracking
> pattern? For example, is there something more powerful than the * and +
> wildcard matching?
> Jorge

The problem isn't so much one of power and expressiveness. It's one of
conciseness.

Regular expressions deal with ambiguity using backtracking. Consider
the expression "(ab)*ac". It's simple enough, but it's impossible to
evaluate on ANY subject string without backtracking. (The expression
"a(ba)*c" on the other hand matches the same set of strings and can be
evaluated without backtracking, but it captures a different set of
groups.)

On the opposite side of the conciseness spectrum, the following
pseudocode also matches the same set of strings, in strictly linear
time with no backtracking:

state = 1
for each character in subject:
  if state == 1:
    if character == 'a':
      state = 2
    else:
      reject
  else if state == 2:
    if character == 'b':
      -- If you wanted to, this would be a place
      -- where you could emit a capture group
      state = 1
    else if character == 'c':
      accept
    else:
      reject

One of the local LPeg gurus could show you an example of some code
somewhere in between, but LPeg still uses backtracking. (It also
supports grammars of a higher order than regular expressions.)

There are many other syntaxes and formalisms in common use, of varying
complexities and power.

So the real question is... what do you want the user's pattern to LOOK
like? And how much CPU time are you willing to spend transforming the
user's pattern into something that the computer can work with? How
much work do you want to put into building the tools, if there isn't a
standard one available?

That said, this is a well-researched field of study. I'll point you to
http://hackingoff.com/compilers/regular-expression-to-nfa-dfa for a
place you can start looking.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Viacheslav Usov
In reply to this post by Tomás Guisasola-2
On Thu, Mar 30, 2017 at 6:59 PM, tomas <[hidden email]> wrote:

> But as you said, this "long running C function" could be a third-party one, thus there is no guarantee the flag will be checked, isn't it?

There is no guarantee, but when there is common "abort" API that third-party libraries can use, some of them will use it. Then anyone needing a sandboxed Lua environment can take those libraries that support it, patch those that do not, or develop something from scratch. With common API, all of that will inter-operate nicely.

Cheers,
V.
Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Gé Weijers
Many regular expression implementations have added features that can't be implemented by a finite state automaton, and they end up requiring backtracking. Lua's "%bxy" feature is one of them, as are 'back references' in the search string: ^(.*)-\1$  (two equal strings separated by a hyphen)

This problem could be solved by using a regex implementation that limits itself to 'true' regular expressions that can be implemented by an FSA.

A very full featured one is:


Downside: it's written in C++, it should still be fairly easy to add a simple interface wrapper to Lua.




On Fri, Mar 31, 2017 at 1:21 AM, Viacheslav Usov <[hidden email]> wrote:
On Thu, Mar 30, 2017 at 6:59 PM, tomas <[hidden email]> wrote:

> But as you said, this "long running C function" could be a third-party one, thus there is no guarantee the flag will be checked, isn't it?

There is no guarantee, but when there is common "abort" API that third-party libraries can use, some of them will use it. Then anyone needing a sandboxed Lua environment can take those libraries that support it, patch those that do not, or develop something from scratch. With common API, all of that will inter-operate nicely.

Cheers,
V.



--
--

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Rain Gloom
In reply to this post by Viacheslav Usov
> but once it is inside a C function, standard or third-party, nothing at all can be done (short of killing the process).

There is a pure Lua implementation of the pattern matching functions in OpenComputers [1].
[1]: https://github.com/MightyPirates/OpenComputers/blob/master-MC1.7.10/src/main/resources/assets/opencomputers/lua/machine.lua#L35
(I am not the author)
Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Roberto Ierusalimschy
In reply to this post by Gé Weijers
> Many regular expression implementations have added features that can't be
> implemented by a finite state automaton, and they end up requiring
> backtracking. Lua's "%bxy" feature is one of them, as are 'back references'
> in the search string: ^(.*)-\1$  (two equal strings separated by a hyphen)

Small correction: "%bxy" certainly cannot be implemented by a finite
state automaton, but it does not require backtracking. It is
easy to do an efficient (linear) implementation of "%bxy" and integrate
it with a finite state automaton. ('back references' as in Perl, on the
other hand, are NP complete.)

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Gé Weijers

This may be tricky on regular(-ish) expressions like "^%(*%b()$", given input like "(((((((((((((((((((((()))())".
This can definitely be done using a sub-exponential algorithm (e.g. generate an ambiguous context free grammar from the expression, and use Early's parser to get O(n^3) performance).
I don't yet see how to do this in linear time by bolting a recognizer for the "%b()" pattern onto a finite state automaton.

"^%(*%b()%b)($" should be a fun regex too :-) 



On Fri, Mar 31, 2017 at 7:58 AM, Roberto Ierusalimschy <[hidden email]> wrote:
> Many regular expression implementations have added features that can't be
> implemented by a finite state automaton, and they end up requiring
> backtracking. Lua's "%bxy" feature is one of them, as are 'back references'
> in the search string: ^(.*)-\1$  (two equal strings separated by a hyphen)

Small correction: "%bxy" certainly cannot be implemented by a finite
state automaton, but it does not require backtracking. It is
easy to do an efficient (linear) implementation of "%bxy" and integrate
it with a finite state automaton. ('back references' as in Perl, on the
other hand, are NP complete.)

-- Roberto




--
--

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Martin
On 04/05/2017 06:14 PM, Gé Weijers wrote:
> I don't yet see how to do this in linear time by bolting a recognizer
> for the "%b()" pattern onto a finite state automaton.

I think this is impossible for canonic finite state automata (which is
just a series of triples (state, cur_char, next_state) from limited
alphabet).

But is possible if we add some "condition" and "action" fields
that contain functions and procedures:

FSA (finite state automata):

 state | cur_char |   cond   |            action          | next_state
-------+----------+----------+----------------------------+-----------
 SCAN  |   EOL    |          |                                 | DONE
 SCAN  |    (     |          | starts[++deep] = POS            | SCAN
 SCAN  |    )     | deep > 0 | capt = seg(starts[deep--], POS) | SCAN
 SCAN  |    )     |          |                                 | SCAN
 SCAN  |   ANY    |          |                                 | SCAN

Surely by allowing this extension we multiply execution time by
some constant factor (as we may no longer get new state just by
referencing FSA[state][cur_char].next_state, we have to scan entries
of FSA[state][cur_char] checking conditions).

-- Martin

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Coda Highland
On Mon, Apr 10, 2017 at 11:37 PM, Martin <[hidden email]> wrote:

> On 04/05/2017 06:14 PM, Gé Weijers wrote:
>> I don't yet see how to do this in linear time by bolting a recognizer
>> for the "%b()" pattern onto a finite state automaton.
>
> I think this is impossible for canonic finite state automata (which is
> just a series of triples (state, cur_char, next_state) from limited
> alphabet).
>
> But is possible if we add some "condition" and "action" fields
> that contain functions and procedures:
>
> FSA (finite state automata):
>
>  state | cur_char |   cond   |            action          | next_state
> -------+----------+----------+----------------------------+-----------
>  SCAN  |   EOL    |          |                                 | DONE
>  SCAN  |    (     |          | starts[++deep] = POS            | SCAN
>  SCAN  |    )     | deep > 0 | capt = seg(starts[deep--], POS) | SCAN
>  SCAN  |    )     |          |                                 | SCAN
>  SCAN  |   ANY    |          |                                 | SCAN
>
> Surely by allowing this extension we multiply execution time by
> some constant factor (as we may no longer get new state just by
> referencing FSA[state][cur_char].next_state, we have to scan entries
> of FSA[state][cur_char] checking conditions).
>
> -- Martin

I would have implemented this as a four-state FSA with an action that
happens to include a conditional in it. Doing that doesn't actually
impact the design of the FSA itself and therefore carries no more
overhead than just supporting actions at all would.

That said, usually when I implement this kind of thing I dispense with
the strictness of the formalism and I instead have the actions return
the next state instead of having next_state as a distinct, fixed
entity.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Gé Weijers
In reply to this post by Martin
On Mon, Apr 10, 2017 at 13:38 Martin <[hidden email]> wrote:
On 04/05/2017 06:14 PM, Gé Weijers wrote:
> I don't yet see how to do this in linear time by bolting a recognizer
> for the "%b()" pattern onto a finite state automaton.

I think this is impossible for canonic finite state automata (which is
just a series of triples (state, cur_char, next_state) from limited
alphabet).

But is possible if we add some "condition" and "action" fields
that contain functions and procedures:

FSA (finite state automata):

 state | cur_char |   cond   |            action          | next_state
-------+----------+----------+----------------------------+-----------
 SCAN  |   EOL    |          |                                 | DONE
 SCAN  |    (     |          | starts[++deep] = POS            | SCAN
 SCAN  |    )     | deep > 0 | capt = seg(starts[deep--], POS) | SCAN
 SCAN  |    )     |          |                                 | SCAN
 SCAN  |   ANY    |          |                                 | SCAN

A regular expression like ".*%b()" will match "(((((((())", which your automaton will not unless it's a nondeterministic one. 
You cannot know how many '('s are matched by ".*" until you get to the end of the string.

I don't think there's a linear time algorithm to match expressions like this in the general case. You either backtrack, or your deterministic state representation is a list only bounded by the input size. 

--
-- Gė
Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Coda Highland
On Thu, Apr 20, 2017 at 5:48 PM, Gé Weijers <[hidden email]> wrote:

> On Mon, Apr 10, 2017 at 13:38 Martin <[hidden email]> wrote:
>>
>> On 04/05/2017 06:14 PM, Gé Weijers wrote:
>> > I don't yet see how to do this in linear time by bolting a recognizer
>> > for the "%b()" pattern onto a finite state automaton.
>>
>> I think this is impossible for canonic finite state automata (which is
>> just a series of triples (state, cur_char, next_state) from limited
>> alphabet).
>>
>> But is possible if we add some "condition" and "action" fields
>> that contain functions and procedures:
>>
>> FSA (finite state automata):
>>
>>  state | cur_char |   cond   |            action          | next_state
>> -------+----------+----------+----------------------------+-----------
>>  SCAN  |   EOL    |          |                                 | DONE
>>  SCAN  |    (     |          | starts[++deep] = POS            | SCAN
>>  SCAN  |    )     | deep > 0 | capt = seg(starts[deep--], POS) | SCAN
>>  SCAN  |    )     |          |                                 | SCAN
>>  SCAN  |   ANY    |          |                                 | SCAN
>
>
> A regular expression like ".*%b()" will match "(((((((())", which your
> automaton will not unless it's a nondeterministic one.
> You cannot know how many '('s are matched by ".*" until you get to the end
> of the string.
>
> I don't think there's a linear time algorithm to match expressions like this
> in the general case. You either backtrack, or your deterministic state
> representation is a list only bounded by the input size.
>
> Gé
> --
> -- Gė

In the general case, no, you're right. Context-free languages require
polynomial time.

However, the LR subset of the context-free languages CAN be matched in
linear time. A great many grammars can be transformed into LR. The
language of matched parentheses, for example, is in LR. XML is in LR.
Javascript was in LR at one point in its history. (I don't know if all
of the newer features still keep it there.)

You ARE right that LR parsers need an internal stack, and that stack
in the worst case can grow in the size of the input. (That said, this
is true for backtracking parsers as well.)

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Martin
In reply to this post by Gé Weijers
On 04/20/2017 05:48 PM, Gé Weijers wrote:

> On Mon, Apr 10, 2017 at 13:38 Martin <[hidden email]
> <mailto:[hidden email]>> wrote:
>
>     On 04/05/2017 06:14 PM, Gé Weijers wrote:
>     > I don't yet see how to do this in linear time by bolting a recognizer
>     > for the "%b()" pattern onto a finite state automaton.
>
>     I think this is impossible for canonic finite state automata (which is
>     just a series of triples (state, cur_char, next_state) from limited
>     alphabet).
>
>     But is possible if we add some "condition" and "action" fields
>     that contain functions and procedures:
>
>     FSA (finite state automata):
>
>      state | cur_char |   cond   |            action          | next_state
>     -------+----------+----------+----------------------------+-----------
>      SCAN  |   EOL    |          |                                 | DONE
>      SCAN  |    (     |          | starts[++deep] = POS            | SCAN
>      SCAN  |    )     | deep > 0 | capt = seg(starts[deep--], POS) | SCAN
>      SCAN  |    )     |          |                                 | SCAN
>      SCAN  |   ANY    |          |                                 | SCAN
>
>
> A regular expression like ".*%b()" will match "(((((((())", which your
> automaton will not unless it's a nondeterministic one.
> You cannot know how many '('s are matched by ".*" until you get to the
> end of the string.
>
> I don't think there's a linear time algorithm to match expressions like
> this in the general case. You either backtrack, or your deterministic
> state representation is a list only bounded by the input size.
>
> -- Gė

Yes, AFAIK it's impossible to match ".*%b()" in linear time. That
message was regarding FSA to match "%b()".

This is due dreaded left recursion: in ('(((((((())'):match('.*%b()')
for any unprocessed char we don't know whether it belongs to "%b()" or
to ".*" part.

So we're diving, moving all chars to ".*" part until end of string,
where is no space for "%b()". Then, one level up we're moving last ")"
to %b part just to fail again. Finally we succeed in matching "()" as
"%b()" and return this result:

Lua 5.3.3  Copyright (C) 1994-2016 Lua.org, PUC-Rio
> ('(((((((())'):match('.*%b()')
(((((((()

BTW there is tricky part: probably you want "(())" in %b part.
In this case pattern should be ".-%b()":

> ('(((((((())'):match('(.-)(%b())')
((((((  (())

-- Martin

Reply | Threaded
Open this post in threaded view
|

Re: Safe user-facing pattern matching.

Coda Highland
On Sat, Apr 22, 2017 at 1:40 PM, Martin <[hidden email]> wrote:
> Yes, AFAIK it's impossible to match ".*%b()" in linear time. That
> message was regarding FSA to match "%b()".
>
> This is due dreaded left recursion: in ('(((((((())'):match('.*%b()')
> for any unprocessed char we don't know whether it belongs to "%b()" or
> to ".*" part.

Nope, you can match it in linear time with a stack and a counter:

1. Push every character onto a stack until you hit the end of the string.
2. If the stack is empty, end.
3. Pop a character.
4. If the character was not '(' or ')', end.
4. If the character was ')', increment the counter by 1.
5. If the character was '(', decrement the counter by 1.
6. If the counter is negative, end.
7. Go to 2.

/s/ Adam

12