[mind game] Seeking for simple task impossible via regexps

classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

[mind game] Seeking for simple task impossible via regexps

Martin
Hello all!

I'm seeking for simply formulated and easy-coding string matching
task which is impossible to solve via currently existing regular
expressions handlers.

(Indeed I'm seeking for a complete definition of task classes that
regexps can't handle in principle.)

For example:

  Check that given string in lowercase contains no more than one entry
  of alphabet character.

  (so all permutations of subset of alphabet [a-z] should be matched)

Another example:

  Check that given string in lowercase contains all symbols of
  predefined alphabet. Ignore spaces.

  (so "the quick brown fox jumps over the lazy dog" should match)

My bet is that no existing regexp facilities may handle this
tasks. But maybe some of them can, so asking.

(I assume that calling user-supplied function from regexp handler
is not allowed. Or else this is trivially solved by Lua's gsub().)

-- Martin

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Coda Highland
On Fri, Feb 24, 2017 at 1:29 AM, Martin <[hidden email]> wrote:

> Hello all!
>
> I'm seeking for simply formulated and easy-coding string matching
> task which is impossible to solve via currently existing regular
> expressions handlers.
>
> (Indeed I'm seeking for a complete definition of task classes that
> regexps can't handle in principle.)
>
> For example:
>
>   Check that given string in lowercase contains no more than one entry
>   of alphabet character.
>
>   (so all permutations of subset of alphabet [a-z] should be matched)

Not hard at all. If you support backreferences in pattern strings
(common feature) then it's just /([a-z]).*\1/. If you don't, then it's
more verbose, but it's just /(a.*a|b.*b|c.*c| --(snip)-- |y.*y|z.*z)/.

> Another example:
>
>   Check that given string in lowercase contains all symbols of
>   predefined alphabet. Ignore spaces.
>
>   (so "the quick brown fox jumps over the lazy dog" should match)

Also quite tractable, using lookahead: /(?=.*a)(?=.*b) --(snip)-- (?=.*z)/.

> My bet is that no existing regexp facilities may handle this
> tasks. But maybe some of them can, so asking.
>
> (I assume that calling user-supplied function from regexp handler
> is not allowed. Or else this is trivially solved by Lua's gsub().)
>
> -- Martin

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Coda Highland
In reply to this post by Martin
On Fri, Feb 24, 2017 at 1:29 AM, Martin <[hidden email]> wrote:
> Hello all!
>
> I'm seeking for simply formulated and easy-coding string matching
> task which is impossible to solve via currently existing regular
> expressions handlers.
>
> (Indeed I'm seeking for a complete definition of task classes that
> regexps can't handle in principle.)

To actually answer the question, though, the canonical example of "not
possible with regexps" is balanced braces. A regexp cannot verify that
every ( has a matching ) in the right place, but it's trivially easy
to implement in imperative code with a loop and a counter.

It's also trivial with a more powerful language processing tool, such as LPeg.

Basically, regular expressions are incapable of retaining
relationships between specific entities in the input string across
distance. Every step of evaluation can be done knowing only what part
of the regular expression has been evaluated so far and where you're
currently sitting in the input string. If any more information than
that is necessary, a regexp can't do it.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Martin
In reply to this post by Coda Highland
On 02/23/2017 03:44 PM, Coda Highland wrote:

>>   Check that given string in lowercase contains no more than one entry
>>   of alphabet character.
>>
>>   (so all permutations of subset of alphabet [a-z] should be matched)
>
> Not hard at all. If you support backreferences in pattern strings
> (common feature) then it's just /([a-z]).*\1/. If you don't, then it's
> more verbose, but it's just /(a.*a|b.*b|c.*c| --(snip)-- |y.*y|z.*z)/.
>
>> Another example:
>>
>>   Check that given string in lowercase contains all symbols of
>>   predefined alphabet. Ignore spaces.
>>
>>   (so "the quick brown fox jumps over the lazy dog" should match)
>
> Also quite tractable, using lookahead: /(?=.*a)(?=.*b) --(snip)-- (?=.*z)/.

Thank you for reply!

I didn't know about this regexp language. In what programming languages
it used (or what is the name of this regexp language itself)?

Also how /(a.*a|b.*b)/ is treated? (Let's assume two-character alphabet
{'a', 'b'} and string "abab" for which we shouldn't get a match). At
first glance this looking like "match('a', any_chars, 'a') or
match('b', any_chars, 'b')".

-- Martin


Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Charles Heywood
It's a standard regular expression in regular expression format (perhaps Perl compatible, I'm not sure). Lua has Lua patterns which are not regular expressions.

On Thu, Feb 23, 2017 at 7:57 PM Martin <[hidden email]> wrote:
On 02/23/2017 03:44 PM, Coda Highland wrote:
>>   Check that given string in lowercase contains no more than one entry
>>   of alphabet character.
>>
>>   (so all permutations of subset of alphabet [a-z] should be matched)
>
> Not hard at all. If you support backreferences in pattern strings
> (common feature) then it's just /([a-z]).*\1/. If you don't, then it's
> more verbose, but it's just /(a.*a|b.*b|c.*c| --(snip)-- |y.*y|z.*z)/.
>
>> Another example:
>>
>>   Check that given string in lowercase contains all symbols of
>>   predefined alphabet. Ignore spaces.
>>
>>   (so "the quick brown fox jumps over the lazy dog" should match)
>
> Also quite tractable, using lookahead: /(?=.*a)(?=.*b) --(snip)-- (?=.*z)/.

Thank you for reply!

I didn't know about this regexp language. In what programming languages
it used (or what is the name of this regexp language itself)?

Also how /(a.*a|b.*b)/ is treated? (Let's assume two-character alphabet
{'a', 'b'} and string "abab" for which we shouldn't get a match). At
first glance this looking like "match('a', any_chars, 'a') or
match('b', any_chars, 'b')".

-- Martin


--
--

Software Developer / System Administrator
Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Coda Highland
In reply to this post by Martin
On Fri, Feb 24, 2017 at 4:57 AM, Martin <[hidden email]> wrote:
> I didn't know about this regexp language. In what programming languages
> it used (or what is the name of this regexp language itself)?

Perl Compatible Regular Expressions. It's supported by Perl
(obviously), Python, PHP, Javascript, and thanks to libpcre pretty
much any language that can load C libraries.

> Also how /(a.*a|b.*b)/ is treated? (Let's assume two-character alphabet
> {'a', 'b'} and string "abab" for which we shouldn't get a match). At
> first glance this looking like "match('a', any_chars, 'a') or
> match('b', any_chars, 'b')".

Your reading is correct. If either of those conditions are matched,
then there's a duplicate letter in the string somewhere, and therefore
we know we should reject it. It's harder (but not impossible) to get
the inverse behavior, where the regexp matches legal strings and
rejects illegal strings; I believe you'd need a negative-lookahead
assertion.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

steve donovan
On Fri, Feb 24, 2017 at 5:11 AM, Coda Highland <[hidden email]> wrote:
> Perl Compatible Regular Expressions. It's supported by Perl
> (obviously), Python, PHP, Javascript, and thanks to libpcre pretty
> much any language that can load C libraries.

And it's larger than the whole of Lua ;)

Very powerful, very prone to write-once-and-never-read.  I've felt
that the lesser power of Lua string patterns is a blessing, not a
curse. It forces people to use if statements ;)

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Dirk Laurie-2
2017-02-24 10:14 GMT+02:00 steve donovan <[hidden email]>:

> On Fri, Feb 24, 2017 at 5:11 AM, Coda Highland <[hidden email]> wrote:
>> Perl Compatible Regular Expressions. It's supported by Perl
>> (obviously), Python, PHP, Javascript, and thanks to libpcre pretty
>> much any language that can load C libraries.
>
> And it's larger than the whole of Lua ;)
>
> Very powerful, very prone to write-once-and-never-read.  I've felt
> that the lesser power of Lua string patterns is a blessing, not a
> curse. It forces people to use if statements ;)

Lua's string handling (like so much of Lua) is a perfect compromise between
power and simplicity. I miss it so much when forced to use some regexp
dialect instead, e.g. when trying to compose a search-and-replace in vim.
The mnemonic power of alpha-coded character classes with uppercase
meaning the complementary class! The "if in doubt, escape it anyway" rule
that saves you from looking up every time whether a special character is
magic!

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Roberto Ierusalimschy
In reply to this post by Coda Highland
> Basically, regular expressions are incapable of retaining
> relationships between specific entities in the input string across
> distance. Every step of evaluation can be done knowing only what part
> of the regular expression has been evaluated so far and where you're
> currently sitting in the input string. If any more information than
> that is necessary, a regexp can't do it.

This is true for real regular expressions, not for what Perl and other
languages call "regular expressions". For instance, back references
clearly violate what real regular expressions can do. In particular,
because of back references, Perl's regexs are NP-complete: you can code
any NP problem as a regex plus a subject string (in polynomial time) and
Perl will solve the problem (not in polynomial time :-).

    http://perl.plover.com/NPC/NPC-3SAT.html

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Dirk Laurie-2
2017-02-24 14:28 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:

> because of back references, Perl's regexs are NP-complete: you can code
> any NP problem as a regex plus a subject string (in polynomial time) and
> Perl will solve the problem (not in polynomial time :-).
>
>     http://perl.plover.com/NPC/NPC-3SAT.html

To all those poets and artists who pity mathematicans and computer
scientists as soulless geeks who cannot comprehend beauty: this result,
in the words of one of your own number, is beauty bare.

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Coda Highland
In reply to this post by Roberto Ierusalimschy
On Fri, Feb 24, 2017 at 4:28 AM, Roberto Ierusalimschy
<[hidden email]> wrote:

>> Basically, regular expressions are incapable of retaining
>> relationships between specific entities in the input string across
>> distance. Every step of evaluation can be done knowing only what part
>> of the regular expression has been evaluated so far and where you're
>> currently sitting in the input string. If any more information than
>> that is necessary, a regexp can't do it.
>
> This is true for real regular expressions, not for what Perl and other
> languages call "regular expressions". For instance, back references
> clearly violate what real regular expressions can do. In particular,
> because of back references, Perl's regexs are NP-complete: you can code
> any NP problem as a regex plus a subject string (in polynomial time) and
> Perl will solve the problem (not in polynomial time :-).
>
>     http://perl.plover.com/NPC/NPC-3SAT.html
>
> -- Roberto

You are of course correct, but I have one objection, and one follow-up.

The objection: The transformation into an input string and regexp
requires work that cannot be done by a regexp match, or even a single
regexp replacement. This doesn't invalidate the proof in and of itself
but it also doesn't exactly play into the challenge.

The follow-up: While it's true that PCRE is strictly more powerful
than a finite automaton (that is, the languages it can express are
broader than the regular languages), it is not as powerful as a
pushdown automaton (that is, the languages it can express are not as
broad as the context-free languages).

In layman's terms, this means that there are languages that LPeg can
parse that PCRE cannot. (For example, PCRE cannot parse the balanced
parentheses language without calling out to Perl code.) And of course,
there are languages that LPeg cannot parse that arbitrary freeform Lua
code can.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Roberto Ierusalimschy
> In layman's terms, this means that there are languages that LPeg can
> parse that PCRE cannot. (For example, PCRE cannot parse the balanced
> parentheses language without calling out to Perl code.)

PCRE (and the Perl engine) can parse the balanced parentheses
language. Check for "(?R)".

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Coda Highland
On Fri, Feb 24, 2017 at 9:18 AM, Roberto Ierusalimschy
<[hidden email]> wrote:
>> In layman's terms, this means that there are languages that LPeg can
>> parse that PCRE cannot. (For example, PCRE cannot parse the balanced
>> parentheses language without calling out to Perl code.)
>
> PCRE (and the Perl engine) can parse the balanced parentheses
> language. Check for "(?R)".
>
> -- Roberto

Oh. Huh. I actually went and tried to research it before I posted
that, and I hadn't found a solution using that technique. TIL, I
guess! Though that's only supported by Perl, Ruby, and libpcre, not by
other PCRE-style implementations like Javascript and Python.

I don't know if there's a concrete answer to whether a regular
language with recursion is as powerful as a nondeterministic pushdown
automaton. My intuition says that they're equivalent to deterministic
PDAs, which are known to be strictly weaker than NPDAs, but in terms
of real-world parsing tasks I'm not certain if that ends up mattering.

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Soni "They/Them" L.


On 24/02/17 03:18 PM, Coda Highland wrote:

> On Fri, Feb 24, 2017 at 9:18 AM, Roberto Ierusalimschy
> <[hidden email]> wrote:
>>> In layman's terms, this means that there are languages that LPeg can
>>> parse that PCRE cannot. (For example, PCRE cannot parse the balanced
>>> parentheses language without calling out to Perl code.)
>> PCRE (and the Perl engine) can parse the balanced parentheses
>> language. Check for "(?R)".
>>
>> -- Roberto
> Oh. Huh. I actually went and tried to research it before I posted
> that, and I hadn't found a solution using that technique. TIL, I
> guess! Though that's only supported by Perl, Ruby, and libpcre, not by
> other PCRE-style implementations like Javascript and Python.
>
> I don't know if there's a concrete answer to whether a regular
> language with recursion is as powerful as a nondeterministic pushdown
> automaton. My intuition says that they're equivalent to deterministic
> PDAs, which are known to be strictly weaker than NPDAs, but in terms
> of real-world parsing tasks I'm not certain if that ends up mattering.
>
> /s/ Adam
>

Can you parse HTML with LibPCRE?

I think that'd answer your question.

--
Disclaimer: these emails may be made public at any given time, with or without reason. If you don't agree with this, DO NOT REPLY.


Reply | Threaded
Open this post in threaded view
|

Re: [mind game] Seeking for simple task impossible via regexps

Roberto Ierusalimschy
In reply to this post by Coda Highland
> I don't know if there's a concrete answer to whether a regular
> language with recursion is as powerful as a nondeterministic pushdown
> automaton.

If you have recursion on subexpressions (that is, you can include
something recursive inside a larger pattern, instead of only the
whole pattern being recursive), than the answer is yes.

-- Roberto