Filtering chars from string

classic Classic list List threaded Threaded
21 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Filtering chars from string

Scott Morgan
What's the best way to filter out non-specified chars from an ASCII
string? (so no need to worry about UTF8, etc.)

e.g.:

local allowed = "AbC"
local txt = "dDcCbBaA"

filter(txt, allowed, "") -- ret: "CbA"
filter(txt, allowed, "?") -- ret: "???Cb??A"

For extra difficulty, plain Lua 5.1, no modules (lpeg, etc.)

I can see a way using gsub(txt, ".", filter_func), but was wondering if
there are any tricks that might make it simpler.

Scott

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Luiz Henrique de Figueiredo
On Tue, Aug 27, 2019 at 5:55 AM Scott Morgan <[hidden email]> wrote:
>
> What's the best way to filter out non-specified chars from an ASCII
> string? (so no need to worry about UTF8, etc.)
>
> local allowed = "AbC"
> local txt = "dDcCbBaA"
>
> filter(txt, allowed, "") -- ret: "CbA"
> filter(txt, allowed, "?") -- ret: "???Cb??A"

function filter(t,a,r)
   print((t:gsub("[^"..a.."]",r)))
end

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Francisco Olarte
In reply to this post by Scott Morgan
Scott:

On Tue, Aug 27, 2019 at 10:55 AM Scott Morgan <[hidden email]> wrote:
> What's the best way to filter out non-specified chars from an ASCII
> string? (so no need to worry about UTF8, etc.)

> filter(txt, allowed, "") -- ret: "CbA"
> filter(txt, allowed, "?") -- ret: "???Cb??A"
>
> For extra difficulty, plain Lua 5.1, no modules (lpeg, etc.)
>
> I can see a way using gsub(txt, ".", filter_func), but was wondering if
> there are any tricks that might make it simpler.

With your examples it seems building a char-set from the allowed chars
and using gsub will do the trick, i.e.:

string.gsub(txt, build_charset(allowed), repl)

> string.gsub("12345678", "[^246]", "")
246    5
> string.gsub("12345678", "[^246]", "?")
?2?4?6??    5

Then, to build the charset, you just have to filter a couple of
special chars ( take care of detecting % and - in allowed and escape
them ) ( I'm not too sure if you need to escape a leading ^ too, but a
couple oneliners would reveal it ).

Francisco Olarte.

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Scott Morgan
In reply to this post by Luiz Henrique de Figueiredo
On 27/08/2019 10:44, Luiz Henrique de Figueiredo wrote:

> On Tue, Aug 27, 2019 at 5:55 AM Scott Morgan <[hidden email]> wrote:
>>
>> What's the best way to filter out non-specified chars from an ASCII
>> string? (so no need to worry about UTF8, etc.)
>>
>> local allowed = "AbC"
>> local txt = "dDcCbBaA"
>>
>> filter(txt, allowed, "") -- ret: "CbA"
>> filter(txt, allowed, "?") -- ret: "???Cb??A"
>
> function filter(t,a,r)
>    print((t:gsub("[^"..a.."]",r)))
> end
>

Doh! Knew I was missing something simple, hence the question.

Is there a limit to how big those sets can be? In practice I'd be
looking at a few dozen allowed chars.

Scott

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Luiz Henrique de Figueiredo
> Is there a limit to how big those sets can be? In practice I'd be
> looking at a few dozen allowed chars.

No limit.
But as Francisco Olarte said, you may need to escape magic characters.
See http://www.lua.org/manual/5.3/manual.html#6.4.1

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

nobody
In reply to this post by Scott Morgan
On 27/08/2019 14.46, Scott Morgan wrote:

> On 27/08/2019 10:44, Luiz Henrique de Figueiredo wrote:
>> On Tue, Aug 27, 2019 at 5:55 AM Scott Morgan <[hidden email]> wrote:
>>>
>>> What's the best way to filter out non-specified chars from an ASCII
>>> string? (so no need to worry about UTF8, etc.)
>>>
>>> local allowed = "AbC"
>>> local txt = "dDcCbBaA"
>>>
>>> filter(txt, allowed, "") -- ret: "CbA"
>>> filter(txt, allowed, "?") -- ret: "???Cb??A"
>>
>> function filter(t,a,r)
>>     print((t:gsub("[^"..a.."]",r)))
>> end
>>
>
> Doh! Knew I was missing something simple, hence the question.
>
> Is there a limit to how big those sets can be? In practice I'd be
> looking at a few dozen allowed chars.

Even if you put in all possible byte values it works:

     s = "["
     for i = 0, 255 do  s = s .. "%" .. string.char( i )  end
     s = s .. "]"
     for m in ("xyzzy"):gmatch( s ) do  print( m )  end
     --> x y z z y
     for m in ("xyzzy"):gmatch( "[^"..s:sub(2,-1) ) do  print( m )  end
     --> (nothing)

(And yes always putting in the escaping '%' works fine and is probably
safer/simpler than a bunch of special cases if you're generating the set…)

-- nobody

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Francisco Olarte
In reply to this post by Scott Morgan
Scott:

On Tue, Aug 27, 2019 at 2:46 PM Scott Morgan <[hidden email]> wrote:
> Is there a limit to how big those sets can be? In practice I'd be
> looking at a few dozen allowed chars.

Its trivially easy to test. Assuming you only use the printable subset
of the ascii set you wanted to use:

Lua 5.3.3  Copyright (C) 1994-2016 Lua.org, PUC-Rio
> a=''; for i=32,126 do a=a..string.char(i) end; print(a)
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
> b=a:gsub("([%%%-%]])","%%%1"); print(b);
 !"#$%%&'()*+,%-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\%]^_`abcdefghijklmnopqrstuvwxyz{|}~
> print(a:gsub("[^"..b.."]","x"))
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
   0
> print(a:gsub("["..b.."]","x"))
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
   95

I do not have a 5.1 handy, but you can test it easily, and given how
char classes are usually made ( I haven't looked in lua, but I've seen
several ), they normally are unlimited, specially when dealing with
bytes.

Francisco Olarte.

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Roberto Ierusalimschy
In reply to this post by nobody
> (And yes always putting in the escaping '%' works fine and is probably
> safer/simpler than a bunch of special cases if you're generating the set…)

You can safely escape any non-alphanumeric character. Escaping
some alphanumeric characters changes their meanings.

    > string.gsub("abc", "[a]", print);
    a
    > string.gsub("abc", "[%a]", print);
    a
    b
    c

The following line should do the trick:

  s = string.gsub(s, "%W", "%%%0")

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Luiz Henrique de Figueiredo
In reply to this post by nobody
> (And yes always putting in the escaping '%' works fine and is probably
> safer/simpler than a bunch of special cases if you're generating the set…)

You should not escape all letters because some of them represent
classes. For instance "[%d]" is the same as "[0123456789]". You can
escape all punctuation.

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Viacheslav Usov
In reply to this post by Scott Morgan
On Tue, Aug 27, 2019 at 2:46 PM Scott Morgan <[hidden email]> wrote:

> Is there a limit to how big those sets can be? In practice I'd be looking at a few dozen allowed chars.

I would rather ask about the complexity/efficiency of this. I may be mistaken, but the manual seems silent about this.

Cheers,
V.

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Scott Morgan
In reply to this post by Roberto Ierusalimschy
On 27/08/2019 15:17, Roberto Ierusalimschy wrote:
> The following line should do the trick:
>
>   s = string.gsub(s, "%W", "%%%0")
>

Neat, but where is '%W' documented?

Scott

v
Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

v
On Tue, 2019-08-27 at 16:38 +0100, Scott Morgan wrote:
> On 27/08/2019 15:17, Roberto Ierusalimschy wrote:
> Neat, but where is '%W' documented?
>
> Scott

https://www.lua.org/manual/5.3/manual.html:
> %w: represents all alphanumeric characters.
> For all classes represented by single letters (%a, %c, etc.), the
corresponding uppercase letter represents the complement of the class.

--
v <[hidden email]>


Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Francisco Olarte
In reply to this post by Viacheslav Usov
On Tue, Aug 27, 2019 at 5:05 PM Viacheslav Usov <[hidden email]> wrote:
> On Tue, Aug 27, 2019 at 2:46 PM Scott Morgan <[hidden email]> wrote:
> > Is there a limit to how big those sets can be? In practice I'd be looking at a few dozen allowed chars.
> I would rather ask about the complexity/efficiency of this. I may be mistaken, but the manual seems silent about this.

And you'll be right to ask it. It seems to be interpreted by a compact
loop inside the lib, not the fastest, but should be adeuqate unless he
has a very demanding app ( hot loop, long strings, ... ). I wrote a
previous message thinking on other pattern engines I've used ( which
compile the pattern, i.e., I've used some which compile a class to
min-char, max-char, bit mask or similar stuff ), then I remembered lua
is simple, so I said "The Source Olarte, use The Source" and looked it
up. The relevant function should be fast, but a hot spot may benefit a
lot from some character class optimization ( like collapsing
abcdefghijk to a-k and similar stuff ), otherwise the runtime seems
proportional to len(class)*len(string).

Francisco Olarte.

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Gé Weijers
In reply to this post by Scott Morgan


On Tue, Aug 27, 2019 at 1:55 AM Scott Morgan <[hidden email]> wrote:
What's the best way to filter out non-specified chars from an ASCII
string? (so no need to worry about UTF8, etc.)


There are ways to do this without having to escape chars in regular expressions, and get fairly decent performance.

'make_filter' returns a table/object that does the filtering. The table is used as a cache to store per-character results. The function assigned to __index
has to be evaluated only once for each character value, so it's never called more than 256 times. The actual substitution work is done by a single call to 'string.gsub' with the table as its third parameter.

local function make_filter(valid_chars)
        return setmetatable({}, {
                __index = function(t, c)
                        local r = string.find(valid_chars, c, 1, true) and c or ""
                        t[c] = r
                        return r
                end,

                __call = function(t, s)
                        return (string.gsub(s, '.', t))
                end
        })
end

local filter = make_filter("abc[]")

print(filter("[abcABCabc]"))

--

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

szbnwer@gmail.com
hi there! :)


Roberto Ierusalimschy:

> The following line should do the trick:
>  s = string.gsub(s, "%W", "%%%0")


my solution was:
`string.escPattern=function(str) return str:gsub('%p', '%%%0') end`

(((im lying, it was actually:
`string.escPattern=function(str) return str:gsub('(%p)', '%%%1') end`
but the other variant looks better, and i think it should be more
optimal and micro optimization is important when the ice caps are
already melting! just think about those poor penguins and linux! :D
)))

that means less chars to escape (white spaces and control charsr), and
a non-inverted char class is maybe (i guess) faster as it is more
straightforward

and this should be legit, as the manual explicitly allows this:

"%x: (where x is any non-alphanumeric character) represents the
character x. This is the standard way to escape the magic characters.
Any punctuation character (even the non magic) can be preceded by a
'%' when used to represent itself in a pattern."

(so it says `'%W'` is legit, but the last sentence is the point, and
every magic chars are in `'%p'`, and i dont even believe that anything
else will ever become a magic char than punctuation chars...)


all the bests to all of u! :)

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

nobody
In reply to this post by nobody
On 27/08/2019 16.17, Roberto Ierusalimschy wrote:
> You can safely escape any non-alphanumeric character. Escaping some
> alphanumeric characters changes their meanings.

On 27/08/2019 16.26, Luiz Henrique de Figueiredo wrote:
> You should not escape all letters because some of them represent
> classes. For instance "[%d]" is the same as "[0123456789]". You can
> escape all punctuation.

Right, both character classes (% + letter) and captures (% + digit) are
a thing.  I shouldn't be writing "quick mails" when mentally impaired.
(It's effin' hot outside…)

Sorry.

-- nobody

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Philippe Verdy
In reply to this post by szbnwer@gmail.com
If you think about microoptimizations, then using %p to escape all punctuations, instead of just those used in character classes will in fact escape too many characters, creating a longer pattern that will compile more slowly and will use more memory than needed. for then creating the final pattern for substitutions.

In fact, there are very few (only 4) characters that need escaping in character classes: only '%', '-', ']' and '^' (note also that '^' is special only at start of the character class, and '-' is special only between two other characters possibly escaped, you can avoid escaping them by placing them at non-special positions in the class pattern, if you write the pattern manually, but for generated patterns, you can avoid this complication and just stick to these 4):

string.escPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end`  

So to generate a generic character class from an arbitrary set of (ASCII) characters specified in a string, you only need to escape these four ones and surround the whole with '[]' to generate the character class pattern. you may also optionally want to avoid including the same characters multiple times in the generated pattern, but the code is more complex, and in my opinion this code exists (and is more efficient) within the native implementation of Lua patterns in the Lua library.

Note that the suggestion to use "%p" does not work as it doesn't escape the "^" and "%" (which are not punctuations like "-" and "]", they are symbols) !!!
So this is also wrong:
string.escPattern=function(str) return str:gsub('%p', '%%%0') end`  

Once this generated character class will be compiled (fast), it will also have the same runtime independantly of the text on which you'll search the pattern and optionally make replacements: the time does not depend at all on the "length" of the character class (i.e. the number of characters it matches) as the compiled regexp will become an internal array/vector/bitmap where each character of the text to parse will be looked up by a single direct access (after an initial range check for the index). You can use the character class "%p" or "%P" or "[A]" or "[A-QTZ]" or ".", the runtime performance of searches/replace will be the same as they will all match a single character. So the actual final performance (after compilation of the pattern) will be identical. If you consider the compilation time, it does not depend at all on the length text to scan but on the length of the pattern itself, and you should minimize it: a pattern "[A-Z]" is then faster to compile than "[ABCDEFGHIJKLMNOPQRSTUVWXYZ]", and as well the transformation of the set of characters in a string ".+-?" into the pattern "[.+%-?]" will be better (if you just escape the 4 special characters) than the character class pattern "[%.+%-%?]" which has two unnecessary escapes for "." and "?".


Le mar. 27 août 2019 à 20:58, [hidden email] <[hidden email]> a écrit :
hi there! :)


Roberto Ierusalimschy:

> The following line should do the trick:
>  s = string.gsub(s, "%W", "%%%0")


my solution was:
`string.escPattern=function(str) return str:gsub('%p', '%%%0') end`

(((im lying, it was actually:
`string.escPattern=function(str) return str:gsub('(%p)', '%%%1') end`
but the other variant looks better, and i think it should be more
optimal and micro optimization is important when the ice caps are
already melting! just think about those poor penguins and linux! :D
)))

that means less chars to escape (white spaces and control charsr), and
a non-inverted char class is maybe (i guess) faster as it is more
straightforward

and this should be legit, as the manual explicitly allows this:

"%x: (where x is any non-alphanumeric character) represents the
character x. This is the standard way to escape the magic characters.
Any punctuation character (even the non magic) can be preceded by a
'%' when used to represent itself in a pattern."

(so it says `'%W'` is legit, but the last sentence is the point, and
every magic chars are in `'%p'`, and i dont even believe that anything
else will ever become a magic char than punctuation chars...)


all the bests to all of u! :)

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

Philippe Verdy
Le mar. 27 août 2019 à 22:43, Philippe Verdy <[hidden email]> a écrit :
If you think about microoptimizations, then using %p to escape all punctuations, instead of just those used in character classes will in fact escape too many characters, creating a longer pattern that will compile more slowly and will use more memory than needed. for then creating the final pattern for substitutions.

In fact, there are very few (only 4) characters that need escaping in character classes: only '%', '-', ']' and '^' (note also that '^' is special only at start of the character class, and '-' is special only between two other characters possibly escaped, you can avoid escaping them by placing them at non-special positions in the class pattern, if you write the pattern manually, but for generated patterns, you can avoid this complication and just stick to these 4):

string.escClassPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end`

And you can also reduce this function because '-' and '^' are not special in all positions of the character class:
string.escClassPattern=function(str) return str:gsub('[-%%%]^]', '%%%0') end`

A generic pattern escaping function that can be used to match arbitrary literal strings is similar but has to escape eight characters: "$", ".", "?", "*", "+", "(", ")", "[", and "\" in addition to the 4 previous ones. You'll note that not all these are matched by "%p" (because they are also not all punctuations.)

string.escLiteralPattern=function(str) return str:gsub('[-$%%()*+.?[\\%]^]', '%%%0') end`

Note that above, no need to escape in the class pattern any other characters than "%" and "]", because all other characters in the class are interpreted as litterals. Note that \ is not escaped in the pattern style with a '%' but in the Lua string syntax with a '\'  (this '\' escape disappears completely at runtime, unlike '%' which remains to compile the pattern at runtime).

The two functions above cannot be unified into a single one, they are not used at all for the same purpose: the first one will be used to build a character class matching a single character, the second one will be used to build a pattern matching a multicharacter string literal.
Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

szbnwer@gmail.com
hi Philippe! :)


> In fact, there are very few (only 4) characters that need escaping in character classes: only '%', '-', ']' and '^' (note also that '^' is special only at start of the character class, and '-' is special only between two other characters possibly escaped, you can avoid escaping them by placing them at non-special positions in the class pattern, if you write the pattern manually, but for generated patterns, you can avoid this complication and just stick to these 4):

> string.escPattern=function(str) return str:gsub('[%%%-%]%^]', '%%%0') end`

thats right about escaping char sets, except that 5.1 (not above)
needs `'\0'` --> `'%z'` as well, but i reflected to Roberto's stuff
actually, but not exactly to Scott (op) :D otherwise all the magic
should be escaped...

[side note, but still can be interesting :D ]
my use case was to match `'...veryLongTrimmedPath/whateverFile.lua'`
(without the `'...'`) from the error messages to files that my
monkeypatched `require()` (and `loadfile()` and `dofile()`, but i dont
use them if im right) registering (by looking up possible paths from
`package.path`; `package.cpath` isnt important here, maybe nasty
source code lookup for function definitions could make it interesting)
when the files are loaded, so i can print a few lines with the
tracebacks. cuz simple things like `'.'` in the paths could give some
side effects. however later i discovered that the 4th arg of
`string.find()` can do the trick, even if it gives me the position
instead of the substring, but i think thats still a cheaper way to go,
but still more straightforward... ive made recently a function that
can search in strings/files (i mentioned it earlier somewhere with a
few more details) and i was happy that i already had this simple gem
around, and then i realized the `string.find()` stuff that was fine
also there, and even better, cuz it plays better with sub-patterns, as
i needed something like `'()('.. term.. ')()'` to get the position...
so this invalidated my use cases, but i still kept that nice one-liner
gem. :D
[/side note]

branching for `'^'`, `'-'` and whatever can take away some
performance, i think, but i didnt test it


> Note that the suggestion to use "%p" does not work as it doesn't escape the "^" and "%" (which are not punctuations like "-" and "]", they are symbols) !!!

from the point of view of lua (inherited from c (`ispunct()`)), they are :D


> Once this generated character class will be compiled (fast), it will also have the same runtime independantly of the text on which you'll search the pattern and optionally make replacements: the time does not depend at all on the "length" of the character class (i.e. the number of characters it matches) as the compiled regexp will become an internal array/vector/bitmap where each character of the text to parse will be looked up by a single direct access (after an initial range check for the index).

that sounds good, and also possible, maybe luajit (and/or lpeg and/or
5.4) does that, but its not the case with 5.1-5.3, see:
https://www.lua.org/source/5.1/lstrlib.c.html#matchbracketclass and
`match_class()` right above, otherwise `isalpha()` and the like maybe
does that on individual char classes, but i think they are just doing
some range checks. i have some notes about this topic, but i forgot to
record the source, maybe wikipedia gave it:
"#define isdigit(x) ((x) >= '0' && (x) <= '9')
This can cause problems if x has a side effect---for instance, if one
calls isdigit(x++) or isdigit(run_some_program())"
and i think that these works the same but they arent implemented as
macros anymore

btw i like the idea, a bool array from 0 to 255 could really do the
trick, but it would require 256bytes for every char set, so, on the
contrary, thats not memory efficient...


> string.escLiteralPattern=function(str) return str:gsub('[-$%%()*+.?[\\%]^]', '%%%0') end`

you could still utilize `'%p'` here if not all the magic chars would
be covered, but that have side effects, so this one is only for an
another variation :D utilizing `'%p'` is still the most simple and
straightforward in most cases, but there are cases where precision
would require this latter version that you made, so it actually has
its pure right to exist! :D


bests! :)

Reply | Threaded
Open this post in threaded view
|

Re: Filtering chars from string

szbnwer@gmail.com
> string.escLiteralPattern=function(str) return str:gsub('[-$%%()*+.?[\\%]^]', '%%%0') end`

i would write this as:
`string.escLiteralPattern=function(str) return
str:gsub('[-^$()%%.[%]*+?]', '%%%0') end`
as this preserves the order of the magic chars from the manual, except
for the `'-'`, and the backslash doesnt needed to be escaped if im
right, or at least it isnt a magic character, and it cant have any
side effects, but im not 100% sure :D but the point of not using
`'%p'` is the precise escaping, so ive just thought that its better to
leave out the backslash from the game here...

otherwise embedded zeroes could still fool us, as patterns in 5.1
doesnt like that, but that would require a 2nd run for `gsub()`

so play this:
`string.escLiteralPattern=function(str) return str:gsub('%z',
'%%%z'):gsub('[-^$()%%.[%]*+?]', '%%%0') end`

and this just mentally overwhelmed me (a.m. 2:30 here) for now and im
not sure if im right about the embedded zeroes as well, so dont take
this for sure, but count upon it! :D anyways, if this matters, then it
must be applied to the `'%p'` case as well...


clearance is welcome! :)

12