Problem with table key iteration when string keys are unpredictable


Paul E. Merrell, J.D.
Hi, All,

I'm working on an autoreplace script and am hoping for a tip that
might get me past a problem with the unpredictable order in which
table keys are returned. (Users enter key/value pairs in a GUI to
build a table of strings that will replace other strings.)

Consider:

function Replace_Substrings_in_String(s, tSubs)
  for k, v in pairs(tSubs) do
    s = string.gsub(s, k, v)
  end
  return s
end -- function

tSubs = {
    ["---"] = "—", -- em dash
    ["--"] = "–",   -- en dash
    ["sss"] = "§", -- section
    ["ssss"] = "§§", --- sections
    ["ppp"] = "¶", -- paragraph
    ["pppp"] = "¶¶", -- paragraphs
}

s = Replace_Substrings_in_String(s, tSubs)

(Special characters are UTF-8.)

Because the order in which Lua returns non-array keys is
unpredictable, this type of substitution is problematic. For example,
if Lua returns the key for the en dash (two hyphens) before the key
for the em dash (three hyphens), the script will produce an en dash
followed by a hyphen instead of an em dash.
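
A minimal sketch of that failure mode, using ordered arrays so that each order can be reproduced on demand (with pairs() the order is simply not specified). The hyphens are written as %- because a bare - is a magic character in Lua patterns; the helper name apply_in_order is hypothetical and not part of the script above.

```lua
-- Apply {pattern, replacement} pairs strictly in array order.
local function apply_in_order(s, ordered)
  for _, pair in ipairs(ordered) do
    s = string.gsub(s, pair[1], pair[2])
  end
  return s
end

local em, en = "—", "–"  -- UTF-8 em dash and en dash

-- Longest key first: the three hyphens become one em dash.
local good = apply_in_order("a---b", { {"%-%-%-", em}, {"%-%-", en} })

-- Shortest key first: "--" is consumed first, leaving a stray hyphen.
local bad = apply_in_order("a---b", { {"%-%-", en}, {"%-%-%-", em} })

print(good)  -- a—b
print(bad)   -- a–-b
```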

I'm particularly concerned with this problem because such
abbreviations are used nearly universally by power users of word
processors and thus the likelihood is high that my users will create
them.

So my question is how I can assure that when multiple abbreviations
share the same leading sequence of identical characters, the keys are
processed in longest to shortest order?  (I don't anticipate any
problems if all keys were processed in longest to shortest order.)

Thanks in advance,

Paul


Re: Problem with table key iteration when string keys are unpredictable

Steven Johnson
Probably you would be better served just turning them into (carefully ordered) arrays, then.

Either two arrays (say, Keys and Values) that you iterate in parallel or
interleave them (if you want to keep related things next to one another) like

tSubs = {
     "---", "—", -- em dash
     "--", "–",   -- en dash
     -- etc.
}

for i = 1, #tSubs, 2 do
  local k, v = tSubs[i], tSubs[i + 1]
  -- do stuff
end
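
To make the "do stuff" part concrete, here is one possible sketch of the interleaved layout in use. It assumes the keys are meant as literal text, so it uses string.find with the plain flag instead of gsub patterns; the helpers replace_plain and replace_all are hypothetical names, not anything from the original script.

```lua
-- Interleaved key/value array; the order is exactly as written.
local tSubs = {
  "---", "—",  -- em dash (listed before "--" on purpose)
  "--",  "–",  -- en dash
}

-- Replace every literal occurrence of `from` with `to` (no pattern magic).
local function replace_plain(s, from, to)
  assert(from ~= "", "empty key")
  local out, pos = {}, 1
  while true do
    local i, j = string.find(s, from, pos, true)  -- true = plain search
    if not i then break end
    out[#out + 1] = string.sub(s, pos, i - 1)
    out[#out + 1] = to
    pos = j + 1
  end
  out[#out + 1] = string.sub(s, pos)
  return table.concat(out)
end

-- Walk the array two slots at a time: key, then value.
local function replace_all(s, subs)
  for i = 1, #subs, 2 do
    s = replace_plain(s, subs[i], subs[i + 1])
  end
  return s
end

print(replace_all("1990--1995 --- done", tSubs))  -- 1990–1995 — done
```

The trade-off is that whoever fills the array is responsible for the order, so user-entered pairs would still need sorting before they go in.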




Re: Problem with table key iteration when string keys are unpredictable

Sean Conner
In reply to this post by Paul E. Merrell, J.D.
It was thus said that the Great marbux once stated:

> Hi, All,
>
> I'm working on an autoreplace script and am hoping for a tip that
> might get me past a problem in unpredictability of key names. (Users
> enter key/value pairs in a GUI to build a table of strings that will
> replace other strings.)
>
> Because the order in which Lua returns non-array keys is
> unpredictable, this type of substitution is problematic. For example,
> if Lua returns the key for the en dash (two hyphens) before the key
> for the em dash (three hyphens), the script will produce instead of an
> em dash an en dash trailed by a hyphen.
>
> So my question is how I can assure that when multiple abbreviations
> share the same leading sequence of identical characters, the keys are
> processed in longest to shortest order?  (I don't anticipate any
> problems if all keys were processed in longest to shortest order.)

  You could do something like this:

        list = {}
        repeat
          local key,value = get_next_substitution()
          list[key] = value
          list[#list + 1] = key
        until done
        table.sort(list,function(a,b) return #a > #b end)

  At the end, the array would look like:

        list =
        {
          ["---"] = "—",
          ["--"] = "–",
          ["sss"] = "§",
          ["ssss"] = "§§",
          ["ppp"] = "¶",
          ["pppp"] = "¶¶",
          [1] = "pppp",
          [2] = "ssss",
          [3] = "---",
          [4] = "ppp",
          [5] = "sss",
          [6] = '--'
        }

  With that:

        function Replace_Substrings_in_String(s,tSubs)
          for i = 1 , #tSubs do
            s = s:gsub(tSubs[i],tSubs[tSubs[i]])
          end
          return s
        end
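
  A self-contained sketch of building that combined table, with a fixed list standing in for the GUI (the `input` array is a made-up stand-in for get_next_substitution, which is not shown in this thread). It relies on table.sort only touching the array part, list[1] through list[#list], so the string-keyed entries stay put.

```lua
-- Hypothetical input: pairs as the GUI might hand them over, in any order.
local input = { {"--", "–"}, {"---", "—"}, {"sss", "§"}, {"ssss", "§§"} }

local list = {}
for _, pair in ipairs(input) do
  local key, value = pair[1], pair[2]
  list[key] = value       -- hash part: key -> replacement
  list[#list + 1] = key   -- array part: the key itself, for ordering
end

-- Sorts only list[1..#list]; longest keys end up first.
table.sort(list, function(a, b) return #a > #b end)

for i = 1, #list do
  print(i, list[i], list[list[i]])
end
```

  The string keys and the integer indices cannot collide, because a user-entered key is always a string, even if it reads "1".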

  I also did something similar in LPeg, which goes through the string once.
I describe it here:

        http://boston.conman.org/2013/01/14.1

  -spc



Re: Problem with table key iteration when string keys are unpredictable

Tim Hill
In reply to this post by Paul E. Merrell, J.D.
As you state, in general you want to process the longest keys first. The simplest way to do this is to construct an auxiliary table containing an array where the values are the keys, and then sort this array by length of string (this needs a comparison function, because a plain table.sort orders strings lexicographically rather than by length). Then, index through THIS array in reverse order when doing the substitution (which will give you the keys, getting the value is easy).

For example:

tSubs = { …. } -- Fill in the strings table as you currently have it
tSubsArray = {}
for k, v in pairs(tSubs) do
        table.insert(tSubsArray, k)
end
table.sort(tSubsArray, function(a, b) return #a < #b end) -- Sorts strings shortest first

function Replace_Substrings_in_String(s, tSubs, tSubsArray)
        for ix = #tSubsArray, 1, -1 do -- Reverse order so longest strings (keys) first
                local k = tSubsArray[ix]
                local v = tSubs[k]
                -- Now you have k and v; the substitution itself is the same as before
                s = string.gsub(s, k, v)
        end
        return s
end


On May 20, 2013, at 7:11 PM, marbux <[hidden email]> wrote:

> Hi, All,
>
> I'm working on an autoreplace script and am hoping for a tip that
> might get me past a problem in unpredictability of key names. (Users
> enter key/value pairs in a GUI to build a table of strings that will
> replace other strings.)
>
> Consider:
>
> function Replace_Substrings_in_String(s, tSubs)
>  for k, v in pairs(s, tSub) do
>    s = string.gsub(s, k, v)
>  end
> return s
> end -- function
>
> tSubs = {
>    ["---"] = "—", -- em dash
>    ["--"] = "–",   -- en dash
>    ["sss"] = "§", -- section
>    ["ssss"] = "§§", --- sections
>    ["ppp"] = "¶", -- paragraph
>    ["pppp"] = "¶¶", -- paragraphs
> {
>
> s = Replace_Substrings_in_String(s, tSubs)
>
> (Special characters are UTF-8.)
>
> Because the order in which Lua returns non-array keys is
> unpredictable, this type of substitution is problematic. For example,
> if Lua returns the key for the en dash (two hyphens) before the key
> for the em dash (three hyphens), the script will produce instead of an
> em dash an en dash trailed by a hyphen.
>
> I'm particularly concerned with this problem because such
> abbreviations are used nearly universally by power users of word
> processors and thus the likelihood is high that my users will create
> them.
>
> So my question is how I can assure that when multiple abbreviations
> share the same leading sequence of identical characters, the keys are
> processed in longest to shortest order?  (I don't anticipate any
> problems if all keys were processed in longest to shortest order.)
>
> Thanks in advance,
>
> Paul
>



Re: Problem with table key iteration when string keys are unpredictable

S. Fisher
In reply to this post by Paul E. Merrell, J.D.


--- On Mon, 5/20/13, marbux <[hidden email]> wrote:


> I'm working on an autoreplace script and am hoping for a tip that
> might get me past a problem in unpredictability of key names. (Users
> enter key/value pairs in a GUI to build a table of strings that will
> replace other strings.)
>
> Consider:
>
>  function Replace_Substrings_in_String(s, tSubs)
>   for k, v in pairs(s, tSub) do
>     s = string.gsub(s, k, v)
>   end
>  return s
> end -- function
>
> tSubs = {
>     ["---"] = "—", -- em dash
>     ["--"] = "–",   -- en dash
>     ["sss"] = "§", -- section
>     ["ssss"] = "§§", --- sections
>     ["ppp"] = "¶", -- paragraph
>     ["pppp"] = "¶¶", -- paragraphs
> {
>
> s = Replace_Substrings_in_String(s, tSubs)
>
> (Special characters are UTF-8.)
>
> Because the order in which Lua returns non-array keys is
> unpredictable, this type of substitution is problematic. For example,
> if Lua returns the key for the en dash (two hyphens) before the key
> for the em dash (three hyphens), the script will produce instead of an
> em dash an en dash trailed by a hyphen.
>
> So my question is how I can assure that when multiple abbreviations
> share the same leading sequence of identical characters, the keys are
> processed in longest to shortest order?  (I don't anticipate any
> problems if all keys were processed in longest to shortest order.)

function escape( s )
  return string.gsub(s, "[][^$()%.*+-?]", "%%%0" )
end

function substitute(s, substitutions)
  sorted = {}
  for k, _ in pairs( substitutions ) do
    table.insert( sorted, k )
  end
  table.sort( sorted, function (a,b) return #a > #b end )
  for _, key in ipairs( sorted ) do
    s,cnt = string.gsub(s, escape(key), substitutions)
  end
  return s
end



Re: Problem with table key iteration when string keys are unpredictable

S. Fisher


--- On Tue, 5/21/13, S. Fisher <[hidden email]> wrote:


>
> function substitute(s, substitutions)
>   sorted = {}
>   for k, _ in pairs( substitutions ) do
>     table.insert( sorted, k )
>   end
>   table.sort( sorted, function (a,b) return #a > #b end )
>   for _, key in ipairs( sorted ) do
>     s,cnt = string.gsub(s, escape(key), substitutions)
>   end
>   return s
> end
>

function substitute(s, substitutions)
  local sorted = {}
  for k, _ in pairs( substitutions ) do
    table.insert( sorted, k )
  end
  table.sort( sorted, function (a,b) return #a > #b end )
  for _, key in ipairs( sorted ) do
    s = string.gsub(s, escape(key), substitutions)
  end
  return s
end




Re: Problem with table key iteration when string keys are unpredictable

Philipp Janda
In reply to this post by S. Fisher
On 21.05.2013 12:48, S. Fisher wrote:
>
> function escape( s )
>    return string.gsub(s, "[][^$()%.*+-?]", "%%%0" )
> end
>

Try:

     > =escape( "12345;:/<>%" )
     %1%2%3%4%5%;%:%/%<%>% 10

which is probably not what you wanted. So you need to put a few more
escapes in there (at least before `%`, `.`, and `-`) ...
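
The reason the digits get escaped: inside a character class, `+-?` is read as a range from "+" (byte 43) to "?" (byte 63), which covers the digits and several punctuation marks, while `%.` uses up the percent sign as an escape, so a literal % never makes it into the set. A small sketch that puts the two classes side by side; the variable names are only for illustration.

```lua
local buggy = "[][^$()%.*+-?]"   -- "+-?" is a range; "%." is just an escaped "."
local fixed = "[][^$()%%.*+?-]"  -- "%%" is a literal "%"; a trailing "-" is literal

local sample = "12345;:/<>%"

-- The outer parentheses keep only gsub's first result (the string).
local with_buggy = (string.gsub(sample, buggy, "%%%0"))
local with_fixed = (string.gsub(sample, fixed, "%%%0"))

print(with_buggy)  -- %1%2%3%4%5%;%:%/%<%>%
print(with_fixed)  -- 12345;:/<>%%
```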

Philipp


p.s.: Maybe we should have a string.escape() function in Lua's standard
library (or at least an example in the ref manual ready for copy &
paste) ...




Re: Problem with table key iteration when string keys are unpredictable

S. Fisher
--- On Tue, 5/21/13, Philipp Janda <[hidden email]> wrote:

> On 21.05.2013 12:48, S. Fisher wrote:
> >
> > function escape( s )
> >    return string.gsub(s, "[][^$()%.*+-?]", "%%%0" )
> > end
> >
>
> Try:
>
>     > =escape( "12345;:/<>%" )
>     %1%2%3%4%5%;%:%/%<%>% 10
>
> which is probably not what you wanted. So you need to put a few more
> escapes in there (at least before `%`, `.`, and `-`) ...

function escape( s )
  return string.gsub(s, "[][^$()%%.*+?-]", "%%%0" )
end

> =escape( "1234-5;:/<>.%" )
1234%-5;:/<>%.%%        3
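
For reference, a sketch of the corrected escape() plugged into the earlier substitute(), run against a few of the keys from the original post. The only change assumed here is the extra pair of parentheses in escape(), which drops gsub's second return value; the sample input string is made up.

```lua
-- Escape magic characters using the corrected class.
-- The outer parentheses drop gsub's second return value (the count).
local function escape(s)
  return (string.gsub(s, "[][^$()%%.*+?-]", "%%%0"))
end

local function substitute(s, substitutions)
  local sorted = {}
  for k in pairs(substitutions) do
    sorted[#sorted + 1] = k
  end
  -- Longest keys first, so "---" is handled before "--".
  table.sort(sorted, function(a, b) return #a > #b end)
  for _, key in ipairs(sorted) do
    -- With a table as replacement, gsub looks up each match in it.
    s = string.gsub(s, escape(key), substitutions)
  end
  return s
end

local tSubs = { ["---"] = "—", ["--"] = "–", ["sss"] = "§", ["ssss"] = "§§" }
print(substitute("a---b--c sss ssss", tSubs))  -- a—b–c § §§
```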



Re: Problem with table key iteration when string keys are unpredictable

Sean Conner
In reply to this post by Philipp Janda
It was thus said that the Great Philipp Janda once stated:

> On 21.05.2013 12:48, S. Fisher wrote:
> >
> >function escape( s )
> >   return string.gsub(s, "[][^$()%.*+-?]", "%%%0" )
> >end
> >
>
> Try:
>
>     > =escape( "12345;:/<>%" )
>     %1%2%3%4%5%;%:%/%<%>% 10
>
> which is probably not what you wanted. So you need to put a few more
> escapes in there (at least before `%`, `.`, and `-`) ...
>
> Philipp
>
>
> p.s.: Maybe we should have a string.escape() function in Lua's standard
> library (or at least an example in the ref manual ready for copy &
> paste) ...

  Or build up an LPeg expression: http://boston.conman.org/2013/01/14.1

  -spc



Re: Problem with table key iteration when string keys are unpredictable

Petite Abeille
In reply to this post by Philipp Janda

On May 21, 2013, at 1:24 PM, Philipp Janda <[hidden email]> wrote:

> which is probably not what you wanted. So you need to put a few more escapes in there (at least before `%`, `.`, and `-`) …

Quoting the manual… "Any punctuation character (even the non magic) can be preceded by a '%' when used to represent itself in a pattern."… so… perhaps this can all be simplified to just gsub( '%p', '%%%0' )…

For example:

print( ( '1234-5;:/<>.%' ):gsub( '%p', '%%%0' ) )

> 1234%-5%;%:%/%<%>%.%% 8
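
One practical detail when wrapping this in a function: gsub returns the result string and a match count, so a bare `return s:gsub(...)` hands both values back to the caller. A sketch with the extra value dropped; the name escape_pattern is hypothetical, and %p is assumed to behave as in the default C locale.

```lua
local function escape_pattern(s)
  -- Outer parentheses truncate gsub's results to the string alone.
  return (s:gsub("%p", "%%%0"))
end

local key = "a.b*c"
local escaped = escape_pattern(key)

print(escaped)                           -- a%.b%*c
print(select("#", escape_pattern(key)))  -- 1
print(string.find("xa.b*cx", escaped))   -- 2  6
print(string.find("xaXb*cx", escaped))   -- nil
```

The last two lines show the point of escaping: the escaped key matches itself literally, and the "." no longer matches an arbitrary character.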

http://www.lua.org/manual/5.2/manual.html#6.4.1



Re: Problem with table key iteration when string keys are unpredictable

Paul E. Merrell, J.D.
My thanks to you all. I will test each of these methods and go with
the one that seems the most foolproof, given other code in the script.

A special thanks for noticing that the script also needs to escape
magic characters, which I had not yet caught.

Best regards,

Paul


Re: Problem with table key iteration when string keys are unpredictable

Dirk Laurie-2
2013/5/22 marbux <[hidden email]>:
> My thanks to you all. I will test each of these methods and go with
> the one that seems the most foolproof, given other code in the script.
>
> A special thanks for noticing that the script also needs to escape
> magic characters, which I had not yet caught.

Since nobody has up to now supported Sean's suggestion of LPEG,
I'll do so now. I recently put in the required amount of activation energy
to get my LPEG competence going, and the reaction is now exothermic.
In non-chemical terms, there's a steep learning curve but once you
reach the plateau you just keep going effortlessly.

Building a pattern that tries several possibilities in a prescribed order
is one of the most basic LPEG constructs: a1+a2+a3+...+an

Prescribing the replacement is hardly less basic:

a1/b1

Saying "move up one character if there is no match" is easy too

a1+1

Repeat indefinitely:

a1^0

Do the substitutions and leave the rest unchanged:

lpeg.Cs(P)

E.g.

    lpeg = require "lpeg"
    Cs, P = lpeg.Cs, lpeg.P
    pat = Cs((P"ABC"/"abc" + P"AB"/"de" + P"A"/"f"+P(1)/".")^0)
    print (pat:match"ABCBAABCABABCABCBABCBCBAB")
abc.fabcdeabcabc.abc...de
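
For readers without LPeg installed, the same idea (one left-to-right pass, alternatives tried in a fixed order, one character skipped when nothing matches) can be approximated in plain Lua. This is a stand-in sketch, not LPeg and not code from this thread; the function name replace_once is hypothetical, and unlike the example above it copies unmatched characters unchanged instead of turning them into ".".

```lua
-- Single left-to-right pass; at each position try the keys longest first.
local function replace_once(s, subs)
  local keys = {}
  for k in pairs(subs) do
    if k ~= "" then keys[#keys + 1] = k end  -- ignore empty keys
  end
  table.sort(keys, function(a, b) return #a > #b end)

  local out, pos = {}, 1
  while pos <= #s do
    local matched = false
    for _, k in ipairs(keys) do
      -- Literal comparison, so no pattern escaping is needed.
      if string.sub(s, pos, pos + #k - 1) == k then
        out[#out + 1] = subs[k]
        pos = pos + #k
        matched = true
        break
      end
    end
    if not matched then
      out[#out + 1] = string.sub(s, pos, pos)  -- copy one byte unchanged
      pos = pos + 1
    end
  end
  return table.concat(out)
end

print(replace_once("ABCBAABCABABCABCBABCBCBAB",
                   { ABC = "abc", AB = "de", A = "f" }))
-- abcBfabcdeabcabcBabcBCBde
```

A side effect of the single pass is that replacement text is never rescanned, so a value that happens to contain another key is left alone.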