split and splitlines (in 5.3)

split and splitlines (in 5.3)

Eduardo Ochs
Hi list,

I started using Lua 5.3 for some things and I noticed that the
function "splitlines" in my LUA_INIT file doesn't work in 5.3. Here
is its definition (warning: I wrote these functions 15+ years ago!
Ugly!):

  split = function (str, pat)
      local arr = {}
      string.gsub(str, pat or "([^%s]+)", function (word)
          table.insert(arr, word)
        end)
      return arr
    end
  splitlines = function (bigstr)
      local arr = split(bigstr, "([^\n]*)\n?")
      table.remove(arr)
      return arr
    end

It seems that the behavior of string.gsub changed from 5.2 to 5.3. In
5.1 and 5.2 the call

  arr = split(bigstr, "([^\n]*)\n?")

would always generate a sequence with an empty string at its end, and
table.remove would delete it.
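
A quick way to check which behavior a given interpreter has is to count how often the callback fires for that pattern. This is only a sketch: the helper name `count_matches` is made up for illustration, and the number printed depends on the interpreter it runs on.

```lua
-- Hypothetical helper: count how many times gsub reports a match.
local function count_matches(str, pat)
  local n = 0
  -- The callback returns nothing, so gsub leaves the text unchanged.
  string.gsub(str, pat, function () n = n + 1 end)
  return n
end

-- With the older behavior described above, the trailing empty match is
-- reported too, so the count is one higher than the number of lines.
print(_VERSION, count_matches("a\nb\n", "([^\n]*)\n?"))
```

If the count equals the number of lines, the final table.remove in splitlines deletes a real line instead of the empty string.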

I remember ___veeeery___ vaguely a discussion here on the list about a
change in string.gsub that made it stop substituting the empty string
at the end, but I can't find it now... does anyone have a pointer to
it? I don't know how to fix my code in an elegant way to make it work
in both 5.2 and 5.3, and I would like to take a look at the
discussion...

  Thanks in advance,
    Eduardo Ochs
    http://angg.twu.net/dednat6.html


Re: split and splitlines (in 5.3)

szbnwer@gmail.com
hi there! :)

im on luajit (i think 2.0.5, but makes no difference if im right) and
i just wanted to ask the same yesterday, cuz currently i have 4
solutions:

`(str.. '\n'):gsub'([^\n]*)\n'` does the right thing (a minus
instead of the asterisk also works iirc), but i think that
concatenation can be expensive with long strings: it instantly
duplicates the string in memory, and it has to go through the whole
string to do so. it works, but i use this for making a string
iterator where the string can have an arbitrary length, so its not
really good, and i would like to eliminate it in the future.
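
The "append a newline" idea can be sketched as a line iterator. This assumes the iterator is built on `string.gmatch` (the post does not say), and `each_line` is a made-up name.

```lua
-- Hypothetical wrapper around the "append a newline" idea.
local function each_line(str)
  -- The pattern always consumes at least the "\n", so it can never
  -- produce an empty match and does not depend on how gsub/gmatch
  -- treat empty matches.
  return (str .. "\n"):gmatch("([^\n]*)\n")
end

for line in each_line("one\n\nthree") do
  print(("[%s]"):format(line))
end
```

Empty lines in the middle are kept. If the input already ends in a newline, this sketch yields one extra empty line at the end, and the concatenation still copies the whole string, as noted above.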

the 2nd approach is `for p in package.path:gmatch'[^;]+;?'` that is
similar, while it wasnt sufficient elsewhere, if im right.

[sidenote]
this is used because i print some lines of the sources around the
given lines in the tracebacks, but the paths there are trimmed if they
are too long. so this is used in a wrapper for `require()` to register
the full paths, and if two or more trimmed paths collide, then i print
lines for all of them. i came up recently with the idea that
`debug.traceback()` could be hacked, as `debug.getinfo()` can obtain
the full paths, but actually errors dont use an overridden
`debug.traceback()`, so this cant be hacked nicely from the lua side
... sad story. (please tell me if im wrong here! :) )
[/sidenote]

the 3rd one is to not use `str:gsub()`, like:
`for i=1, #str do if str:sub(i, i)=='\n' then do whatever() end end`
but i dont care about win and mac line endings, so its not the best
for everyone, but fine for what i want, and its still hackable for
those platforms. in this case, you could actually pass a whole
function instead of a pattern, or even both can be handled (like:
`insertAFunctionHere(type(stuff)=='function' and stuff(str) or
str:gsub(type(stuff)=='string' and stuff or
'whateverDefaultPattern'))`). i think its basically cheap, and only
`str:sub()` is heavily used here, but i think its still cheaper than
`gsub()`.
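
One possible variation on this loop, sketched under the assumption that only `'\n'` line endings matter: a plain-text `str:find` jumps from newline to newline instead of taking a one-character `str:sub` per index. Whether it is actually faster would need a benchmark; `line_offsets` is a made-up name.

```lua
-- Hypothetical helper: return the 1-based start offset of every line.
local function line_offsets(str)
  local offsets = { 1 }
  local pos = 1
  while true do
    -- Fourth argument true: plain search, no pattern matching.
    local nl = str:find("\n", pos, true)
    if not nl then break end
    pos = nl + 1
    offsets[#offsets + 1] = pos
  end
  return offsets
end

for i, off in ipairs(line_offsets("ab\ncd\n\nef")) do
  print(i, off)
end
```

If the string ends in a newline, the last offset points one past the end of the string, i.e. at an empty final line.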

[less important part about the latter]
actually i use this only to get the offsets of the lines in a string
for a freshly made, almost finished searcher that works on files and
strings. it can iterate over lines (which just isnt sufficient in
every case; that iterator is made with the previous technique, while
the `for i=1, #str do` approach mentioned above is used only in the
next two cases), or over fixed-size chunks of text (for huge files;
this can miss matches at the cuts, similarly to the previous case, but
its more reliable if `'\n'` isnt involved, and with regular-sized
chunks in the file it can also be fine), or over the whole text, which
is the default and the safest, but the size could become a problem.
(suggestions are welcome! :D ) it can use either plain strings or
patterns for capturing. i think i will make the plain string case work
everywhere with a custom match function, but patterns wouldnt be happy
with cuts, and i dont really wanna write a whole pattern engine just
to handle this. i hope i never face the day when these possibilities
arent enough :D
[/less important part about the latter]

yesterday ive found a 4th, halfway sufficient solution that can cover
some use cases:
`str:gsub'[^\n]+'` but this skips empty lines, which isnt good for
searching, where line numbers are important. i will still change the
pattern for that `package.path` stuff if a benchmark tells me to :D
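
A small sketch of why this pattern breaks line numbering, written with `gmatch` and a made-up sample string:

```lua
local text = "first\n\nthird"
local n = 0
for line in text:gmatch("[^\n]+") do
  n = n + 1
  print(n, line)
end
-- "third" is reported as item 2 although it sits on line 3 of the text,
-- because the empty second line never matches "[^\n]+".
```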


[somewhat related thoughts]
im not really into using metatables (but recently ive found my 1st use
case for my near future plans :D ), so i checked (without much hope)
whether there is anything that could make a table behave like a string
and be indexable, so it could be given to any string function, but i
think theres no way to do that. if we had `str[from:to]` or even just
`str[idx]` like python (thats actually a nice thing there), then it
could be solved with the related metamethods, but now only wrappers
could solve this, and that wouldnt be worth it... btw my idea was to
fetch the continuation of the string only when its needed, so pattern
matching would work without feeding it a few gibs of a binary file or
the like...
[/somewhat related thoughts]

but after all, these caveats couldnt take lua any farther from my
heart. in general, no other fancy language could compete so far
:D

all the best, and thx for any help around the mentioned problems! :)


Re: split and splitlines (in 5.3)

szbnwer@gmail.com
ahh sorry, ive just reread your original message, and there is a
chance that i misunderstood your original question, so sorry if i
talked about too many different things and only wasted your time! :D


Re: split and splitlines (in 5.3)

Roberto Ierusalimschy
In reply to this post by Eduardo Ochs
> It seems that the behavior of string.gsub changed from 5.2 to 5.3. In
> 5.1 and 5.2 the
>
>   arr = split(bigstr, "([^\n]*)\n?")
>
> would always generate an sequence with an empty string at its end, and
> table.remove would delete it.
>
> I remember ___veeeery___ vaguely a discussion here in the list about a
> change in string.gsub that made it stop substituting the empty sting
> at the end, but I'm not being able to find it now... anyone has a
> pointer to it?

- http://lua-users.org/lists/lua-l/2013-04/msg00812.html
- http://lua-users.org/lists/lua-l/2016-05/msg00198.html

See also commit 783aa8a9da5f3.

-- Roberto


Re: split and splitlines (in 5.3)

Eduardo Ochs
Perfect!
Thanks!!!
By the way I am using this solution/workaround:

  split = function (str, pat)
      local arr = {}
      string.gsub(str, pat or "([^%s]+)", function (word)
          table.insert(arr, word)
        end)
      return arr
    end
  splitlines = function (bigstr)
      local arr = split(bigstr, "([^\n]*)\n?")
      if _VERSION:sub(5) < "5.3" then
        table.remove(arr)
      end
      return arr
    end

Now dednat6 works with versions of lualatex that use Lua 5.3 - and I
think that the line with "_VERSION:sub(5)" screams "I AM A TEMPORARY
FIX" loud enough.
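
As a possible alternative to comparing version strings, the behavior itself can be probed once at load time. This is a sketch, not part of dednat6: the probe, the flag name `has_trailing_empty_match`, and the guard for the empty string are all assumptions added for illustration.

```lua
-- Probe once: does gsub still report an empty match right after a
-- non-empty one? The older behavior fires the callback twice for "x".
local probe = 0
string.gsub("x", "x*", function () probe = probe + 1 end)
local has_trailing_empty_match = (probe == 2)

local function splitlines(bigstr)
  local arr = {}
  -- Assumed guard: both behaviors report one empty match for "", so
  -- handle it up front to get the same result everywhere.
  if bigstr == "" then return arr end
  string.gsub(bigstr, "([^\n]*)\n?", function (line)
      arr[#arr + 1] = line
    end)
  if has_trailing_empty_match then
    table.remove(arr)   -- drop the extra empty string at the end
  end
  return arr
end

print(#splitlines("a\nb\n"))
```

The probe keys off what string.gsub actually does rather than the version number, so it does not depend on which release introduced the change.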

  Cheers =) =),
    Eduardo Ochs
    http://angg.twu.net/dednat6.html

On Mon, 12 Aug 2019 at 15:44, Roberto Ierusalimschy <[hidden email]> wrote:
> It seems that the behavior of string.gsub changed from 5.2 to 5.3. In
> 5.1 and 5.2 the
>
>   arr = split(bigstr, "([^\n]*)\n?")
>
> would always generate an sequence with an empty string at its end, and
> table.remove would delete it.
>
> I remember ___veeeery___ vaguely a discussion here in the list about a
> change in string.gsub that made it stop substituting the empty sting
> at the end, but I'm not being able to find it now... anyone has a
> pointer to it?

- http://lua-users.org/lists/lua-l/2013-04/msg00812.html
- http://lua-users.org/lists/lua-l/2016-05/msg00198.html

See also commit 783aa8a9da5f3.

-- Roberto