Pooling of strings is good


Dirk Laurie-2
2014-08-22 10:21 GMT+02:00 Coroutines <[hidden email]>:

> Anyway, this whole thing has me thinking that I wish "strings"
> did not exist in Lua.  It's really just userdata and sometimes
> I wish Lua didn't pool strings.

Except that userdata is advanced, whereas strings are very,
very basic indeed, there are languages like SNOBOL that have
nothing but.

Think of pooling strings as Lua's way of dispensing with the
need to write "const" zillions of times as in C++ (or even C, for
that matter). It gives me great comfort that I can write

   if line=="rather long string that appears only once"

without losing run-time efficiency.
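
To illustrate why that comparison stays cheap, here is a minimal sketch of string pooling in C. It is not Lua's actual string table: the names `intern`, `interned_equal`, `pool` and `POOL_MAX` are made up for this example, and the linear search stands in for the hash lookup a real pool would use. The point is only that once every distinct string has a single shared copy, equality is a pointer comparison.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_MAX 64   /* toy capacity, chosen for illustration only */

static char  *pool[POOL_MAX];
static size_t pool_count;

/* Return the single shared copy of s, adding it to the pool on first use.
   Returns NULL if the toy pool is full or memory runs out. */
const char *intern(const char *s)
{
  assert(s != NULL);

  for (size_t i = 0; i < pool_count; i++)
    if (strcmp(pool[i], s) == 0)
      return pool[i];               /* already pooled: reuse it */

  if (pool_count == POOL_MAX)
    return NULL;

  size_t n    = strlen(s) + 1;
  char  *copy = malloc(n);
  if (copy == NULL)
    return NULL;

  memcpy(copy, s, n);
  pool[pool_count++] = copy;
  return copy;
}

/* Two pooled strings are equal exactly when they are the same object,
   so the cost does not depend on their length. */
int interned_equal(const char *a, const char *b)
{
  return a == b;
}
```

The expensive part (finding or creating the shared copy) is paid once, when the string enters the pool; every later comparison is a single pointer test.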

Re: Pooling of strings is good

Coroutines
On Fri, Aug 22, 2014 at 2:27 AM, Dirk Laurie <[hidden email]> wrote:

>
> Except that userdata is advanced, whereas strings are very,
> very basic indeed, there are languages like SNOBOL that have
> nothing but.
>
> Think of pooling strings as Lua's way of dispensing with the
> need to write "const" zillions of times as in C++ (or even C, for
> that matter). It gives me great comfort that I can write
>
>    if line=="rather long string that appears only once"

I am aware of what immutable, pooled strings afford me -- I simply
argue that it can be useful to have both facilities at your disposal.
Mutable, potentially duplicated strings and immutable, pooled strings.

I would rather we stop treating 'string' like a distinct type when it
is really carried on the back of userdata.  I'd much rather see
type('cat') == 'userdata'.  We have the ability to create subtypes of
userdata in Lua, we could even manage string pooling directly in a
lookup table.  In my opinion, it might be a big "win" to drop the
responsibility of handling pooled, immutable strings in C -- as
designed by upstream.

I play with networking.  I've written maybe 2-3 different (but
similar) socket libraries for Lua that I've never had the courage to
release.  While doing so, it's always been a loss to use Lua's
standard string type when I want to fetch data from a socket and pass
it to the user to be worked on easily with the string library or
third-party libraries.  It's *expensive* to create long, immutable
strings but the libraries I've come across for unpacking data deal
mostly with strings and not userdata.  A lot of the time it's come
down to just passing data from a connection directly through to a
string parser for human-readable protocols.

I'm simply saying: Let's take a look at how many pooled strings are
compared against.  It's great if you can do a pointer comparison to
check for equality -- I know that -- but I'd rather see 'userdata'
become a 'buffer' type, extended for arbitrary filling and
extending/growing -- which you can then define into the 'string' type
we already love with things like __type and __next for iterating over
utf8.

Sorry if this is a bit confused, I have a lot of ideas in different
directions :\  At minimum I wish more of the standard library were
written in pure Lua, so things like string.match() not accepting
patterns with NULs in them can't happen again.  I like eliminating
liability.  I wish pretty much only the debug library and the C
API were exposed to Lua, then we could write the rest in Lua directly
at a [negligible] performance loss.  From there we can define a pooled
string type under userdata as the community needs -- I wouldn't want
to see it gone, I just think both mutable and immutable strings are
useful.  And really I love things getting more generic -- this is very
off-topic but I was sad to see __pairs go.  I understand (Luiz's?)
argument that it'd be horrible to have 'one true iterator' for an
object, but I also feel like an iterator wrapped in a function is an
interface you can't predict if you wanted to wrap it in something
else.

Bah.  I've been working more in Javascript lately, reading the Lua
list over the last few weeks has been frustrating because it seems
like 5.3 is going to be jostling for all of us...

Kirk out.

PS: Strings are userdata :(  I don't want to treat them like a special
type when they *are* just userdata with added magic.  We could do it
in Lua.  userdata used for a buffer type would be much more applicable
~everywhere~, I feel like I have to turn to C to make the best of it
and that frustrates me.  I like working only in Lua if I can.

Re: Pooling of strings is good

Dirk Laurie-2
2014-08-22 12:02 GMT+02:00 Coroutines <[hidden email]>:

> I've been working more in Javascript lately

You have my sympathy. I'm so sorry for you that I won't even
say anything nasty about your remark that:

> I'd much rather see type('cat') == 'userdata'.

Re: Pooling of strings is good

Coroutines
On Fri, Aug 22, 2014 at 3:33 AM, Dirk Laurie <[hidden email]> wrote:
> 2014-08-22 12:02 GMT+02:00 Coroutines <[hidden email]>:
>
>> I've been working more in Javascript lately
>
> You have my sympathy. I'm so sorry for you that I won't even
> say anything nasty about your remark that:
>
>> I'd much rather see type('cat') == 'userdata'.

I've always been a fan of scenarios like: string.__type = function
(self) return original_type(self), 'string' end

local obj_type, obj_subtype = type('cat') -- 'userdata', 'string'

In my opinion, it would be nicer to stop pretending string is anything
other than "subclassed" from userdata.  The only "raw" types that
should exist are: nil, boolean, number, function, table, coroutine,
and userdata :p  -- and I'm pretty sure coroutines could be
represented as functions or vice-versa.  Python lets you yield from
any of its functions :p

Anyway, Javascript will never be as nice as Lua -- but I often think
about ways in which we can organize the fundamentals of Lua better.
I'm no expert, but this is where that thinking has led.  The mechanics
of Lua's string handling could be scripted from within Lua, and imo it
would be cleaner to do so.  Lua is very fast, but I think I'd be much
happier seeing the standard libs written in Lua directly (and the
string type) -- so that it's more accessible and "hackable" for those
embedding Lua.

I'll see myself out... lol.

Re: Pooling of strings is good

Jay Carlson

Mutable strings are...a folly of a quaint, bygone era. They are warts on otherwise timeless designs like Scheme and Smalltalk.

Get a mutable octet-vector object type. You'll be much happier.

Re: Pooling of strings is good

Coroutines
On Fri, Aug 22, 2014 at 3:59 AM, Jay Carlson <[hidden email]> wrote:
> Mutable strings are...a folly of a quaint, bygone era. They are warts on
> otherwise timeless designs like Scheme and Smalltalk.
>
> Get a mutable octet-vector object type. You'll be much happier.

Again I get this feeling like people love to speak in generalizations
on this list :\  Perhaps you know this: In networking/socket stuff
it's useful to reuse a buffer if you know you're receiving frames of a
fixed size, packets of a fixed size -- it can be wasteful and
contentious to allocate for strings to push this data into.  I work
with other libraries that expect strings -- this received data isn't
compared against; it is parsed (so pointer comparisons are not a huge
gain for pooled strings).  This data is almost always unique because
of things like the IP header making it so.  Userdata on its own is
pretty inaccessible -- you cannot make comparisons between userdata or
use the functions provided by the string library to parse userdata
directly.  Strings are a thin abstraction over userdata -- it feels
like I'm working around an API that should be more accommodating.  If I
wanted to minimally change Lua how it is now, I'd make it so userdata
can be used in the same places strings can.

Essentially:
immutable_string1 == immutable_string2 -- would be a pointer comparison
immutable_string1 == mutable_string3 -- would be a memcmp()
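
A rough sketch of that rule in C, assuming a hypothetical `str_t` value that records whether its bytes live in the pool (neither `str_t` nor `str_equal` exists in Lua's API, and the pointer test is only valid if the pool really does keep one copy per distinct string):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string value: 'pooled' marks an immutable, interned string;
   a mutable buffer would have pooled == 0. */
typedef struct {
  const char *data;
  size_t      len;
  int         pooled;
} str_t;

int str_equal(const str_t *a, const str_t *b)
{
  assert(a != NULL && b != NULL);

  if (a->pooled && b->pooled)
    return a->data == b->data;      /* both pooled: pointer comparison */

  /* at least one side is mutable: fall back to comparing the bytes */
  return a->len == b->len
      && memcmp(a->data, b->data, a->len) == 0;
}
```

The mutable case pays for a byte-by-byte comparison on every `==`, which is the cost being traded against the copy into the string pool.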

I'm somewhat surprised people doing embedded work don't wish userdata
could conform to being a mutable string type in OOM situations.
Modifying a string does mean allocating for it twice.

Re: Pooling of strings is good

Dirk Laurie-2
2014-08-22 13:13 GMT+02:00 Coroutines <[hidden email]>:

We should stop replying to you now. It's 04:15 in your timezone,
more than an hour after you announced you were calling it a night.

Re: Pooling of strings is good

Coroutines
On Fri, Aug 22, 2014 at 4:19 AM, Dirk Laurie <[hidden email]> wrote:
> 2014-08-22 13:13 GMT+02:00 Coroutines <[hidden email]>:
>
> We should stop replying to you now. It's 04:15 in your timezone,
> more than an hour after you announced you were calling it a night.

I actually stay up to make breakfast as if it were dinner :-)  I said
I was going to bed but only as a figure of speech -- I did not think
my ideas would "break well".  Some call me outlandish :p

Re: Pooling of strings is good

Axel Kittenberger
In reply to this post by Dirk Laurie-2
Strings are a native thing, since there are string literals in source code. I don't think anyone wants string literals to go. That is what makes them different from generic userdata. There is more to it due to automatic coercions and the '..' operator, but that's another story.

Mutables are bad. More mutables are worse. If I were to see a language develop, it would be more in the direction of immutables, in the direction of deeply frozen tables. Although it may seem counterintuitive at first, so many problems and issues go away if one uses immutables. For example, the whole issue of the computation time of the # operator vs. ipairs would go away in a poof.



Re: Pooling of strings is good

Coroutines
On Fri, Aug 22, 2014 at 4:41 AM, Axel Kittenberger <[hidden email]> wrote:

> Strings are a native thing, since there are string literals in source code.
> I don't think anyone wants string literals to go. That is what makes them
> different from generic userdata. There is more to it due to automatic
> coercions and the '..' operator, but that's another story.
>
> Mutables are bad. More mutables are worse. If I were to see a language
> develop, it would be more in the direction of immutables, in the direction
> of deeply frozen tables. Although it may seem counterintuitive at first, so
> many problems and issues go away if one uses immutables. For example, the
> whole issue of the computation time of the # operator vs. ipairs would go
> away in a poof.

In my opinion, I feel like people talk about mutable types like they
talk about use of 'goto'.  A lot of people advise you to never look in
that direction, and a few instruct you on the appropriate places they
can exist.  Honestly, I'd be happy to continue letting strings be
immutable and pooled by default.

I'd be happier being able to coerce userdata to string without having
to touch C, or treat userdata as a mutable string where no hash is
computed (a transparent memcmp() in string-userdata comparisons --
most of the string library functions don't need to know the hash of
the string anyway, they deal directly with memory).  Again, I say that
mutable strings have their place -- it can be contentious and wasteful
to realloc() and as I said before, there are many places where a
mutable buffer type would be welcome in the network stuff I do.  I
just hate that I have to promote these to full-blown strings to use
the string library functions on them or pass them to other libraries
that don't like userdata.  For me, there is a need.

Immutable types have only become popular because of how cheap memory
is in computers these days and it's easy to orient a program for
sharing data between processes, threads, and through IPC if it is
immutable.  Did we forget that Lua is embedded and caters to embedded
devices?  I see a lot of people reinvent buffer types that could be
userdata and/or luaL_Buffers.  I think of userdata like a black box,
and I wish it were easier to pass around and look inside.  Part of
that is the appeal -- what you can't see is good encapsulation.

Re: Pooling of strings is good

Roberto Ierusalimschy
In reply to this post by Dirk Laurie-2
> Except that userdata is advanced, whereas strings are very,
> very basic indeed, there are languages like SNOBOL that have
> nothing but.

SNOBOL has integers, "real numbers", patterns, arrays, tables, plus
programmer-defined data types...

Maybe you meant Tcl?

-- Roberto

Re: Pooling of strings is good

Dirk Laurie-2
2014-08-22 15:14 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:
>> Except that userdata is advanced, whereas strings are very,
>> very basic indeed, there are languages like SNOBOL that have
>> nothing but.
>
> SNOBOL has integers, "real numbers", patterns, arrays, tables, plus
> programmer-defined data types...
>
> Maybe you meant Tcl?

Sorry, I was relying on memory, which needs to go back more than
40 years in this case. Our IBM 360 Model 40 had just been upgraded
to a Model 67. It was the second-biggest computer in the country.
Only the Weather Bureau had a bigger one. The IBM salesman gave
a big party at his house and handed out cigars like a new father.

Thanks! If nobody had corrected me, I would not now have a working
Snobol4 with a PDF of the Green Book on my computer. I feel as if
someone parked a brand-new Peugeot 403 in my driveway with a card
saying "for you".

Warning: while I play with my new toy, I may be a little quieter than
usual on Lua-L.

Re: Pooling of strings is good

Sean Conner
In reply to this post by Coroutines
It was thus said that the Great Coroutines once stated:

> On Fri, Aug 22, 2014 at 2:27 AM, Dirk Laurie <[hidden email]> wrote:
>
> >
> > Except that userdata is advanced, whereas strings are very,
> > very basic indeed, there are languages like SNOBOL that have
> > nothing but.
> >
> > Think of pooling strings as Lua's way of dispensing with the
> > need to write "const" zillions of times as in C++ (or even C, for
> > that matter). It gives me great comfort that I can write
> >
> >    if line=="rather long string that appears only once"
>
> I am aware of what immutable, pooled strings afford me -- I simply
> argue that it can be useful to have both facilities at your disposal.
> Mutable, potentially duplicated strings and immutable, pooled strings.
>
> I would rather we stop treating 'string' like a distinct type when it
> is really carried on the back of userdata.  I'd much rather see
> type('cat') == 'userdata'.  We have the ability to create subtypes of
> userdata in Lua, we could even manage string pooling directly in a
> lookup table.  In my opinion, it might be a big "win" to drop the
> responsibility of handling pooled, immutable strings in C -- as
> designed by upstream.
>
> I play with networking.  

  So do I [1].  Both for my own stuff, and for work.

> While doing so, it's always been a loss to use Lua's
> standard string type when I want to fetch data from a socket and pass
> it to the user to be worked on easily with the string library or
> third-party libraries.  It's *expensive* to create long, immutable
> strings but the libraries I've come across for unpacking data deal
> mostly with strings and not userdata.  A lot of the time it's come
> down to just passing data from a connection directly through to a
> string parser for human-readable protocols.

  First suggestion:  write code as if your idea already existed.  How
different is it?  From Lua:

        if type(x) == 'userdata' then
          -- what?  Do I have a file?  A socket?
          -- A network address?  An LPeg expression?
        end

  Okay, that suggests to me that type(), as defined, isn't sufficient.
Okay, let's assume we have typeof() then [2] which expects a userdata and
returns the type of userdata.  So now we have:

        if type(x) == 'userdata' then
          if typeof(x) == ... um ... what now?

  An easy answer is the tname argument to luaL_checkudata() (from C).  

        if type(x) == 'userdata' then
          if typeof(x) == 'FILE*' then
            -- we have a file
          elseif typeof(x) == 'string' then
            -- we have Coroutines' string thingy ...
          end
        end

  But now you are exposing what has been an implementation detail.  The
tname just has to be unique, it doesn't have to have any meaning (so I guess
the days of #define MY_TYPE "\200\201\202\203\344\377\300" are over) and we
run headlong into the namespace problem we (potentially) have with modules,
but now with types.

  I'm not shooting down your idea---I'm just pointing out that there are
unintended consequences to design decisions.

  Other unintended consequences---why did I create a new function typeof()?
Why not extend type() to return a second value?  Because there might be
code that expects type() to only return a single value (calling type()
within the definition of an array)?  

  Okay, let's go with what we have so far.  type() and typeof() (which
returns the formerly opaque tname).  And I want to use this new string type.
Right now, my socket API looks like:

        remote_addr,packet_data,error = sock:recv()

        -- remote_addr == userdata of type org.conman.net:addr
        -- packet_data == userdata of Coroutines' string
        -- error       == number

  Okay, I can add a parameter to that:

        remote_addr,packet_data,error = sock:recv(buffer)

  And the C code (because this *has* to be in C, unless I'm doing this with
LuaJIT):

        static int socklua_recv(lua_State *L)
        {
          sock__t *sock   = luaL_checkudata(L,1,TYPE_SOCK);
          lua_String *buf = luaL_checkmutablestring(L,2);
         
          ... um ...
         
        }
       
Hmmm ... this is bringing up an interesting problem---do we need to check
for mutable strings (buffers?) and immutable strings?  What happens to C
code that does:

        const char *name = lua_tostring(L,idx);

Right now, we get back an immutable string (or a number converted to an
immutable string, but that's a different argument).  I suppose it could
check behind the covers and if it gets a mutable string, just return the
internal pointer (as a const char *) so that doesn't have to change.  Okay,
good, but that does lead to another issue---documentation.  Right now, I
just have to say that a Lua function takes a string and it's over.  With
mutable strings, we now need to make a distinction between mutable and
immutable.  Does that mean we have implicit conversion between mutable and
immutable?  Is that an error now?  (or up to the programmer to decide?)
Should we, as a convenience, convert a number to a mutable string, like we
now convert numbers to immutable strings?

  Mutable strings also have two properties---the overall size (space set
aside to hold the mutable string) and the actual usage.  How does space
reallocation work?  And here I'm talking about the C side of things, where I
still do quite a bit of work.  Right now, the code for sock:recv() does:

        char buffer[65536uL];
        bytes = recvfrom(sock->fh,buffer,sizeof(buffer),0,&remaddr->sa,&remsize);
        lua_pushlstring(L,buffer,bytes);

(Okay, an aside---my networking API works more on the packet level than as a
stream.  An IP packet can be up to 65,535 bytes in size (although realistically,
the most you might see is 65,507 bytes of payload (UDP, with 20 bytes of
overhead for the IP header and 8 bytes for the UDP header) or 65,495 bytes
(waaay less likely, but that's 20 bytes for the IP header and 20 bytes for the
TCP header)).  I'm not making any assumptions here about the packet size
(because my network API can also do raw socket operations) so I'm going with
the maximum size IP packet I could receive.)

  Now, I have to check the minimum size, and if it's too low, grow it.  Or I
could just accept it (in which case, the rest of the packet is tossed out by
the kernel) or error out.  But in any case, it's a bit more to think about
now.
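
A minimal sketch of that bookkeeping in C, assuming a hypothetical `mbuf_t` type (nothing like it exists in the Lua C API; `luaL_checkmutablestring()` above is equally imaginary). It keeps the space set aside (`cap`) separate from the bytes in use (`len`) and grows only when the caller asks for more room than is available:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mutable buffer: 'cap' is the space set aside,
   'len' is how much of it currently holds data. */
typedef struct {
  char  *data;
  size_t cap;
  size_t len;
} mbuf_t;

/* Make sure at least 'need' bytes are available.
   Returns 0 on success, -1 if the buffer could not be grown. */
int mbuf_reserve(mbuf_t *b, size_t need)
{
  assert(b != NULL);

  if (b->cap >= need)
    return 0;                       /* big enough: reuse it as-is */

  char *p = realloc(b->data, need);
  if (p == NULL)
    return -1;                      /* old contents are still valid */

  b->data = p;
  b->cap  = need;
  return 0;
}
```

A receive function would call `mbuf_reserve(buf, 65536)` before `recvfrom()` and then set `buf->len` to the number of bytes actually read; whether a failed grow should truncate the packet or raise an error is still the design question raised above.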

  Okay, where was I?  

> Sorry if this is a bit confused, I have a lot of ideas in different
> directions :\  At minimum I wish more of the standard library were
> written in pure Lua, so things like string.match() not accepting
> patterns with NULs in them can't happen again.  

  Okay, nothing stopping anybody from doing that.  But I suspect only a
small subset of the standard Lua library could be done in pure Lua.  

> PS: Strings are userdata :(  I don't want to treat them like a special
> type when they *are* just userdata with added magic.  We could do it
> in Lua.  userdata used for a buffer type would be much more applicable
> ~everywhere~, I feel like I have to turn to C to make the best of it
> and that frustrates me.  I like working only in Lua if I can.

  -spc (By the way, how much C programming do you do?)

[1] https://github.com/spc476/lua-conmanorg/blob/master/src/net.c


Re: Pooling of strings is good

William Ahern
In reply to this post by Axel Kittenberger
On Fri, Aug 22, 2014 at 01:41:02PM +0200, Axel Kittenberger wrote:

> Strings are a native thing, since there are string literals in source code.
> I don't think anyone wants string literals to go. That is what makes them
> different from generic userdata. There is more to it due to automatic
> coercions and the '..' operator, but that's another story.
>
> Mutables are bad. More mutables are worse. If I were to see a language
> develop, it would be more in the direction of immutables, in the direction
> of deeply frozen tables. Although it may seem counterintuitive at first, so
> many problems and issues go away if one uses immutables. For example, the
> whole issue of the computation time of the # operator vs. ipairs would go
> away in a poof.

Or the Lua VM could just memoize the computation of __len. Which would look
almost exactly the same under the hood.
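
As a sketch of what such memoization could look like, in C and with a toy fixed-size sequence rather than a real Lua table (`seq_t`, `seq_set` and `seq_len` are hypothetical names, not part of Lua): the length is computed by scanning once, cached, and invalidated by any write.

```c
#include <assert.h>
#include <stddef.h>

#define SLOTS 16   /* toy fixed size, for illustration only */

typedef struct {
  int    present[SLOTS];  /* 1 if slot i holds a value, 0 if it is "nil" */
  size_t cached_len;      /* last computed length */
  int    len_valid;       /* 1 while cached_len can be trusted */
} seq_t;

/* Any write may change the length, so it drops the cached value. */
void seq_set(seq_t *s, size_t i, int present)
{
  assert(s != NULL && i < SLOTS);
  s->present[i] = present;
  s->len_valid  = 0;
}

/* Count leading non-nil slots, but only when the cache is stale. */
size_t seq_len(seq_t *s)
{
  assert(s != NULL);
  if (!s->len_valid) {
    size_t n = 0;
    while (n < SLOTS && s->present[n])
      n++;
    s->cached_len = n;
    s->len_valid  = 1;
  }
  return s->cached_len;
}
```

Repeated calls to `seq_len()` between writes cost a flag check; the scan is repeated only after a `seq_set()`.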

I agree mutables cause trouble. But do they cause more trouble than not
having mutables? I mean, why are we all coding in Lua rather than Haskell?


Re: Pooling of strings is good

Coroutines
In reply to this post by Sean Conner
On Fri, Aug 22, 2014 at 2:13 PM, Sean Conner <[hidden email]> wrote:

>> I play with networking.

So then you know what a pain it is to not be able to represent a
simple ring buffer, or provide DMA access to the NIC's ring buffer
hosted through userdata to Lua -- without having to copy it out and
promote it to a string?  I wish userdata could be used in the same
place a string is expected.  I would love an "invisible" memcmp() if
it's a userdata-string comparison (==) or to be able to directly use
the string.*() functions on them.  Strings are sealed userdata -- if
you know you need a fixed buffer whose contents are changed quite
often then I wish userdata could conform to this use.  Generally
duplicating them into a string is okay because we don't handle very
large data, but lots of frequent reallocation frustrates me.  This
stuff doesn't need to be in Lua's string pool for memory or comparison
efficiency :p  All I really need is read access to it -- and I don't
want to have to rewrite what's in the string library for userdata.

>   But now you are exposing what has been an implementation detail.  The
> tname just has to be unique, it doesn't have to have any meaning (so I guess
> the days of #define MY_TYPE "\200\201\202\203\344\377\300" are over) and we
> run headlong into the namespace problem we (potentially) have with modules,
> but now with types.

I think -- like module names -- people will have to put some thought
into their typenames.  I would not expect people to pull in libraries
introducing a lot of userdata types anyway, I feel like it would be a
small concern.

>   Other unintended consequences---why did I create a new function typeof()?
> Why not extend type() to return a second value?  Because there might be
> code that expects type() to only return a single value (calling type()
> within the definition of an array)?

Disagreement here, I would avoid a typeof() and just add the 2nd
return to type() -- or have a type() and rawtype().  Doesn't really
matter ~

> Hmmm ... this is bringing up an interesting problem---do we need to check
> for mutable strings (buffers?) and immutable strings?  What happens to C
> code that does:
>
>         const char *name = lua_tostring(L,idx);

I imagine nothing changes, as you are referencing const char.. :p

>         char buffer[65536uL];
>         bytes = recvfrom(sock->fh,buffer,sizeof(buffer),0,&remaddr->sa,&remsize);
>         lua_pushlstring(L,buffer,bytes);

The only problem I have with this is while I know of no foolproof way
to multithread an embedded Lua process, it seems like a bad idea to
have a "shared" static buffer in a library? :>

>   Okay, nothing stopping anybody from doing that.  But I suspect only a
> small subset of the standard Lua library could be done in pure Lua.

Mostly I just thought it would be clearer to see how args are accepted
and parsed and it would be easier to notice issues/bugs.  The other
bonus would be it'd be easier for upstream and the community to
prototype new functions to add to the string or table or whatever
library.  The standard libraries themselves could be released as less
of a 'standard' and more of a 'go add to it' example.  I think Lua
would grow more quickly because of it -- and in ways the community
could regard and come to a consensus on?

>
>> PS: Strings are userdata :(  I don't want to treat them like a special
>> type when they *are* just userdata with added magic.  We could do it
>> in Lua.  userdata used for a buffer type would be much more applicable
>> ~everywhere~, I feel like I have to turn to C to make the best of it
>> and that frustrates me.  I like working only in Lua if I can.
>
>   -spc (By the way, how much C programming do you do?)

I'd like to think I'm fairly knowledgeable of C -- I've been at it for
7 years now but that is in no way a good measurement of if someone is
"capable" in C.  I mostly use it for making bindings today if I can
prototype in Lua or JS faster.  I've made bindings for BSD sockets,
libev, and libevent in Lua.  I've been toying with nodejs in
Javascript so there is no real need...  Before all this I used to
write code for a certain IRC daemon.

Re: Pooling of strings is good

Sean Conner
It was thus said that the Great Coroutines once stated:
> On Fri, Aug 22, 2014 at 2:13 PM, Sean Conner <[hidden email]> wrote:
>
> >> I play with networking.
>
> So then you know what a pain it is to not be able to represent a
> simple ring buffer, or provide DMA access to the NIC's ring buffer
> hosted through userdata to Lua -- without having to copy it out and
> promote it to a string?  

  It doesn't bother me, because the convenience it affords is worth the
price (in my opinion).  

> I wish userdata could be used in the same
> place a string is expected.  I would love an "invisible" memcmp() if
> it's a userdata-string comparison (==) or to be able to directly use
> the string.*() functions on them.

  I swear, the way you are talking, it's as if you equate "userdata" with
"Lua's internal string representation" and it jars me every time.  I have
plenty of userdata types that have nothing to do with strings; where
"userdata == string" just does not make any semantic sense what-so-ever [1].

  As an example, some of the datafiles we use at work are quite large [2].
Because of this (and an internal binary structure) I mmap() the files into
memory and wrap a userdata around it, providing a __index method to return
records based on ordinal position or based on a key (the files are
specialized key/value stores) [3][4].  The userdata here represents a huge
amount of data, not just one "thing".

  Another userdata is the wrapper for iconv.  Heck, the header file,
iconv.h, doesn't even define the structure:

        typedef void *iconv_t;

so again, I'm at a loss as to what "userdata == string" even means in that
case.

> Strings are sealed userdata -- if
> you know you need a fixed buffer whose contents are changed quite
> often then I wish userdata could conform to this use.  Generally
> duplicating them into a string is okay because we don't handle very
> large data, but lots of frequent reallocation frustrates me.  This
> stuff doesn't need to be in Lua's string pool for memory or comparison
> efficiency :p  All I really need is read access to it -- and I don't
> want to have to rewrite what's in the string library for userdata.

  So really, what you are after is a buffer type userdata that can be used,
invisibly, as an immutable string.  You could hack Lua to do that, as a
proof-of-concept, get some feedback, do some benchmarking and possibly get
enough users to convince PUC to officially add it.

> >   But now you are exposing what has been an implementation detail.  The
> > tname just has to be unique, it doesn't have to have any meaning (so I guess
> > the days of #define MY_TYPE "\200\201\202\203\344\377\300" are over) and we
> > run headlong into the namespace problem we (potentially) have with modules,
> > but now with types.
>
> I think -- like module names -- people will have to put some thought
> into their typenames.  I would not expect people to pull in libraries
> introducing a lot of userdata types anyway, I feel like it would be a
> small concern.

  I use 22 userdata types at work (just counted).  It's not a small concern.

> >   Other unintended consequences---why did I create a new function typeof()?
> > Why not extend type() to return a second value?  Because there might be
> > code that expects type() to only return a single value (calling type()
> > within the definition of an array)?
>
> Disagreement here, I would avoid a typeof() and just add the 2nd
> return to type() -- or have a type() and rawtype().  Doesn't really
> matter ~

  If it doesn't matter, why not typeof()?  

> > Hmmm ... this is bringing up an interesting problem---do we need to check
> > for mutable strings (buffers?) and immutable strings?  What happens to C
> > code that does:
> >
> >         const char *name = lua_tostring(L,idx);
>
> I imagine nothing changes, as you are referencing const char.. :p
>
> >         char buffer[65536uL];
> >         bytes = recvfrom(sock->fh,buffer,sizeof(buffer),0,&remaddr->sa,&remsize);
> >         lua_pushlstring(L,buffer,bytes);
>
> The only problem I have with this is while I know of no foolproof way
> to multithread an embedded Lua process, it seems like a bad idea to
> have a "shared" static buffer in a library? :>

  I never said it was static.  That "buffer" there is an auto---declared on
the stack of the socklua_recv() function [5].  No static buffer here.

  And the way I handle multithreaded process with Lua embedded is to give
each operating system thread its own Lua state.  Yeah, I could provide the
lua_lock() and lua_unlock() implementations, but I'd rather avoid the loss
of speed that a global interpreter lock imposes [6].

> >   Okay, nothing stopping anybody from doing that.  But I suspect only a
> > small subset of the standard Lua library could be done in pure Lua.
>
> Mostly I just thought it would be clearer to see how args are accepted
> and parsed and it would be easier to notice issues/bugs.  The other
> bonus would be it'd be easier for upstream and the community to
> prototype new functions to add to the string or table or whatever
> library.  The standard libraries themselves could be released as less
> of a 'standard' and more of a 'go add to it' example.  I think Lua
> would grow more quickly because of it -- and in ways the community
> could regard and come to a consensus on?

  You could always try that approach.



[1] That's not to say I don't have a __tostring() method attached to
        them---I do.  But mostly for debugging purposes, not for any real use.

[2] Relatively speaking.  I mean, 100,000,000 records does take up some
        space.

[3] In this case, when you do an index reference, you get back a plain
        Lua table with the data converted to more natural data types.  For
        instance, one such file maps phone numbers (the key) to
        functionality (this phone number can do A, B and C).  When
        referenced:

                x = mapping['5555551212']

        x is a Lua table:

                {
                  number = '5555551212',
                  A = true,
                  B = true,
                  C = false
                }

        The reason I include the number is because you can also reference it
        like:

                x = mapping[456]

        The number is stored in a compressed binary format (to save space)
        and the feature indicators are individual bits.  

[4] Why not a real database, or some other key/value store?  Database
        engines weren't fast enough (we're literally in the "call chain" of
        a phone call) and the storage medium is ... interesting in the
        Chinese definition of "interesting".

[5] https://github.com/spc476/lua-conmanorg/blob/b7f2414391375c4d872d217fb9b297695a1dac11/src/net.c#L1471

[6] I'm looking at you, Python.
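
Footnote [3] above describes feature indicators stored as individual bits and expanded into a table of booleans on lookup. A minimal C sketch of that expansion step follows. The bit positions, the struct, and the function name are all assumptions made for illustration; the thread does not give the real record layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments; the actual layout is not given in the thread. */
enum {
  FEATURE_A = 1u << 0,
  FEATURE_B = 1u << 1,
  FEATURE_C = 1u << 2
};

/* Unpacked form, mirroring the A/B/C fields of the Lua table in footnote [3]. */
struct features {
  bool A;
  bool B;
  bool C;
};

/* Expand one packed flag byte into individual booleans. */
void unpack_features(uint8_t flags, struct features *out)
{
  assert(out != NULL);
  out->A = (flags & FEATURE_A) != 0;
  out->B = (flags & FEATURE_B) != 0;
  out->C = (flags & FEATURE_C) != 0;
}
```

Under these assumed bit positions, a flag byte of 0x03 unpacks to A and B set and C clear, which matches the example table in the footnote.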


Re: Pooling of strings is good

Philipp Janda
In reply to this post by Coroutines
On 23.08.2014 at 07:59, Coroutines wrote:
> On Fri, Aug 22, 2014 at 2:13 PM, Sean Conner <[hidden email]> wrote:
>
>>    But now you are exposing what has been an implementation detail.  The
>> tname just has to be unique, it doesn't have to have any meaning (so I guess
>> the days of #define MY_TYPE "\200\201\202\203\344\377\300" are over) and we
>> run headlong into the namespace problem we (potentially) have with modules,
>> but now with types.

Lua 5.3 sometimes uses the type names passed to `luaL_newmetatable` in
error messages, so they'd better be printable.

>
> I think -- like module names -- people will have to put some thought
> into their typenames.  I would not expect people to pull in libraries
> introducing a lot of userdata types anyway, I feel like it would be a
> small concern.

Depends on the C library (and the person who did the binding): I have an
incomplete binding to APR (just APR, no APR-Util, no threading, no
network io) with 15 different userdata types, and a very incomplete
binding to SDL with 17 different userdata types. But I tend to use
userdata for some enums/flags for increased type safety ...

Philipp





Re: Pooling of strings is good

Philipp Janda
On 23.08.2014 at 09:14, Philipp Janda wrote:
>
> Depends on the C library (and the person who did the binding): I have an
> incomplete binding to APR (just APR, no APR-Util, no threading, no
> network io) with 15 different userdata types, and a very incomplete
> binding to SDL with 17 different userdata types.

Make that 27 userdata types for SDL, the different events were hidden
behind a macro.

Philipp




Re: Pooling of strings is good

Coroutines
In reply to this post by Sean Conner
On Sat, Aug 23, 2014 at 12:11 AM, Sean Conner <[hidden email]> wrote:

>   It doesn't bother me, because the convenience it affords is worth the
> price (in my opinion).

What convenience?  I'm saying the interface doesn't exist to make
mutable data (userdata) as accessible as immutable, pooled strings.

>   I swear, the way you are talking, it's as if you equate "userdata" with
> "Lua's internal string representation" and it jars me every time.  I have
> plenty of userdata types that have nothing to do with strings; where
> "userdata == string" just does not make any semantic sense whatsoever [1].

I should word this more carefully: I mean to say that userdata and
strings (the internal structures and how they are handled in Lua) are
very similar.

http://www.lua.org/source/5.2/lobject.h.html#TString

Both are basically: header + data, the size of the data that follows
is known in the header.  Strings have a hash, userdata have
environment tables -- but otherwise they are very similar.  It would
be trivial to walk the contents of either as you know where it starts
and where it ends.  In Lua you can walk/traverse strings, but you
cannot do the same with userdata.  Often it is "simpler" to just
dump/promote the userdata to a string.  I will correct myself here: I
thought luaL_Buffer was a wrapper around userdata, because it would
have made sense to me to reuse data structures and the functions
defined for them.

It is not: http://www.lua.org/source/5.2/lauxlib.h.html#luaL_Buffer

> so again, I'm at a loss as to what "userdata == string" even means in that
> case.

I'm saying I want to be able to write 'cat' == io.stdout and have it
do a memcmp() of their contents (the data section after the header).
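
A rough C sketch of the comparison described here, assuming both objects reduce to a length plus payload bytes. The `struct blob` type and `blob_equal` are hypothetical stand-ins and are not Lua's actual TString or Udata layout (where the payload follows the header in memory instead of being reached through a pointer).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical "header + data" view of an object; not Lua's real structs. */
struct blob {
  size_t len;                 /* header: number of payload bytes */
  const unsigned char *data;  /* payload (simplified to a pointer here) */
};

/* Returns 1 when both payloads have the same length and identical bytes. */
int blob_equal(const struct blob *a, const struct blob *b)
{
  assert(a != NULL && b != NULL);
  if (a->len != b->len)
    return 0;
  return a->len == 0 || memcmp(a->data, b->data, a->len) == 0;
}
```

The length check comes first because memcmp() only compares a fixed number of bytes; two payloads of different sizes can never be equal, and comparing past the shorter one would read out of bounds.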

>   So really, what you are after is a buffer type userdata that can be used,
> invisibly, as an immutable string.  You could hack Lua to do that, as a
> proof-of-concept, get some feedback, do some benchmarking and possibly get
> enough users to convince PUC to officially add it.

I'm not necessarily asking for a whole new defined mutable type.  I
just want the ability to use userdata where a string would be
expected.  The mutable type is already available: userdata

All I'm thinking about is things like this:

string.sub(io.stdout, 1) -- dumping the luaL_Stream struct into a Lua string

Or this:

debug.getregistry()['FILE*'].__index = function (self, key)
  return self[key] or string.sub(self, key)
end
-- and then you could:
-- for i = 1, #io.stdout do local b = io.stdout[i] ... end

(You could traverse userdata byte-by-byte like a string)

I'd love to just be able to read the contents of a userdata from Lua
:\  I realize this breaks encapsulation, so maybe there'd have to be
some switch/option in the debug library to use the string functions on
userdata (to allow or not).

>   I use 22 userdata types at work (just counted).  It's not a small concern.

I mean to say -- how many of those are you defining personally?  How
many are brought in from third parties?  I don't think you'd have much
trouble deciding the names for each...

Related: It would be nice to be able to change the typename at runtime
(even so that checkudata passes) -- without having to recompile the
third-party library that defines the userdata type.

>   I never said it was static.  That "buffer" there is an auto---declared on
> the stack of the socklua_recv() function [5].  No static buffer here.

My bad, I thought I read static :\  Stack overflows are fun... :p  I
was just thinking offhandedly that you could create a mmap, expose it
to Lua as a file to read and write from -- and pass it to the
networking lib as the buffer to recv into.  Hmm.  And it could be
shared depending on the threading setup -- but that's an afterthought.

>   And the way I handle multithreaded process with Lua embedded is to give
> each operating system thread its own Lua state.  Yeah, I could provide the
> lua_lock() and lua_unlock() implementations, but I'd rather avoid the loss
> of speed that a global interpreter lock imposes [6].

I really wish I could just require() in the libs I want accessible to
"ALL TEH THREADS" and share that Lua state.  Locking is just too much
of a penalty...

On another side-topic: It makes me really unhappy that
lua_lock/unlock() are macros and not dummy functions.  This is Lua's
GIL :(  You have to recompile Lua to make use of them when I'd rather
LD_PRELOAD off of the Lua my distro distributed...  c'est la vie.
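
To illustrate the macro-versus-function point: Lua's core defines lua_lock/lua_unlock as no-op macros unless the embedder defines them at build time, so the calls disappear during compilation and leave no symbol for a preloaded library to replace. The sketch below mimics that pattern with hypothetical names (`demo_lock`, `demo_lock_fn`, `struct state`); it is not Lua's source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for lua_State, tracking lock nesting. */
struct state {
  int lock_depth;
};

/* Build-time hook: a no-op unless defined before the core is compiled.
   Once compiled this way, nothing remains that could be intercepted. */
#ifndef demo_lock
#define demo_lock(S)   ((void)0)
#define demo_unlock(S) ((void)0)
#endif

/* Link-time alternative: ordinary functions. In a shared library these
   would be symbols that a preloaded library could override. */
void demo_lock_fn(struct state *s)
{
  assert(s != NULL);
  s->lock_depth++;
}

void demo_unlock_fn(struct state *s)
{
  assert(s != NULL);
  s->lock_depth--;
}
```

The trade-off is that the macro form costs nothing when locking is unused, while the function form pays a call on every API entry in exchange for being replaceable without a rebuild.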


Re: Pooling of strings is good

Coroutines
In reply to this post by Philipp Janda
On Sat, Aug 23, 2014 at 12:28 AM, Philipp Janda <[hidden email]> wrote:

> On 23.08.2014 at 09:14, Philipp Janda wrote:
>>
>>
>> Depends on the C library (and the person who did the binding): I have an
>> incomplete binding to APR (just APR, no APR-Util, no threading, no
>> network io) with 15 different userdata types, and a very incomplete
>> binding to SDL with 17 different userdata types.
>
>
> Make that 27 userdata types for SDL, the different events were hidden behind
> a macro.

I know this isn't enforced and people are free to do what they may; in
some projects I prefix the userdata type with the library name (if
it's short).  I find it to be a good practice in case my library is a
third-party to someone else's project.
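
A small sketch of that prefixing convention in C, assuming a library called "mylib" (a made-up name, as are the two type names and the helper function). The resulting strings are the kind of tname one would hand to luaL_newmetatable and luaL_checkudata.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical library prefix; adjacent string literals are concatenated
   by the compiler, so each type name is built at compile time. */
#define MYLIB_PREFIX      "mylib."
#define MYLIB_TYPE_SOCKET MYLIB_PREFIX "socket"
#define MYLIB_TYPE_ADDR   MYLIB_PREFIX "addr"

/* Returns 1 if tname falls inside this library's namespace. */
int mylib_owns_type(const char *tname)
{
  assert(tname != NULL);
  return strncmp(tname, MYLIB_PREFIX, sizeof(MYLIB_PREFIX) - 1) == 0;
}
```

Keeping the prefix in one macro means a name clash with another library can be resolved by changing a single line and rebuilding.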
