Suggestion : Use unique string type instead of two (short and long)

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
Lua has unique string type before 5.2.1 , all the strings is interning
for fast comparison .

Lua 5.2.1 add a new internal long string type because of hash DoS
attack , but short string is still interning. I guess the reason is
the performance of string comparison is very important to lua , string
interning can reduce the string comparison from O(n) to O(1).

I think if we can find another way to compare strings with O(1) , we
can use only one string type without interning.

We can add an unique 64bit id into TString struct , every time we push
a string into lua VM , allocate a new id to it . When we need compare
two strings, compare the id first. If id are the same, they are
equality, otherwise we compare the hash. If hash is also the same, use
memcmp.  And if the strings are equality, we can change the lower id
to the higher one.

For example, if we have two string A and B with the same value "Hello"
. At first , they have different id A.id is 1 and B.id is 2 . After
memcmp, we found A == B , so we can change A.id = 2 . Next time, when
we need to compare A and B, we can just compare A.id and B.id , they
are the same, so we don't need memcmp .

I think it can simplify the implementation of lua, and we can share
the prototype of functions and const tables in multiple lua vm (with
minor modification).

--
http://blog.codingnow.com

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Matthew Wild
On Mon, 17 Jun 2019 at 12:57, 云风 Cloud Wu <[hidden email]> wrote:
>
> Lua has unique string type before 5.2.1 , all the strings is interning
> for fast comparison .
>
> Lua 5.2.1 add a new internal long string type because of hash DoS
> attack , but short string is still interning. I guess the reason is
> the performance of string comparison is very important to lua , string
> interning can reduce the string comparison from O(n) to O(1).

There is also a benefit in reduced RAM usage (in some applications).
But importantly a string's hash is also used for table lookups, which
is quite a key part of Lua.

Regards,
Matthew

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu


>> 在 2019年6月17日,21:40,Matthew Wild <[hidden email]> 写道:
>>
>> On Mon, 17 Jun 2019 at 12:57, 云风 Cloud Wu <[hidden email]> wrote:
>>
>> Lua has unique string type before 5.2.1 , all the strings is interning
>> for fast comparison .
>>
>> Lua 5.2.1 add a new internal long string type because of hash DoS
>> attack , but short string is still interning. I guess the reason is
>> the performance of string comparison is very important to lua , string
>> interning can reduce the string comparison from O(n) to O(1).
>
> There is also a benefit in reduced RAM usage (in some applications).
> But importantly a string's hash is also used for table lookups, which
> is quite a key part of Lua.
>
> Regards,
> Matthew

The hash is remained, but can be lazy calculated like long string now.

To reduce memory usage , we can do string interning in parser stage (it’s the main source of the string object) to remove the same strings. And we can also use a cache like lua_pushstring now to avoid push the same string or combine the same string during gc .
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Coda Highland
In reply to this post by Matthew Wild


On Mon, Jun 17, 2019 at 8:40 AM Matthew Wild <[hidden email]> wrote:
On Mon, 17 Jun 2019 at 12:57, 云风 Cloud Wu <[hidden email]> wrote:
>
> Lua has unique string type before 5.2.1 , all the strings is interning
> for fast comparison .
>
> Lua 5.2.1 add a new internal long string type because of hash DoS
> attack , but short string is still interning. I guess the reason is
> the performance of string comparison is very important to lua , string
> interning can reduce the string comparison from O(n) to O(1).

There is also a benefit in reduced RAM usage (in some applications).
But importantly a string's hash is also used for table lookups, which
is quite a key part of Lua.

Regards,
Matthew


No pun intended? XD

The reduced RAM usage is more widespread than you might imagine. Consider Zipf's Law, which observes that the most common words in a data set are WAY more common than the least common ones. The Pareto principle lets us approximate it as "80% of the words in a piece of text come from 20% of the vocabulary." This means something as simple as splitting a string will benefit from short string interning.

On Mon, Jun 17, 2019 at 8:50 AM 云风 <[hidden email]> wrote:

The hash is remained, but can be lazy calculated like long string now.

To reduce memory usage , we can do string interning in parser stage (it’s the main source of the string object) to remove the same strings. And we can also use a cache like lua_pushstring now to avoid push the same string or combine the same string during gc .

I disagree that the parser stage is going to be the main source of string objects in general. Almost any program that has to read data from a file is going to use a lot of small strings. (It is, of course, possible to do it with a single long string and then only process it using numeric data types, but that technique only makes sense for packed binary data.)

/s/ Adam
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu


>
> I disagree that the parser stage is going to be the main source of string objects in general. Almost any program that has to read data from a file is going to use a lot of small strings. (It is, of course, possible to do it with a single long string and then only process it using numeric data types, but that technique only makes sense for packed binary data.)
>

A cache in GC can merge most of the same strings. When we mark a string to black , put it into the cache or merge to the previous one.

The cache can be fixed size, so it can immune the hash dos attack .
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Roberto Ierusalimschy
In reply to this post by 云风 Cloud Wu
> Lua has unique string type before 5.2.1 , all the strings is interning
> for fast comparison .
>
> [...]
> attack , but short string is still interning. I guess the reason is
> the performance of string comparison is very important to lua , string
> interning can reduce the string comparison from O(n) to O(1).
>
> I think if we can find another way to compare strings with O(1) , we
> can use only one string type without interning.
>
> We can add an unique 64bit id into TString struct , every time we push
> a string into lua VM , allocate a new id to it . When we need compare
> two strings, compare the id first. If id are the same, they are
> equality, otherwise we compare the hash. If hash is also the same, use
> memcmp.  And if the strings are equality, we can change the lower id
> to the higher one.

It is a nice idea. One drawback to keep in mind is the addition of eight
bytes to the memory overhead of all strings. Another is that Lua does
not demand platforms to have a 64-bit integer type.

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
In reply to this post by Coda Highland
Another small benefit is lua_getfield can be faster. And it never raise error if the table has no metatable.

So we can use a lua table as a concurrent lookup table without lock (use one coroutine per work thread for the result).

发自我的 iPhone

在 2019年6月17日,22:01,Coda Highland <[hidden email]> 写道:



On Mon, Jun 17, 2019 at 8:40 AM Matthew Wild <[hidden email]> wrote:
On Mon, 17 Jun 2019 at 12:57, 云风 Cloud Wu <[hidden email]> wrote:
>
> Lua has unique string type before 5.2.1 , all the strings is interning
> for fast comparison .
>
> Lua 5.2.1 add a new internal long string type because of hash DoS
> attack , but short string is still interning. I guess the reason is
> the performance of string comparison is very important to lua , string
> interning can reduce the string comparison from O(n) to O(1).

There is also a benefit in reduced RAM usage (in some applications).
But importantly a string's hash is also used for table lookups, which
is quite a key part of Lua.

Regards,
Matthew


No pun intended? XD

The reduced RAM usage is more widespread than you might imagine. Consider Zipf's Law, which observes that the most common words in a data set are WAY more common than the least common ones. The Pareto principle lets us approximate it as "80% of the words in a piece of text come from 20% of the vocabulary." This means something as simple as splitting a string will benefit from short string interning.

On Mon, Jun 17, 2019 at 8:50 AM 云风 <[hidden email]> wrote:

The hash is remained, but can be lazy calculated like long string now.

To reduce memory usage , we can do string interning in parser stage (it’s the main source of the string object) to remove the same strings. And we can also use a cache like lua_pushstring now to avoid push the same string or combine the same string during gc .

I disagree that the parser stage is going to be the main source of string objects in general. Almost any program that has to read data from a file is going to use a lot of small strings. (It is, of course, possible to do it with a single long string and then only process it using numeric data types, but that technique only makes sense for packed binary data.)

/s/ Adam
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
In reply to this post by Roberto Ierusalimschy


在 2019年6月17日,22:37,Roberto Ierusalimschy <[hidden email]> 写道:

>> Lua has unique string type before 5.2.1 , all the strings is interning
>> for fast comparison .
>>
>> [...]
>> attack , but short string is still interning. I guess the reason is
>> the performance of string comparison is very important to lua , string
>> interning can reduce the string comparison from O(n) to O(1).
>>
>> I think if we can find another way to compare strings with O(1) , we
>> can use only one string type without interning.
>>
>> We can add an unique 64bit id into TString struct , every time we push
>> a string into lua VM , allocate a new id to it . When we need compare
>> two strings, compare the id first. If id are the same, they are
>> equality, otherwise we compare the hash. If hash is also the same, use
>> memcmp.  And if the strings are equality, we can change the lower id
>> to the higher one.
>
> It is a nice idea. One drawback to keep in mind is the addition of eight
> bytes to the memory overhead of all strings. Another is that Lua does
> not demand platforms to have a 64-bit integer type.
>
> -- Roberto
>

If we don’t need string interning, the next pointer in TString is useless, so I think maybe no more overhead.
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
In reply to this post by Roberto Ierusalimschy


在 2019年6月17日,22:37,Roberto Ierusalimschy <[hidden email]> 写道:


>> Another is that Lua does
> not demand platforms to have a 64-bit integer type.
>
>

Maybe a 32bit id is enough, because we can rearrange the id during the gc process. It’s a little complicated, but I think it’s possible.
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Philippe Verdy
In reply to this post by 云风 Cloud Wu
One drawback, each time we use caching technics, is that we can get more exposure to time-based side-channel attacks.
I discussed recently with another use the possibility of using Lua's collision lists in its keyed tables if we did not care about using a striong enough hashing function that correctly flattens the distribution of use of hash buckets.

For short strings, this is good enough because all texts will still be hashed. But with long texts, hashing can become significantly attackable.
So a good approch would be to hash only the beginning of long texts and then make sure that the resulting distribution in hash buckets is still within acceptable limits (if not we can select another short segment in the text: the start of these fragments would be determined by the total string length, based on modular arithmetics which allows creating an easy reproducible sequence of start indexes, for the same length, from which the hash is computed).

And to protect from time-based attacks, of course the hashing function should be changed regularly (forcing keys to be rehashed, but as they would affect the whole table, of course reindexing all the long text would have a huge impact, that's why indexing only a fragment would be very useful).

Beside that, I don't think we should cache the full comparison of texts when performing lookups in the hashtable, as here also we can use the reproductible pseudo-random sequence of start indexes to compare only fragments (for fast elimination of non-matching keys) until the whole text has been compared (in a pseudo-random order of fragments).

Note that this applies independantly of the choice of the hashing function (it could be simple like Knuth's modular hashing using a multiplication by large enough prime, or cryptographically string like SHA1 or even SHA2 but with lower performance): even when using a cryptographically string hashing, the performance effect of long texts would be large enough to be measurable and attackable.

As a general rule however, secure data should not have very variable sizes that are identifiable in time-based attacks. So hashhing short items like user names, file names or passwords/passphrases should not be impacted by this risk as they should have almost all the same length (and because they are short, they can be easily padded before hashing them, so that they will all need roughly the same time to be hashed).

Hashing large contents however should be performed "offline" (e.g. when hashing file contents) before there's any attempt to exchange that content or fragments of that content, and before allowing others to detect that you access a specific largefile (which wil lthen only be known externally by its identifier: the precomputed hash value if the file content is readonly, or its short "filename", independantly of its content size. This will still allow using Lua's tables to index not the file contents themselves, but their short identifier (and a 512-bit identifier is long enough, if it is generated by a cryptographic hash function like SHA2, and that does not need any additional complex rehashing in Lua tables due to its existing properties).

If Lua introduces a "longstring" type, it MUST not be usable as a valid type for keys in table: you'll have to use alternate keys (the unique identifier of its content or URL/path generated externally). Lua should not even force using them in memory, they will be viewed just like externally stored "blobs", accessed by some I/O API when needed, but not kept permanently in memory. As well the caches used by this I/O API should be carefully designed to be immune to time-based side-channel attacks.


Le lun. 17 juin 2019 à 16:19, 云风 <[hidden email]> a écrit :


>
> I disagree that the parser stage is going to be the main source of string objects in general. Almost any program that has to read data from a file is going to use a lot of small strings. (It is, of course, possible to do it with a single long string and then only process it using numeric data types, but that technique only makes sense for packed binary data.)
>

A cache in GC can merge most of the same strings. When we mark a string to black , put it into the cache or merge to the previous one.

The cache can be fixed size, so it can immune the hash dos attack .
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Roberto Ierusalimschy
In reply to this post by 云风 Cloud Wu
> If we don’t need string interning, the next pointer in TString is
> useless, so I think maybe no more overhead.

If we will have only one kind of string, we cannot store its length in a
byte. So, we need that field for the length.

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Roberto Ierusalimschy
In reply to this post by 云风 Cloud Wu
> Maybe a 32bit id is enough, because we can rearrange the id during the gc process. It’s a little complicated, but I think it’s possible.

Can you be more specific? It is easy for the GC to set a watermark
(e.g., to keep the highest/lower id still in use), but that does
not guarantee anything. We can also renumber all strings, paying
the price for a little overhead in the first comparisons after each
GC cycle.

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu


在 2019年6月17日,23:41,Roberto Ierusalimschy <[hidden email]> 写道:

>> Maybe a 32bit id is enough, because we can rearrange the id during the gc process. It’s a little complicated, but I think it’s possible.
>
> Can you be more specific? It is easy for the GC to set a watermark
> (e.g., to keep the highest/lower id still in use), but that does
> not guarantee anything. We can also renumber all strings, paying
> the price for a little overhead in the first comparisons after each
> GC cycle.

My idea is use a fixed size hash cache. For example , there are 10 slots .

During the mark stage, we put the highest id and lowest id into the slots.

For example

high: 80 81 92 93 74 95 86 97 98 89
low:   00 11 02 03 14 05 16 27 28 09

It means the highest id we used is 89-98, and 01, 04, 06-08 can be reused. We can always find the highest id we still used, and a few unused id .

And then, we can remap 92, 93, 95, 97, 98 to 01 04 06 07 08 during the sweep stage.

So that , we can recycle a few id for each gc cycle.

For the length, we can still store 1 byte for small string length, if the string longer than 255 bytes, store 0xff and an additional size_t.


Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

Roberto Ierusalimschy
> My idea is use a fixed size hash cache. For example , there are 10 slots .
>
> During the mark stage, we put the highest id and lowest id into the slots.
>
> For example
>
> high: 80 81 92 93 74 95 86 97 98 89
> low:   00 11 02 03 14 05 16 27 28 09
>
> It means the highest id we used is 89-98, and 01, 04, 06-08 can be reused. We can always find the highest id we still used, and a few unused id .
>
> So that , we can recycle a few id for each gc cycle.

I don't think recycling "a few" ids each gc cycle is enough to control
the ids growing for ever. After these few are reused, we need more
new ids for new strings. We cannot run a new GC cycle every time a few
new strings are created.


> And then, we can remap 92, 93, 95, 97, 98 to 01 04 06 07 08 during the sweep stage.

I don't think we can do that. We remap one string from 92 to 01, but
there may be other equal strings also numbered 92, which were not
changed yet.  (Remember that the collector is incremental.)  Then, we
compare this first string with one still with the original number, and it
will go back to 92.


> For the length, we can still store 1 byte for small string length, if the string longer than 255 bytes, store 0xff and an additional size_t.

But then we are back with two kinds of strings again. Several of the
gains you suggested (e.g, a faster lua_getfield) go away.

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu


在 2019年6月18日,03:42,Roberto Ierusalimschy <[hidden email]> 写道:

>
>> And then, we can remap 92, 93, 95, 97, 98 to 01 04 06 07 08 during the sweep stage.
>
> I don't think we can do that. We remap one string from 92 to 01, but
> there may be other equal strings also numbered 92, which were not
> changed yet.  (Remember that the collector is incremental.)  Then, we
> compare this first string with one still with the original number, and it
> will go back to 92.

We need use the lower id after comparison , because choose the older one would be better , maybe there are many older different string objects with the same value , and the newer string object should use the older id, otherwise all the older objects’ id should change.

So the other 92 can be changed into 01 in this case.

And I have a new idea :

We can separate 32bit into two id space , one is 0~2^31 and another is -2^31~ -1 .

At first , we use the positive part , and we choose smaller id after comparing. when the id exceed 2^31, we switch to negative part.

At this time, we renumber the id in sweep stage of gc just by allocate a new negative id for each string alive.

After renumber, the rule is changed to choose bigger id after comparing until we need to switch id space next time.

2^31 is a very large range, so we seldom renumber .


>
>
>> For the length, we can still store 1 byte for small string length, if the string longer than 255 bytes, store 0xff and an additional size_t.
>
> But then we are back with two kinds of strings again. Several of the
> gains you suggested (e.g, a faster lua_getfield) go away.
>

There is no two kinds of string, it is just a variable length number for string length. we need the length only in first comparison.

And I think lua_getfield would be faster because we don’t need interning and can use the string in C side directly.
Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
> And I have a new idea :
>
> We can separate 32bit into two id space , one is 0~2^31 and another is -2^31~ -1 .
>
> At first , we use the positive part , and we choose smaller id after comparing. when the id exceed 2^31, we switch to negative part.
>
> At this time, we renumber the id in sweep stage of gc just by allocate a new negative id for each string alive.
>
> After renumber, the rule is changed to choose bigger id after comparing until we need to switch id space next time.
>
> 2^31 is a very large range, so we seldom renumber .
>

We don't need renumber all the strings in one gc cycle when switch id
space, because the id space is large enough. and we can use a small
cache
to remember the map of new/old id. So most of id after switching may
not need to recalculate.

--
http://blog.codingnow.com

Reply | Threaded
Open this post in threaded view
|

Re: Suggestion : Use unique string type instead of two (short and long)

云风 Cloud Wu
In reply to this post by Roberto Ierusalimschy


在 2019年6月17日,23:41,Roberto Ierusalimschy <[hidden email]> 写道:

>> Maybe a 32bit id is enough, because we can rearrange the id during the gc process. It’s a little complicated, but I think it’s possible.
>
> Can you be more specific? It is easy for the GC to set a watermark
> (e.g., to keep the highest/lower id still in use), but that does
> not guarantee anything. We can also renumber all strings, paying
> the price for a little overhead in the first comparisons after each
> GC cycle.
>


We can use two id spaces: old space [0, 0x7fffffff] and young space [0x80000000, 0xffffffff]

At start of each gc cycle , allocate id from 0x80000000,and all the new strings are young.

For each gc cycle, we only renumber young strings ( all or a part ) to old by allocating the id from 0 at sweep stage.

When we need  merge two id with the same values, we choose the lower one.

I think the highest old id would increase very slowly, because a young string may turns to old id when it compare to an old string with the same value.

When the highest old id exceed a very larger number, we can also renumber all to clear up.