So I had this weird idea to solve the indexing problem

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

So I had this weird idea to solve the indexing problem

Soni "They/Them" L.
hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
well, other languages use 0-based indexing. but, what if we had neither?

the trivial solution for "neither" is just to use linked lists. and that
would, indeed, work, quite badly as well.but what if we come up with
something completely different instead?

rather than limiting ourselves to numeric representations of arrays, I
suggest a much more tangible (and yes tangible is the right term)
approach: filters.

consider the following table:

local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

(also consider as if t[1] etc don't work to access array indexes but
only hashtable indexes)

how do you iterate it? well you'd still use ipairs ofc and it'd be just
as fast, it just wouldn't use numbers. so you'd do "for value in
ipairs(table) do". additionally we can also remove array indexes from
pairs() because it doesn't make any sense to keep them there.

how would you do indexing? well, this is where things get interesting.
let's say you want to get that 5 there. you could do it like this:

local i = {false, false, false, false, true}
local v = table.get(t, i) -- v = 5

great! we no longer need numeric indexing! oh, wait...

uhh... how would we dynamically index things...

well, one thing is consistent between 0-based indexing and 1-based
indexing: in 0-based indexing, the length is always just beyond the last
valid index. in 1-based indexing, the length is the last valid index.
this happens to match up with the number of values in the array. as such
we can keep #t.

one option is to have "left, mid, right = table.bisect(t)". this is
quite efficient as well.

local left, mid, right = table.bisect(t) -- mid is nil
local left, mid, right = table.bisect(right) -- mid is 8
local left, mid, right = table.bisect(left) -- left is 6, right is 7,
mid is nil

(having a separate mid lets we ignore the problem of bisecting an odd
number and having to decide which side it'll go to.)

combine it with #t and you can trivially do 0-based or 1-based indexing
in your own code, and it just works. you can even mix them in the same
code, which is useful in some contexts.

okay so we have bisect and all but this isn't complete yet. to really
finish it off we need one more thing. well, four, actually:
table.pop_left, table.pop_right, table.push_left, table.push_right.
additionally, table.bisect should return views, not full arrays, both so
it's faster and so we can pop_left on a view and have it modify the
original table.

but anyway, this way, everyone's happy, because numeric indexing is no
more. or, well, actually, it's meant to make everyone unhappy I guess,
but you get the point xd.

Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Egor Skriptunoff-2
On Sun, Jun 9, 2019 at 2:06 PM Soni "They/Them" L. wrote:
hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
well, other languages use 0-based indexing. but, what if we had neither?

the trivial solution for "neither" is just to use linked lists. and that
would, indeed, work, quite badly as well.but what if we come up with
something completely different instead?

rather than limiting ourselves to numeric representations of arrays, I
suggest a much more tangible (and yes tangible is the right term)
approach: filters.

consider the following table:

local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

(also consider as if t[1] etc don't work to access array indexes but
only hashtable indexes)

how do you iterate it? well you'd still use ipairs ofc and it'd be just
as fast, it just wouldn't use numbers. so you'd do "for value in
ipairs(table) do". additionally we can also remove array indexes from
pairs() because it doesn't make any sense to keep them there.

how would you do indexing? well, this is where things get interesting.
let's say you want to get that 5 there. you could do it like this:

local i = {false, false, false, false, true}
local v = table.get(t, i) -- v = 5

great! we no longer need numeric indexing! oh, wait...

uhh... how would we dynamically index things...

well, one thing is consistent between 0-based indexing and 1-based
indexing: in 0-based indexing, the length is always just beyond the last
valid index. in 1-based indexing, the length is the last valid index.
this happens to match up with the number of values in the array. as such
we can keep #t.

one option is to have "left, mid, right = table.bisect(t)". this is
quite efficient as well.

local left, mid, right = table.bisect(t) -- mid is nil
local left, mid, right = table.bisect(right) -- mid is 8
local left, mid, right = table.bisect(left) -- left is 6, right is 7,
mid is nil

(having a separate mid lets we ignore the problem of bisecting an odd
number and having to decide which side it'll go to.)

combine it with #t and you can trivially do 0-based or 1-based indexing
in your own code, and it just works. you can even mix them in the same
code, which is useful in some contexts.

okay so we have bisect and all but this isn't complete yet. to really
finish it off we need one more thing. well, four, actually:
table.pop_left, table.pop_right, table.push_left, table.push_right.
additionally, table.bisect should return views, not full arrays, both so
it's faster and so we can pop_left on a view and have it modify the
original table.

but anyway, this way, everyone's happy, because numeric indexing is no
more. or, well, actually, it's meant to make everyone unhappy I guess,
but you get the point xd.



(This is off-topic)
Your proposal reminds me another idea about storing table's keys ordered by value instead of ordered by hash:

Usually Lua table consists of "array" part and "hash" part.
The non-"array" part should store its data in a different way: it should be reorganized from "hash" part into "tree" part.
All keys in "tree" part of each Lua table should be arranged in some fixed order (imagine that we could compare two arbitrary Lua values of arbitrary datatypes).
Some kind of balanced-tree data structure should be implemented to make all table operations having O(log(N)) time complexity, where N is the number of keys in a "tree" part of a table.

This approach has the following benefits:

#(1) Operations "get next/prev key" will be O(1), as always.
#(2) Operations "get leftmost/rightmost key" will be O(log(N)).
These two items mean that we could implement next() as fast as in Lua 5.3.
Please note that "get leftmost key" operation was not O(1) in Lua 5.3, see the following example:
   while next(t) do t[next(t)] = nil end

#(3) Operations "move M elements to the left/right" will be O(log(N)) for large M.
Hence, we could access a table in a specific way which is independent on the base index:
old style: tbl[1]   new style: get_leftmost(tbl)
old style: tbl[5]   new style: move_right(get_leftmost(tbl), 4)
Of course, this syntax is not neat, it just gives the idea of operation.

#(4) Operation "split diapason key1..key2 into two equal parts" will be O(log(N)).
The primitive "dichotomy the range of keys" might be useful for many cases, including parallelization.

#(5) Balanced trees are immune to DDOS attacks (unlike hash-tables).
Lua would not have to pseudo-randomly shuffle the order of keys used by next() function on each VM initialization. 
Sequence of keys generated by pairs() will be deterministic (unless tables are used as keys).
For example, it would be convenient if all numerical keys go in their natural order when traversed with pairs():
-1.5, 0, 1, 2, 3, math.pi, 4, ..., math.huge

#(6) Memory block size for storing a balanced tree may be not equal to a power of 2 (as for hash tables).
For example, when extending a "tree" part of a growing table, any memory block 2x...4x of the previous size would be OK.
Memory for storing a balanced tree also can be fragmented into several continuous chunks without making the implementation more slower or more complex (unlike hash-tables).
These would make memory manager's life a bit easier.


Drawbacks:
#(-1) Table indexing will be slightly slower (hash tables, despite of collisions when indexing, are faster than balanced trees).
#(-2) Tables would take more memory (trees consume more space than hash-tables).
#(-3) Didn't actually solve the "problem of accustomed 0-based arrays".

Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Philippe Verdy
Nothing prevents an implementation to use a balanced tree approach instead of a hash.
Both approaches can be used simultaneously, depending on statistics factors, notably:
- the fill factor density for the hash, which is supposed to have a flattened distribution (if it uses a good enough hashing function avoiding the nasty effect of very long collision lists on very few hash slots)
- the estimated needed size for the static hash table.
A balanced tree (B-tree variants) can be very compact even for storage. This also basically depends on the size of the B-tree page and the maximum "degree" per node. Its cost however is proportional to the time to sort each page, and also depends on the balancing factors (what is the minimum fill factor to keep, and how many pages must be rearranged). But when filling and emptying large tables, this would require lot of page reorganizations (merges/splits) and yes it will be much slower than what a hashtable needs to do (rearrangements in a collision list are trivial).
Another approach would be to  combine a second independant hashing functions to manage the collision list.

Le dim. 9 juin 2019 à 16:03, Egor Skriptunoff <[hidden email]> a écrit :
On Sun, Jun 9, 2019 at 2:06 PM Soni "They/Them" L. wrote:
hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
well, other languages use 0-based indexing. but, what if we had neither?

the trivial solution for "neither" is just to use linked lists. and that
would, indeed, work, quite badly as well.but what if we come up with
something completely different instead?

rather than limiting ourselves to numeric representations of arrays, I
suggest a much more tangible (and yes tangible is the right term)
approach: filters.

consider the following table:

local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

(also consider as if t[1] etc don't work to access array indexes but
only hashtable indexes)

how do you iterate it? well you'd still use ipairs ofc and it'd be just
as fast, it just wouldn't use numbers. so you'd do "for value in
ipairs(table) do". additionally we can also remove array indexes from
pairs() because it doesn't make any sense to keep them there.

how would you do indexing? well, this is where things get interesting.
let's say you want to get that 5 there. you could do it like this:

local i = {false, false, false, false, true}
local v = table.get(t, i) -- v = 5

great! we no longer need numeric indexing! oh, wait...

uhh... how would we dynamically index things...

well, one thing is consistent between 0-based indexing and 1-based
indexing: in 0-based indexing, the length is always just beyond the last
valid index. in 1-based indexing, the length is the last valid index.
this happens to match up with the number of values in the array. as such
we can keep #t.

one option is to have "left, mid, right = table.bisect(t)". this is
quite efficient as well.

local left, mid, right = table.bisect(t) -- mid is nil
local left, mid, right = table.bisect(right) -- mid is 8
local left, mid, right = table.bisect(left) -- left is 6, right is 7,
mid is nil

(having a separate mid lets we ignore the problem of bisecting an odd
number and having to decide which side it'll go to.)

combine it with #t and you can trivially do 0-based or 1-based indexing
in your own code, and it just works. you can even mix them in the same
code, which is useful in some contexts.

okay so we have bisect and all but this isn't complete yet. to really
finish it off we need one more thing. well, four, actually:
table.pop_left, table.pop_right, table.push_left, table.push_right.
additionally, table.bisect should return views, not full arrays, both so
it's faster and so we can pop_left on a view and have it modify the
original table.

but anyway, this way, everyone's happy, because numeric indexing is no
more. or, well, actually, it's meant to make everyone unhappy I guess,
but you get the point xd.



(This is off-topic)
Your proposal reminds me another idea about storing table's keys ordered by value instead of ordered by hash:

Usually Lua table consists of "array" part and "hash" part.
The non-"array" part should store its data in a different way: it should be reorganized from "hash" part into "tree" part.
All keys in "tree" part of each Lua table should be arranged in some fixed order (imagine that we could compare two arbitrary Lua values of arbitrary datatypes).
Some kind of balanced-tree data structure should be implemented to make all table operations having O(log(N)) time complexity, where N is the number of keys in a "tree" part of a table.

This approach has the following benefits:

#(1) Operations "get next/prev key" will be O(1), as always.
#(2) Operations "get leftmost/rightmost key" will be O(log(N)).
These two items mean that we could implement next() as fast as in Lua 5.3.
Please note that "get leftmost key" operation was not O(1) in Lua 5.3, see the following example:
   while next(t) do t[next(t)] = nil end

#(3) Operations "move M elements to the left/right" will be O(log(N)) for large M.
Hence, we could access a table in a specific way which is independent on the base index:
old style: tbl[1]   new style: get_leftmost(tbl)
old style: tbl[5]   new style: move_right(get_leftmost(tbl), 4)
Of course, this syntax is not neat, it just gives the idea of operation.

#(4) Operation "split diapason key1..key2 into two equal parts" will be O(log(N)).
The primitive "dichotomy the range of keys" might be useful for many cases, including parallelization.

#(5) Balanced trees are immune to DDOS attacks (unlike hash-tables).
Lua would not have to pseudo-randomly shuffle the order of keys used by next() function on each VM initialization. 
Sequence of keys generated by pairs() will be deterministic (unless tables are used as keys).
For example, it would be convenient if all numerical keys go in their natural order when traversed with pairs():
-1.5, 0, 1, 2, 3, math.pi, 4, ..., math.huge

#(6) Memory block size for storing a balanced tree may be not equal to a power of 2 (as for hash tables).
For example, when extending a "tree" part of a growing table, any memory block 2x...4x of the previous size would be OK.
Memory for storing a balanced tree also can be fragmented into several continuous chunks without making the implementation more slower or more complex (unlike hash-tables).
These would make memory manager's life a bit easier.


Drawbacks:
#(-1) Table indexing will be slightly slower (hash tables, despite of collisions when indexing, are faster than balanced trees).
#(-2) Tables would take more memory (trees consume more space than hash-tables).
#(-3) Didn't actually solve the "problem of accustomed 0-based arrays".

Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Philippe Verdy
However the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; the implementation, above some table size, could arrange to use a stronger cryptographic function (e.g. SHA512, but still SHA1 is still very resistant) which will ensure a really flattened distribution, instead of a simple modular arithmetic function with the multiplication by known static prime (like the simple Knut's hashing function).

Cryptographic hashes are now very fast, and very resistant to DDOS attacks (it is really extremely difficult to "unflatten" the distribution over any hash table sise of reasonnable size). Even with SHA1 or MD5, it would be very computing intensive to create overlong collision lists desequilibrating the distribution of hash keys. The hashing function does not need to create unique keys for unique values, and the cost of of collision lists is proportional to their maximum length (which is inversely proporitional to the size of the hash table, which is turn should grow in size according to the number of keys/size in the table to index).

Let's suppose that the first level of hash table is sqrt(N) where N is the maximum number of elements in the table, then the average length of collision lists (with a flattened distribution) is sqrt(N)/2, and a good hashing function would make the worst collision list length no logner than sqrt(N). So you can easily use a second layer of hashing and then reduce the worst collision list length to O(sqrt(sqrt(N))). The storage cost for the two hashing tables (using the two hasing functions) will still be 2*O(sqrt(N)), and it is still modest compared to the cost of a B-tree (with moderate fill factor) because of the free space left in its pages.

All these designs are just based on statistics metrics that the implementation can maintain. It has many choices ! Lua however does not mandate any hashing function for its tables, so the ordering of keys wahen traversing tables with pairs() is already not predictable allowing freedom of choice. If you needed a table with sorted keys, you can already do that in Lua using a separate table of keys, ordered in a sequence (from 1 to N) independantly of the implementation of tables.

Because Lua is already optimizing the storage of table keys that form a sequence (they are not stored in the hash, but their in values are directly stored in a separate vector indexed by integer, and theses integer keys are then very fast and don't cost any additional storage, that's why ipairs() is very fast).
Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Philippe Verdy
As well, you can easily create a generic "sortedpairs" iterator: it just has to use pairs() to collect keys (in random order) and store them in a sequence [the table.keys() function already do that for you], then sort that sequence [using table.sort()], and then iterate with ipairs() on that sorted sequence to get ordered keys that can be directly used to index the given table.

Of course, there's a cost: each time you use such iterator, it must create (=possible a large allocation, i.e. temporary storage cost) and sort (=time cost) the sequence of keys.

But an application can also be built and maintain an index or ordered table keys "on the flow" and maintain the sorted sequence each time a table element is added or removed with a modest cost per key added/removed (it will generally be a bit slower to do that incrementally, than just building and sorting all keys at once after filling from scratch a new large table with many keys).


Le lun. 10 juin 2019 à 12:58, Philippe Verdy <[hidden email]> a écrit :
However the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; the implementation, above some table size, could arrange to use a stronger cryptographic function (e.g. SHA512, but still SHA1 is still very resistant) which will ensure a really flattened distribution, instead of a simple modular arithmetic function with the multiplication by known static prime (like the simple Knut's hashing function).

Cryptographic hashes are now very fast, and very resistant to DDOS attacks (it is really extremely difficult to "unflatten" the distribution over any hash table sise of reasonnable size). Even with SHA1 or MD5, it would be very computing intensive to create overlong collision lists desequilibrating the distribution of hash keys. The hashing function does not need to create unique keys for unique values, and the cost of of collision lists is proportional to their maximum length (which is inversely proporitional to the size of the hash table, which is turn should grow in size according to the number of keys/size in the table to index).

Let's suppose that the first level of hash table is sqrt(N) where N is the maximum number of elements in the table, then the average length of collision lists (with a flattened distribution) is sqrt(N)/2, and a good hashing function would make the worst collision list length no logner than sqrt(N). So you can easily use a second layer of hashing and then reduce the worst collision list length to O(sqrt(sqrt(N))). The storage cost for the two hashing tables (using the two hasing functions) will still be 2*O(sqrt(N)), and it is still modest compared to the cost of a B-tree (with moderate fill factor) because of the free space left in its pages.

All these designs are just based on statistics metrics that the implementation can maintain. It has many choices ! Lua however does not mandate any hashing function for its tables, so the ordering of keys wahen traversing tables with pairs() is already not predictable allowing freedom of choice. If you needed a table with sorted keys, you can already do that in Lua using a separate table of keys, ordered in a sequence (from 1 to N) independantly of the implementation of tables.

Because Lua is already optimizing the storage of table keys that form a sequence (they are not stored in the hash, but their in values are directly stored in a separate vector indexed by integer, and theses integer keys are then very fast and don't cost any additional storage, that's why ipairs() is very fast).
Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Gé Weijers
In reply to this post by Philippe Verdy


On Mon, Jun 10, 2019 at 3:58 AM Philippe Verdy <[hidden email]> wrote:
However the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; the implementation, above some table size, could arrange to use a stronger cryptographic function (e.g. SHA512, but still SHA1 is still very resistant) which will ensure a really flattened distribution, instead of a simple modular arithmetic function with the multiplication by known static prime (like the simple Knut's hashing function).

SHA-512 is not very fast when hashing small values. SHA-512 pads the value to at least 1024 bits, and then evaluate an 80 round function to complete the hash, which takes over 1000 cycles on Intel Skylake for instance. Hashing random integers, floats and short strings will be very slow because you always have to hash at least 128 bytes, which is mostly padding and a length field.

Some newer Intel and AMD processors have specialized instructions to speed up SHA-256, but even then you have to process at least 512bits over 64 rounds.

Almost-universal hash families are a better candidate for this kind of thing if your threat model includes hash collision DOS attacks. You do not need most of the properties of hash functions like SHA-512 to build a DOS attack resistant hashmap.



Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Philippe Verdy
Of course, I chose the extreme; but combining a few simple Knuth's modular hashes, whose multiplers are different primes (which can even be also randomly selected from a known set of primes when creating the hashmap, or when reindeing it), is genrally enough to resist most attacks without much cost.
I said SHA* are fast by experience with today's implementations (may be they are slow in the current Lua-only libraries, but Lua should provide a set of secure hashes by default because it has many usages.
And if your Lua table is indexed by keys that are critical data which have to be secured for privacy reasons (notably against response time-based side channels), you have to think about secure hashes.
If your hash function is still using a single basic Knuth multiplier with a small prime (like 17 or 21), your table is easily attackable and it is better to choose a large enough prime (at least 15 bits).

Le lun. 10 juin 2019 à 19:50, Gé Weijers <[hidden email]> a écrit :


On Mon, Jun 10, 2019 at 3:58 AM Philippe Verdy <[hidden email]> wrote:
However the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; the implementation, above some table size, could arrange to use a stronger cryptographic function (e.g. SHA512, but still SHA1 is still very resistant) which will ensure a really flattened distribution, instead of a simple modular arithmetic function with the multiplication by known static prime (like the simple Knut's hashing function).

SHA-512 is not very fast when hashing small values. SHA-512 pads the value to at least 1024 bits, and then evaluate an 80 round function to complete the hash, which takes over 1000 cycles on Intel Skylake for instance. Hashing random integers, floats and short strings will be very slow because you always have to hash at least 128 bytes, which is mostly padding and a length field.

Some newer Intel and AMD processors have specialized instructions to speed up SHA-256, but even then you have to process at least 512bits over 64 rounds.

Almost-universal hash families are a better candidate for this kind of thing if your threat model includes hash collision DOS attacks. You do not need most of the properties of hash functions like SHA-512 to build a DOS attack resistant hashmap.



Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Pavel
In reply to this post by Soni "They/Them" L.
Soni &quot;They/Them&quot; L. wrote
> hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
> well, other languages use 0-based indexing. but, what if we had neither?

Use or not to use some particular language only because of array indexing is
absolutely stupid.

Not all the languages have 'native' hardware pointers and thus pointer
arithmetic, with the array index being an offset from the array address, but
not actually index.
e.g. Wolfram Mathematica, Mathlab, R, Julia ...

0-based arrays have last element, which is array[LENGTH-1], instead of
array[LENGTH]. what's the difference where to add/subtract this 1?

ppl could also use 0-based arrays if they like
for i=0,#array-1 do print(array[i]) end

imho nothing to fix here.



--
Sent from: http://lua.2524044.n2.nabble.com/Lua-l-f2524044.html

Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Oliver Kroth
Hi,

I guess it must be
for i=0, #array do ... end
as array[0] does not count in #

--
Oliver

> for i=0,#array-1 do print(array[i]) end
>
>


Reply | Threaded
Open this post in threaded view
|

Re: So I had this weird idea to solve the indexing problem

Philippe Verdy
True but one must remember that index(0) is not enumerated by ipair() which only treats indexes that are part of an uninterrupted ordinal "sequence" from 1 to N.

And Lua implementations have normally optimized the table storage for sequences (so that their keys are not stored) but not necessarily for key 0 (which may still be accessed via the hash, containing all or most keys that are not part of a "sequence" except possibly a few integer keys < 1 or > #t, which may still remain in the internal indexed array for performance reasons, to avoid resizing or moving this array too frequently, when it can keep some unused slots assigned with a "nil" value; that internal array may then be "a bit" larger in size than #t, and used if preferably before adding/removing keys in the separate hash; unallocating large ranges of unused slots at end of the array may be done conditionally, only above some volume, and while still keeping that array allocated within an aligned size in memory to make better use of what the native memory allocator may have reserved).

Such optimization of tables for their "sequence" part offers many performance benefits (with a very modest complexity in the internal code for get/set accessors), it will almost always save memory because keys don't have to be stored and are just implicit positions in the array, and will avoid also the hashing function (which is not necessarily trivial, even for integer keys), the management of the collision lists for each slot in the hashtable, and multiple tests in a loop. All that is needed is to check the key type (is it an integer?) and value (is it in the min..max range of the optimized part which contains all the "sequence"?).

Le mar. 11 juin 2019 à 11:06, Oliver Kroth <[hidden email]> a écrit :
Hi,

I guess it must be
for i=0, #array do ... end
as array[0] does not count in #

--
Oliver

> for i=0,#array-1 do print(array[i]) end
>
>