So I had this weird idea to solve the indexing problem (10 messages)

So I had this weird idea to solve the indexing problem

hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing. well, other languages use 0-based indexing. but what if we had neither?

the trivial solution for "neither" is just to use linked lists, and that would indeed work, quite badly as well. but what if we come up with something completely different instead? rather than limiting ourselves to numeric representations of arrays, I suggest a much more tangible (and yes, tangible is the right term) approach: filters. consider the following table:

    local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

(also, consider as if t[i] etc. don't work to access array indexes, but only hashtable indexes.)

how do you iterate it? well, you'd still use ipairs ofc, and it'd be just as fast, it just wouldn't use numbers. so you'd do:

    for value in ipairs(t) do ... end

additionally, we can also remove array indexes from pairs(), because it doesn't make any sense to keep them there.

how would you do indexing? well, this is where things get interesting. let's say you want to get that 5 there. you could do it like this:

    local i = {false, false, false, false, true}
    local v = table.get(t, i) -- v = 5

great! we no longer need numeric indexing! oh, wait... uhh... how would we dynamically index things?

well, one thing is consistent between 0-based indexing and 1-based indexing: in 0-based indexing, the length is always just beyond the last valid index; in 1-based indexing, the length is the last valid index. either way, it matches the number of values in the array, so we can keep #t.

one option is to have "left, mid, right = table.bisect(t)". this is quite efficient as well:

    local left, mid, right = table.bisect(t)     -- mid is nil
    local left, mid, right = table.bisect(right) -- mid is 8
    local left, mid, right = table.bisect(left)  -- left is {6}, right is {7}, mid is nil

(having a separate mid lets us ignore the problem of bisecting an odd number of elements and having to decide which side the middle one goes to.)

combine it with #t and you can trivially do 0-based or 1-based indexing in your own code, and it just works. you can even mix them in the same code, which is useful in some contexts.

okay, so we have bisect and all, but this isn't complete yet. to really finish it off we need one more thing. well, four, actually: table.pop_left, table.pop_right, table.push_left, table.push_right. additionally, table.bisect should return views, not full arrays, both so it's faster and so we can pop_left on a view and have it modify the original table.

but anyway, this way, everyone's happy, because numeric indexing is no more. or, well, actually, it's meant to make everyone unhappy I guess, but you get the point xd.
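for concreteness, here is a rough sketch of the semantics of the proposed table.get and table.bisect. neither exists in stock Lua; internally the sketch still uses ordinary 1-based tables, since the point is only to illustrate the behavior. the proposal also asks for views into the original table, while for simplicity this returns copies:

```lua
-- get(t, filter): return the value at the position of the first
-- `true` in the boolean filter.
local function get(t, filter)
  for pos, flag in ipairs(filter) do
    if flag then return t[pos] end
  end
end

-- bisect(t): split t into a left half, an optional middle element,
-- and a right half. For odd lengths the middle element is returned
-- separately, so we never decide which side it belongs to.
local function bisect(t)
  local n = #t
  local half = n // 2            -- integer division, Lua 5.3+
  local left, mid, right = {}, nil, {}
  for i = 1, half do left[i] = t[i] end
  if n % 2 == 1 then mid = t[half + 1] end
  for i = 1, half do right[i] = t[n - half + i] end
  return left, mid, right
end

local t = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
print(get(t, {false, false, false, false, true}))  --> 5

local left, mid, right = bisect(t)  -- mid is nil
left, mid, right = bisect(right)    -- mid is 8
```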

Re: So I had this weird idea to solve the indexing problem


Re: So I had this weird idea to solve the indexing problem


Re: So I had this weird idea to solve the indexing problem

However, the nasty unbalanced hash trees only occur if you have not used a good enough hashing function; above some table size, the implementation could arrange to use a stronger cryptographic function (e.g. SHA-512, though SHA-1 is still very resistant), which will ensure a really flattened distribution, instead of a simple modular arithmetic function with multiplication by a known static prime (like the simple Knuth hashing function).

Cryptographic hashes are now very fast, and very resistant to DoS attacks (it is really extremely difficult to "unflatten" the distribution over any hash table of reasonable size). Even with SHA-1 or MD5, it would be very computationally intensive to create overlong collision lists unbalancing the distribution of hash keys. The hashing function does not need to create unique keys for unique values, and the cost of collision lists is proportional to their maximum length (which is inversely proportional to the size of the hash table, which in turn should grow according to the number of keys in the table to index).

Let's suppose that the first level of hash table has size sqrt(N), where N is the maximum number of elements in the table; then the average length of collision lists (with a flattened distribution) is sqrt(N)/2, and a good hashing function would make the worst collision list no longer than sqrt(N). So you can easily use a second layer of hashing and reduce the worst collision list length to O(sqrt(sqrt(N))). The storage cost for the two hash tables (using the two hashing functions) will still be 2*O(sqrt(N)), which is still modest compared to the cost of a B-tree (with moderate fill factor) because of the free space left in its pages.

All these designs are just based on statistical metrics that the implementation can maintain. It has many choices!
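Spelled out with concrete numbers, the two-level sizing arithmetic looks like this (N is just an example table size, not anything Lua-specific):

```lua
-- Worst-case chain lengths and storage for the two-level scheme.
local N = 65536                      -- keys to index
local buckets = math.sqrt(N)         -- first level: 256 buckets
local worst1  = math.sqrt(N)         -- worst chain after level 1: ~256
local worst2  = math.sqrt(worst1)    -- worst chain after level 2: ~16
local storage = 2 * math.sqrt(N)     -- both hash tables: ~512 slots
print(buckets, worst2, storage)
```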
Lua, however, does not mandate any hashing function for its tables, so the ordering of keys when traversing tables with pairs() is already not predictable, allowing freedom of choice in the implementation. If you need a table with sorted keys, you can already do that in Lua using a separate table of keys, ordered in a sequence (from 1 to N) independently of the implementation of tables.

Lua also already optimizes the storage of table keys that form a sequence: they are not stored in the hash part, their values are stored directly in a separate vector indexed by integer, so these integer keys are very fast and don't cost any additional storage. That's why ipairs() is very fast.
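The separate sorted-key sequence mentioned above takes only a few lines of plain Lua (a sketch; `t` is just example data):

```lua
-- Keep a sorted sequence of keys next to the hash-indexed table,
-- then iterate the sequence with ipairs() to visit keys in order.
local t = { banana = 2, apple = 1, cherry = 3 }

local keys = {}
for k in pairs(t) do keys[#keys + 1] = k end  -- unspecified order
table.sort(keys)                              -- now an ordered sequence

for _, k in ipairs(keys) do
  print(k, t[k])  -- apple 1 / banana 2 / cherry 3
end
```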

Re: So I had this weird idea to solve the indexing problem

As well, you can easily create a generic "sortedpairs" iterator: it just has to use pairs() to collect the keys (in unspecified order) and store them in a sequence [the table.keys() function already does that for you], then sort that sequence [using table.sort()], and then iterate with ipairs() on that sorted sequence to get ordered keys that can be used directly to index the given table.

Of course, there's a cost: each time you use such an iterator, it must create (= possibly a large allocation, i.e. temporary storage cost) and sort (= time cost) the sequence of keys.

But an application can also build and maintain an index of ordered table keys "on the fly", updating the sorted sequence each time a table element is added or removed, with a modest cost per key added or removed (it will generally be a bit slower to do that incrementally than to build and sort all the keys at once after filling a new large table from scratch).

On Mon, Jun 10, 2019 at 12:58, Philippe Verdy <[hidden email]> wrote:
> However the nasty unbalanced hash trees only occur if you have not used a
> good enough hashing function; [...]
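A minimal version of the sortedpairs iterator described above might look like this (a sketch; note that stock Lua has no table.keys(), so the keys are collected inline):

```lua
-- Generic sortedpairs: collect keys with pairs(), sort the sequence,
-- then walk it, returning each key together with its value.
local function sortedpairs(t, cmp)
  local keys = {}
  for k in pairs(t) do keys[#keys + 1] = k end
  table.sort(keys, cmp)
  local i = 0
  return function()
    i = i + 1
    local k = keys[i]
    if k ~= nil then return k, t[k] end
  end
end

local t = { c = 3, a = 1, b = 2 }
for k, v in sortedpairs(t) do print(k, v) end  -- a 1 / b 2 / c 3
```

The per-call cost is exactly what the post describes: one temporary sequence allocation plus one sort, every time the iterator is created.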

Re: So I had this weird idea to solve the indexing problem

In reply to this post by Philippe Verdy

On Mon, Jun 10, 2019 at 3:58 AM Philippe Verdy <[hidden email]> wrote:
> the implementation, above some table size, could arrange to use a stronger
> cryptographic function (e.g. SHA512 [...])

SHA-512 is not very fast when hashing small values. SHA-512 pads the value to at least 1024 bits, and then evaluates an 80-round function to complete the hash, which takes over 1000 cycles on an Intel Skylake, for instance. Hashing random integers, floats and short strings will be very slow, because you always have to hash at least 128 bytes, which is mostly padding and a length field.

Some newer Intel and AMD processors have specialized instructions to speed up SHA-256, but even then you have to process at least 512 bits over 64 rounds.

Almost-universal hash families are a better candidate for this kind of thing if your threat model includes hash-collision DoS attacks. You do not need most of the properties of hash functions like SHA-512 to build a DoS-resistant hashmap.

Gé
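As a sketch of the cheap randomized alternative, here is an illustrative multiply-shift hash with a per-table random multiplier. This is not a vetted universal family and not what Lua actually does; it only shows the shape of the idea:

```lua
-- Multiply-shift hashing with a secret random odd multiplier chosen
-- at table-creation time, so colliding keys cannot be precomputed
-- offline. Requires Lua 5.3+ (64-bit integers, bitwise operators).
math.randomseed(os.time())

local a = (math.random(0, 0x7fffffff) << 32)
          | math.random(0, 0x7fffffff)
          | 1                     -- force the multiplier to be odd

local function hash(x, bits)
  -- wrap-around multiply, then keep the top `bits` bits
  return (x * a) >> (64 - bits)
end

print(hash(12345, 16))  -- some bucket index in [0, 65535]
```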

Re: So I had this weird idea to solve the indexing problem

Of course, I chose the extreme; but combining a few simple Knuth modular hashes, whose multipliers are different primes (which can even be randomly selected from a known set of primes when creating the hashmap, or when reindexing it), is generally enough to resist most attacks without much cost.

I said SHA* are fast from experience with today's implementations (maybe they are slow in the current Lua-only libraries, but Lua should provide a set of secure hashes by default, because they have many usages). And if your Lua table is indexed by keys that are critical data which have to be secured for privacy reasons (notably against response-time-based side channels), you have to think about secure hashes.

If your hash function is still using a single basic Knuth multiplier with a small prime (like 17 or 19), your table is easily attackable, and it is better to choose a large enough prime (at least 15 bits).

On Mon, Jun 10, 2019 at 19:50, Gé Weijers <[hidden email]> wrote:
> SHA-512 is not very fast when hashing small values. [...]
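A sketch of that randomized-multiplier idea (the prime set here is only an example, borrowed from common hash-table size lists; it is not anything Lua ships):

```lua
-- Pick the Knuth-style multiplier at random from a known set of
-- primes when the map is created (or reindexed), so an attacker
-- cannot precompute colliding integer keys offline.
local PRIMES = { 201326611, 402653189, 805306457, 1610612741 }

local function new_hasher(nbuckets)
  local p = PRIMES[math.random(#PRIMES)]
  return function(k) return (k * p) % nbuckets end  -- integer keys
end

local h = new_hasher(256)
print(h(123))  -- some bucket index in [0, 255]
```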

Re: So I had this weird idea to solve the indexing problem

In reply to this post by Soni "They/Them" L.

Soni "They/Them" L. wrote
> hi, so, a lot of ppl won't use Lua because Lua uses 1-based indexing.
> well, other languages use 0-based indexing. but, what if we had neither?

Using or not using a particular language only because of its array indexing is absolutely stupid. Not all languages have 'native' hardware pointers, and thus pointer arithmetic where the array index is an offset from the array address rather than an actual index: e.g. Wolfram Mathematica, Matlab, R, Julia...

0-based arrays have a last element which is array[LENGTH-1] instead of array[LENGTH]. what's the difference where you add/subtract this 1? ppl could also use 0-based arrays if they like:

    for i = 0, #array - 1 do
      print(array[i])
    end

imho nothing to fix here.

--
Sent from: http://lua.2524044.n2.nabble.com/Lua-l-f2524044.html
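For what it's worth, anyone who really wants 0-based access in today's Lua can get most of the way there with a metatable wrapper (illustrative only; `zero_based` is a hypothetical helper, not a standard function):

```lua
-- 0-based array wrapper: redirect index i to slot i+1 of a hidden
-- 1-based store. __len on tables needs Lua 5.2+.
local function zero_based(items)
  local store = {}
  for i, v in ipairs(items) do store[i] = v end
  return setmetatable({}, {
    __index    = function(_, i) return store[i + 1] end,
    __newindex = function(_, i, v) store[i + 1] = v end,
    __len      = function() return #store end,
  })
end

local array = zero_based({ "a", "b", "c" })
print(array[0], array[#array - 1])  --> a  c
```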