[Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

[Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

Patrick Donnelly
Hi Robert Paprocki,

(Sorry, I didn't get your email address, so lua-l gets to be the transmission medium :)

Here is a messy proof-of-concept for matching IP addresses against a
set of CIDRs at large scale. As we talked about, it works by
constructing a tree of CIDRs as an array and matching IP addresses
against it. (Note: the array is statically sized and covers the entire
32-bit IP address space as a 2GB allocation. However, this malloc'd
memory is mostly untouched, and thanks to memory overcommit on Linux,
the actual resident memory (RSS) is much less.)
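
For a rough idea of the shape, here is a simplified sketch -- not the
attached cidr-match.lua itself, whose exact layout may differ: a
heap-indexed binary trie over the address bits, one byte per node, in
a single calloc'd block. To keep it short it only handles prefixes up
to /30, which is what fits in 2GB.

        local ffi = require "ffi"
        local bit = require "bit"

        ffi.cdef[[
        void *calloc(size_t nmemb, size_t size);
        void free(void *ptr);
        ]]

        -- Heap-indexed binary trie over the address bits: node i has
        -- children 2*i and 2*i+1.  One byte per node; depths 0..30 fit
        -- in a 2GB block.  calloc only reserves the pages, so RSS grows
        -- only as nodes are actually touched.
        local DEPTH = 30
        local tree = ffi.gc(ffi.cast("uint8_t *",
                                     ffi.C.calloc(2^(DEPTH + 1), 1)),
                            ffi.C.free)

        -- mark the CIDR addr/len (len <= DEPTH) as matching
        local function insert(addr, len)
          local i = 1
          for pos = 31, 32 - len, -1 do
            i = 2 * i + bit.band(bit.rshift(addr, pos), 1)
          end
          tree[i] = 1
        end

        -- does any inserted CIDR cover addr?
        local function match(addr)
          local i = 1
          for pos = 31, 32 - DEPTH, -1 do
            if tree[i] == 1 then return true end
            i = 2 * i + bit.band(bit.rshift(addr, pos), 1)
          end
          return tree[i] == 1
        end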

I was able to achieve the following numbers (the insertion figures
include significant time spent generating random IP addresses via
math.random):

pdonnell@icewind ~/Downloads/LuaJIT-2.1.0-beta3/src$ ./luajit  ~/cidr-match.lua
CIDRs inserted per second (100000): 986747
IPv4 addr searched per second (100000): 838272
CIDRs inserted per second (1e+06): 2723912
IPv4 addr searched per second (1e+06): 2027817
CIDRs inserted per second (1e+07): 2589888
IPv4 addr searched per second (1e+07): 1621741

pdonnell@icewind ~/Downloads/LuaJIT-2.1.0-beta3/src$ luajit  ~/cidr-match.lua
CIDRs inserted per second (100000): 278164
IPv4 addr searched per second (100000): 154175
CIDRs inserted per second (1e+06): 336277
IPv4 addr searched per second (1e+06): 193654
CIDRs inserted per second (1e+07): 223287
IPv4 addr searched per second (1e+07): 117337

(For some reason, LuaJIT 2.0 -- the second run above -- is much slower
than 2.1.)

Here's the memory usage with 1 million CIDRs:

$ ./luajit  ~/cidr-match.lua
VmPeak:  2109684 kB
VmSize:  2109684 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:    683128 kB
VmRSS:    683128 kB
VmData:  2097480 kB
VmStk:       132 kB
VmExe:       480 kB
VmLib:      3364 kB
VmPTE:      1384 kB
VmPMD:        28 kB
VmSwap:        0 kB

Note the important line there, VmRSS, which indicates it's actually
using ~680MB for 1 million CIDRs... so not quite that small :).
Memory use for more typical workloads may be very different.

The code is rough and doesn't handle network byte order correctly,
but that should be easy enough to fix. Exercise for the reader and
all that.

Thanks for the fun programming challenge and thanks to Kong for
hosting the Lua workshop!

--
Patrick Donnelly

Attachment: cidr-match.lua (7K)

Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

Sean Conner
It was thus said that the Great Patrick Donnelly once stated:

> Here is a messy proof-of-concept for matching IP addresses against a
> set of CIDRs at large scale. As we talked about, it works by
> constructing a tree of CIDRs as an array and matching IP addresses
> against it. (Note: the array is statically sized and covers the entire
> 32-bit IP address space as a 2GB allocation. However, this malloc'd
> memory is mostly untouched, and thanks to memory overcommit on Linux,
> the actual resident memory (RSS) is much less.)

  There is a much more space-efficient way of storing this
information, and the retrieval time isn't bad (the average number of
comparisons is the average length of the netmask---worst case is 32,
the length of an entire IP address, regardless of the number of
entries).

  On the down side, while it's easy to implement, it's not quite so
easy to describe.  I'll try, with four-bit addresses.

  Assume the following addresses/masks:

        0000/0000 (our default match for "no match")
        1000/1000 addresses 8-15
        0100/1100 addresses 4-7
        0010/1111 address 2

So addresses 0, 1, and 3 (if my data is correct) should be "no
match".  A constructed tree will look like:

                  [*] 0000/0000
                  / \
                 0   1
                /     \
              [ ]     [*] 1000/1000
              / \    
             0   1  
            /     \
          [ ]     [*] 0100/1100
            \
             1
              \
              [ ]
              /
             0
            /
          [*] 0010/1111

Each node has a match flag (true or false) and links for the 0 and 1
bits.  When constructing, you use the mask to determine which bits of
the address are used for matching.  To find an address, start at the
root node:

        while true do
          if node.match then
            found = node                     -- deepest match seen so far
          end

          local nextbit = address:nextbit()  -- next bit, or nil when done

          if not nextbit then
            return found                     -- address bits exhausted
          end

          if not node[nextbit] then
            return found                     -- no child link for this bit
          end

          node = node[nextbit]
        end

Using an address of 1001, we start at the root.  There is a match, so we set
'found' to the root node.  There is another bit (1) so we follow that.  That
node is a match, so we set 'found' to that node.  There is another bit (0),
but there are no more nodes to check, so we return the last node we matched.

Let's try address 0011.  Again, there's a match at the root, so
'found' is the root.  There are more bits (0), so we follow that
link.  No match at that node, but there are still more bits (0), so
we follow again.  Again no match, and there are still more bits (1
this time), so we can still follow.  More bits (1), but this time
there is no link for that bit, so we return 'found', which is the
root node (no match).

  You can extend this out to 32 bits easily.  I'll leave constructing the
table as an exercise for the reader.
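
That said, here is a minimal sketch of one way to build it in Lua
(plain tables, children keyed by bit value; it assumes the LuaJIT
bit library and contiguous masks):

        local bit = require "bit"

        local root = {}                 -- the [*] 0000/0000 node

        -- Insert addr/mask (plain integers, 'bits' wide): descend one
        -- level per leading 1 bit of the mask, creating nodes as
        -- needed, then flag the final node as a match.
        local function insert(addr, mask, bits)
          local node = root
          for pos = bits - 1, 0, -1 do
            if bit.band(bit.rshift(mask, pos), 1) == 0 then break end
            local b = bit.band(bit.rshift(addr, pos), 1)
            node[b] = node[b] or {}
            node = node[b]
          end
          node.match = { addr = addr, mask = mask }
        end

        -- the four-bit example above:
        insert(0x0, 0x0, 4)     -- 0000/0000  default
        insert(0x8, 0x8, 4)     -- 1000/1000  addresses 8-15
        insert(0x4, 0xC, 4)     -- 0100/1100  addresses 4-7
        insert(0x2, 0xF, 4)     -- 0010/1111  address 2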

  -spc (Can you tell I implemented this?  Only in C, not Lua (it was a
        long time ago ... ))


Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

dyngeccetor8
On 10/18/2017 06:17 AM, Sean Conner wrote:

>   [addresses/masks table and tree diagram snipped]

Looks like a trie structure to me.

-- Martin


Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

William Ahern
On Thu, Oct 19, 2017 at 01:33:45PM +0100, dyngeccetor8 wrote:

> On 10/18/2017 06:17 AM, Sean Conner wrote:
> >   [addresses/masks table and tree diagram snipped]
>
> Looks like a trie structure to me.

More specifically, a crit-bit tree.

  https://cr.yp.to/critbit.html
 
 


Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

Russell Haley
On Thu, Oct 19, 2017 at 10:47 AM, William Ahern
<[hidden email]> wrote:

> On Thu, Oct 19, 2017 at 01:33:45PM +0100, dyngeccetor8 wrote:
>> On 10/18/2017 06:17 AM, Sean Conner wrote:
>> >   [addresses/masks table and tree diagram snipped]
>>
>> Looks like a trie structure to me.
>
> More specifically, a crit-bit tree.
>
>   https://cr.yp.to/critbit.html

I ❤ Daniel Bernstein. He's my hero.

Russ


Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

Eric Wing
Hello Robert Paprocki,

I had a couple of hand-wavy supplemental thoughts on your original
presentation… they may or may not be useful. (They may also be
inaccurate, as I am writing this from memory a day after the talk.)
Feel free to disregard.

Using the distinction that splits optimization into two buckets,
efficiency and performance, where efficiency is about avoiding work
(which is what your binary search does), my thoughts are on the
performance side: getting the most out of the hardware for the work
actually being done.

In your table of benchmarks in the presentation, for the smaller
numbers of addresses, you observed that a linear algorithm beat
binary search. You suspected this was due to the overhead of the JIT.
While there is probably a JIT overhead component, my guess is that
this is also an effect of memory locality in the hardware. If you are
not familiar with the issue, the gist is that the differential
between CPU clock speeds and memory bus speeds is now so great that
RAM access is the new disk I/O. A modern Intel CPU can do a sqrt in
something like 12 cycles, but fetching just 4 bytes from RAM can take
around 500 cycles, and the CPU stalls every time that happens. The
way to get more performance out of the hardware is to keep your data
in the CPU cache, where an L2 hit is ~12 cycles vs. ~500 for RAM. A
linear search through contiguous memory can be very fast because the
CPU prefetches ahead into the cache; a binary search, on the other
hand, can incur a cache miss on every jump.

So, _IF_ this is a significant component of the benchmarks you are
seeing, one additional optimization would be to harness memory
locality.

So my first thought is that you could implement a hybrid algorithm.
Start with your binary search (for efficiency), but stop dividing
when the remaining block of addresses is smaller than a certain
threshold (this number could be around the sizes in your benchmarks
where linear search beat binary search), then do a linear search
within that block. The hardware idea is that the remaining values sit
in a contiguous array of memory and you will be fetching them
predictably in order, so the CPU prefetcher will pull the values into
the cache in time for you to access them. The hope is (based on the
benchmarks you showed) that you get the efficiency win of avoiding
work by doing most of the binary search, while in the final stretch,
where the number of addresses is small and linear wins (in your
benchmarks), you take advantage of locality. (A large assumption in
my thinking is that your binary search, even over the smaller range,
still causes lots of cache misses because you are jumping around in
memory.)
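
To make that concrete, here is a hypothetical sketch; the sorted
uint32_t FFI array, the function names, and the threshold value are
all my assumptions, not your code:

        local THRESHOLD = 64  -- tune from the crossover in the benchmarks

        -- Index of the greatest element <= ip, or -1 if none;
        -- 'addrs' is a sorted FFI array of uint32_t with n entries.
        local function find(addrs, n, ip)
          local lo, hi = 0, n - 1
          -- binary phase: narrow until the window is cache-friendly
          while hi - lo > THRESHOLD do
            local mid = math.floor((lo + hi) / 2)
            if addrs[mid] <= ip then lo = mid else hi = mid - 1 end
          end
          -- linear phase: contiguous, in-order scan that the hardware
          -- prefetcher can stay ahead of
          local found = -1
          for i = lo, hi do
            if addrs[i] <= ip then found = i else break end
          end
          return found
        end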

There are also some variations of this idea…
For example, I'm assuming you are on server-class x86-64 machines
with lots of CPUs/cores, and possibly even running clusters. If you
could break up the list of addresses across your cores/CPUs, so that
each only needs to focus on a smaller subset of the list (basically a
parallel binary search), then you might be able to crunch the result
faster. I don't know how easy that is to do, especially with LuaJIT
(separate lua_States, or something like Lanes?).



Another idea is to exploit SIMD. What you are doing involves the same
integer comparison over and over again, so there may be an
opportunity to do the comparisons wide, especially in the linear
search phase. You should be able to do four 32-bit comparisons in one
shot for the same cost on any x86 hardware; with AVX-512, you could
do 16 comparisons in one shot. I don't know what kind of SIMD
support, if any, LuaJIT has.



Thank you and the rest of Kong for hosting the Lua Workshop.
- Eric Wing


Re: [Lua Workshop 2017] RE: Efficient Layer 7 Search of IP Address Space in LuaJIT/OpenResty

Javier Guerra Giraldez
On 19 October 2017 at 22:52, Eric Wing <[hidden email]> wrote:
> In your table of benchmarks in the presentation, for the smaller
> numbers of addresses, you observed that a linear algorithm beat
> binary search. You suspected this was due to the overhead of the JIT.
> While there is probably a JIT overhead component, my guess is that
> this is also an effect of memory locality in the hardware. If you are


I've done some experiments on this.

Unfortunately, the obvious way to write a binary search is very
hostile to LuaJIT, due to the way it builds individual traces.  The
condition at the center of a binary search is (by design) well
balanced, which means the loop doesn't converge to a "main" path with
seldom-executed side traces.  Instead, it creates a knot of short
traces tied into each other, unable to benefit from the biggest
optimizations available in LuaJIT.

Using bit arithmetic to replace the condition helps a lot, but
introduces other constraints.  I wasn't able to keep the central
variables in CPU registers, so the code still wasn't as fast as it
would be in C.
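
The shape of it is roughly this (an illustrative sketch, not the
actual test code; it assumes keys below 2^31 so that the 32-bit
difference cannot wrap -- full uint32 keys would need to be biased
first):

        local bit = require "bit"

        -- "Branchless" binary search: the comparison is turned into a
        -- mask (arithmetic shift of the difference) and folded into
        -- the index update, so the loop body is one straight-line
        -- trace with no balanced branch.
        local function find(addrs, n, ip)
          local base, len = 0, n
          while len > 1 do
            local half = bit.rshift(len, 1)
            -- lt is -1 (all ones) when ip < addrs[base + half], else 0
            local lt = bit.arshift(ip - addrs[base + half], 31)
            base = base + bit.band(half, bit.bnot(lt))
            len = len - half
          end
          return base   -- caller still checks addrs[base] <= ip
        end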

Directly comparing IPv6 addresses (128 bits) using only 64-bit
quantities reintroduces conditionals within the inner loops.  I think
the suggested strategy is to split IP addresses into layers.  I've
seen IPv4 (32-bit) done with an 8-16-8 split; for IPv6 I guess it
would have to be something like 48-64-16.
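
The 8-16-8 idea, as I understand it, is essentially nested lookup
tables created on demand.  A hypothetical sketch (exact host lookups
only; a prefix-matching version would expand each CIDR into the slots
it covers):

        local bit = require "bit"

        local function insert(root, ip, value)
          local a = bit.rshift(ip, 24)                   -- top 8 bits
          local b = bit.band(bit.rshift(ip, 8), 0xFFFF)  -- middle 16
          local c = bit.band(ip, 0xFF)                   -- low 8 bits
          root[a] = root[a] or {}
          root[a][b] = root[a][b] or {}
          root[a][b][c] = value
        end

        local function lookup(root, ip)
          local t = root[bit.rshift(ip, 24)]
          if not t then return nil end
          t = t[bit.band(bit.rshift(ip, 8), 0xFFFF)]
          if not t then return nil end
          return t[bit.band(ip, 0xFF)]
        end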

When doing the same in C, I tried a B*-tree approach with
cache-line-sized "blocks", using SSE operations to scan within each
block.  The idea is that the cache/memory split looks a lot like the
memory/disk split that makes B*-trees so popular for databases and
filesystems.  For IPv4 it makes some sense, since an AVX2 register
can hold 8 addresses and a cache line holds 16; but for IPv6 it's
only 2 and 4, respectively.  Worse, there is no 128-bit comparison,
so it's not easy to do operations "in parallel", SIMD style.

In the end, I used the simplest binary search, written in C, using
the non-standard-but-widely-supported __uint128_t integer type.  The
machine code generated for 128-bit comparisons avoids branches by
using conditional execution.  That's a trick most C compilers can do,
but only for very short snippets; in this case it's a single
instruction, so it neatly avoids code locality issues.


--
Javier