'setobj' in lua-5.4.0-alpha-rc2 became much faster


重归混沌
I recently tested lua-5.4.0-alpha-rc2 and found that its 'forloop' is faster than the one in Lua 5.3.

The test code is:
`
collectgarbage("stop")
local function foo()
     local a = 3
     for i = 1, 64 * 1024 * 1024 do
         a = i
     end
     print(a)
end
foo()
`

In the end, I found that a magic modification (https://github.com/lua/lua/commit/4bb30f461b146e1d189ee301472953e948699acf) by Roberto greatly improves the efficiency of 'setobj' (a = i).

The modification is as follows:

`
     #define setobj(L,obj1,obj2) \
-    { TValue *io1=(obj1); *io1 = *(obj2); \
+    { TValue *io1=(obj1); const TValue *io2=(obj2); \
+         io1->value_ = io2->value_; io1->tt_ = io2->tt_; \
           (void)L; checkliveness(L,io1); }

`

I disassembled the code before and after this change, but I can't find the key point.

So, can anyone point me to the key? Thank you very much.

Sorry for my bad English.

Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

云风 Cloud Wu
重归混沌 <[hidden email]> wrote on Fri, 14 Jun 2019 at 2:33 PM:

> `
>      #define setobj(L,obj1,obj2) \
> -    { TValue *io1=(obj1); *io1 = *(obj2); \
> +    { TValue *io1=(obj1); const TValue *io2=(obj2); \
> +         io1->value_ = io2->value_; io1->tt_ = io2->tt_; \
>            (void)L; checkliveness(L,io1); }
>
> `
>
> I disassembled the code before and after this change, but I can't find the key point.
>
> So, can anyone point me to the key? Thank you very much.

Because sizeof(TValue) is 16 bytes on a 64-bit platform, but
sizeof(obj->value_) + sizeof(obj->tt_) is only 9 bytes.


--
http://blog.codingnow.com


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

重归混沌
I tested four sizes of 'tt_' with each struct assignment method; the results are amazing.

#define setobj_X(L,obj1,obj2) \
  { TValue *io1=(obj1); *io1 = *(obj2); \
         (void)L; checkliveness(L,io1); }

#define setobj_Y(L,obj1,obj2) \
  { TValue *io1=(obj1); const TValue *io2=(obj2); \
     io1->value_ = io2->value_; io1->tt_ = io2->tt_; \
     (void)L; checkliveness(L,io1); }


 typeof(tt_)    char     short    int      long
 setobj_X       0.55s    0.55s    0.55s    0.41s
 setobj_Y       0.52s    0.43s    0.42s    0.42s


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Javier Guerra Giraldez
In reply to this post by 云风 Cloud Wu
On Fri, 14 Jun 2019 at 11:16, 云风 Cloud Wu <[hidden email]> wrote:
>
> Because the sizeof(TValue) is 16 bytes on 64 bit platform, but
> sizeof(obj->value_) + sizeof(obj->tt_) is 9 bytes.

i'd also guess that copying the struct uses memcpy(), while the couple
of assignments might be optimised for the common case where the values
are already in some register.  even if memcpy() is frequently inlined,
it might need to flush down any value to memory before copying.

--
Javier


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Andrew Gierth
In reply to this post by 重归混沌
>>>>> ">" == 重归混沌  <[hidden email]> writes:

 >> I tested four sizes of 'tt_' with each struct assignment method; the
 >> results are amazing.

 >> #define setobj_X(L,obj1,obj2) \
 >>   { TValue *io1=(obj1); *io1 = *(obj2); \
 >>          (void)L; checkliveness(L,io1); }

 >> #define setobj_Y(L,obj1,obj2) \
 >>   { TValue *io1=(obj1); const TValue *io2=(obj2); \
 >>      io1->value_ = io2->value_; io1->tt_ = io2->tt_; \
 >>      (void)L; checkliveness(L,io1); }

How did you test this? because undoing this change breaks table
handling, since NodeKey aliases TValue in a way that isn't (as far as I
know) strictly kosher. (I get "table index is nil" errors from trying
it)

 >> Because the sizeof(TValue) is 16 bytes on 64 bit platform, but
 >> sizeof(obj->value_) + sizeof(obj->tt_) is 9 bytes.

There should be no performance difference between copying 9 and 16
bytes.

--
Andrew.


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

重归混沌


Andrew Gierth <[hidden email]> wrote on Fri, 14 Jun 2019 at 11:46 PM:
> How did you test this? because undoing this change breaks table
> handling, since NodeKey aliases TValue in a way that isn't (as far as I
> know) strictly kosher. (I get "table index is nil" errors from trying
> it)

I tested it in the Lua 5.3.4 source code.
 

Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Andrew Gierth
>>>>> ">" == 重归混沌  <[hidden email]> writes:

 >> I tested it in the Lua 5.3.4 source code.

Aha.

What compiler were you using? Because I just tried it with clang 8.0.0
and gcc8, and the results are pretty fascinating (but quite unexpected).
The newer version of setobj is _much_ faster for me on that test code (I
increased the iteration count to 512*1024*1024, and the timings are 3.0
seconds vs. 4.1 seconds(!)), but for reasons that seem to have basically
nothing to do with the code.

The test program compiles into bytecode that alternately executes
OP_MOVE and OP_FORLOOP. OP_MOVE is this, in the source code:

      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
        vmbreak;
      }

I'm getting this compiled code with the original setobj, using clang8
with -O2 -march=core2:

803           vmcase(OP_MOVE) {
804             setobjs2s(L, ra, RB(i));
   0x0000000000417390 <+256>:   shr    $0x13,%r11
   0x0000000000417394 <+260>:   and    $0xfffffff0,%r11d
   0x0000000000417398 <+264>:   mov    (%r9,%r11,1),%rax
   0x000000000041739c <+268>:   mov    0x8(%r9,%r11,1),%rcx
   0x00000000004173a1 <+273>:   jmpq   0x417b2c <luaV_execute+2204>

[...]

842           vmcase(OP_GETTABLE) {

[...]
   0x0000000000417b2c <+2204>:  mov    %rcx,0x8(%r12)
   0x0000000000417b31 <+2209>:  mov    %rax,(%r12)
   0x0000000000417b35 <+2213>:  jmpq   0x417330 <luaV_execute+160>
   0x0000000000417b3a <+2218>:  mov    %rbx,-0x68(%rbp)

1019            vmbreak;

In other words, the compiler is splitting up the load and the store in
OP_MOVE's setobj and implementing the store part by jumping to identical
code at the end of OP_GETTABLE. The new setobj on the other hand gives:

803           vmcase(OP_MOVE) {
804             setobjs2s(L, ra, RB(i));
   0x0000000000417160 <+256>:   shr    $0x13,%r10
   0x0000000000417164 <+260>:   and    $0xfffffff0,%r10d
   0x0000000000417168 <+264>:   mov    (%r8,%r10,1),%rax
   0x000000000041716c <+268>:   mov    %rax,(%r12)
   0x0000000000417170 <+272>:   mov    0x8(%r8,%r10,1),%eax
   0x0000000000417175 <+277>:   mov    %eax,0x8(%r8,%rbx,1)
   0x000000000041717a <+282>:   jmp    0x417100 <luaV_execute+160>

so there's no intermediate jump, just loads and stores.

You wouldn't expect one unconditional jump in a sequence like this to
make a huge difference, but analyzing the code with performance monitors
shows a very large difference in the number of resource stalls, and the
unconditional jmpq   0x417b2c <luaV_execute+2204> is the primary hotspot
for them.

--
Andrew.


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

重归混沌
I tested it with 'gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)'.

My disassembly doesn't look like yours (I used `gcc -g -O2 -c lvm.s` and `objdump -j .text -S lvm.o`).

the old setobj ------------------------>
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
    1f50: 41 c1 ed 17           shr    $0x17,%r13d
    1f54: 49 c1 e5 04           shl    $0x4,%r13
    1f58: 4b 8b 04 2f           mov    (%r15,%r13,1),%rax
    1f5c: 4b 8b 54 2f 08       mov    0x8(%r15,%r13,1),%rdx
    1f61: 48 89 03             mov    %rax,(%rbx)
    1f64: 48 89 53 08           mov    %rdx,0x8(%rbx)
    1f68: 48 8b 75 28           mov    0x28(%rbp),%rsi
        vmbreak;
    1f6c: e9 6f f3 ff ff       jmpq   12e0 <luaV_execute+0x40>
    1f71: 0f 1f 80 00 00 00 00 nopl   0x0(%rax)
        vmbreak;
      }
the new one<---------------------------------------------------------------
    vmfetch();
    vmdispatch (GET_OPCODE(i)) {
      vmcase(OP_MOVE) {
        setobjs2s(L, ra, RB(i));
    1f08: 41 c1 ed 17           shr    $0x17,%r13d
    1f0c: 49 c1 e5 04           shl    $0x4,%r13
    1f10: 4d 01 fd             add    %r15,%r13
        vmbreak;
      }
      vmcase(OP_LOADK) {
        TValue *rb = k + GETARG_Bx(i);
        setobj2s(L, ra, rb);
    1f13: 49 8b 45 00           mov    0x0(%r13),%rax
    1f17: 48 89 03             mov    %rax,(%rbx)
    1f1a: 41 8b 45 08           mov    0x8(%r13),%eax
    1f1e: 89 43 08             mov    %eax,0x8(%rbx)
    1f21: 48 8b 45 28           mov    0x28(%rbp),%rax
        vmbreak;
    1f25: e9 a6 f3 ff ff       jmpq   12d0 <luaV_execute+0x40>
    1f2a: 66 0f 1f 44 00 00     nopw   0x0(%rax,%rax,1)
        vmbreak;
      }
----------------------------------------------------------------------------------

In the disassembly, the differences are:

1. `mov    (%r15,%r13,1),%rax` vs. `add    %r15,%r13` and `mov    0x0(%r13),%rax`.
2. The old setobj does load, load, store, store; the new setobj does load, store, load, store.
3. The register size differs when assigning tt_ (64-bit %rdx in the old code, 32-bit %eax in the new code).

But it seems that no matter how the compiler generates the code, the new setobj is always faster than the old setobj.

Maybe the compiler is trying to hint at something?

By the way, can you tell me which `performance monitors` you used?
 



Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Andrew Gierth
>>>>> ">" == 重归混沌  <[hidden email]> writes:

 >> By the way, can you tell me which `performance monitors` you
 >> used?

I don't know what tools are available on linux; what I used (on FreeBSD)
was the hwpmc driver and pmcstat, which gets information from the
hardware performance counters in the CPU. The precise set of events that
can be monitored depends a lot on the CPU you're using, but can be quite
extensive.

Another oddity I'm seeing is that the old setobj is giving me timings
that can vary by over 20% based on the length of the command line and
environment - literally adding a few blank spaces can affect it. I'm
guessing this is some caching effect - aliasing between somewhere in the
stack and something else, I suppose. Still looking into that one as time
permits.

--
Andrew.


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

重归混沌
I modified the Lua test code, leaving the Lua 5.3.4 source code untouched, and setobj became faster.
`
collectgarbage("stop")
local function foo()
     local a = 3
     local b = 4
     for i = 1, 64 * 1024 * 1024 do
         a = b
     end
     print(a)
end
foo()
`
Most likely, the key point is the cache. But I found no answer in <<64-ia-32-architectures-optimization-manual>>.


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

Andrew Gierth
>>>>> ">" == 重归混沌  <[hidden email]> writes:

 >> I modified the Lua test code, leaving the Lua 5.3.4 source code untouched,
 >> and setobj became faster.

 >> Most likely, the key point is the cache. But I found no answer in
 >> <<64-ia-32-architectures-optimization-manual>>.

OK, I found out why this happens.

When the OP_FORLOOP code assigns a value to the visible loop variable,
it does so using two separate assignments (see the setivalue macro): one
to the value, one to the tag. The first is a 64-bit move, the second a
32-bit one. The 32-bit padding at the end of the TValue is not modified.

It turns out that when you write a 32-bit value, and then immediately
read the same location as a 64-bit or larger value, then this causes a
stall in the processor, even (apparently) if everything is hot enough to
already be in L1 cache. Presumably the pending store, which is on some
memory write pipeline, is treated as invalidating any memory fetch which
overlaps it.

If instead you write a 32-bit value and then immediately read it back
_as a 32-bit value_, then there is no stall, presumably because the
processor can fetch the whole value out of the write pipeline.

--
Andrew.


Re: 'setobj' in lua-5.4.0-alpha-rc2 become more faster

重归混沌
Yes, I think you are right. So the extra cost is a stall penalty.

I found the answer in <<64-ia-32-architectures-optimization-manual>>.

In the Lua 5.3.4 source code:

The last statement in 'vmcase(OP_FORLOOP)' is 'setivalue(ra+3, idx)', and 'setivalue' uses two separate assignments (one for `value_` and one for `tt_`; `value_` is 64 bits and `tt_` is 32 bits).

Then the Lua VM executes `vmcase(OP_MOVE): setobjs2s(L, ra, RB(i))`. With the old setobj, setobjs2s also uses two assignments, but both of them move 64-bit data, so the second one reads the 32-bit `tt_` as part of a 64-bit load.

From the <<Intel® 64 and IA-32 Architectures Optimization Reference Manual>> => Chapter 3.6.5 => Figure 3-3 => Condition (b):

`Size of Load > Store` incurs a penalty.
