Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Valerio
Hi,
i'm profiling an application that consists in a simple web-server
which rely on the splay runtime[1].
Linux's perf tool, version 3.13.9
I'm using a Linux 3.13.0-24-generic 64-bit machine, ubuntu 1.404 LTS,
running inside a KVM/qemu virtual machine, with 2 GB of Ram and 4
cores.

The client is  wget, which sends a big chunk (128 MB) of binary data
to the web-server.
I launch the application with: "perf record -g -v lua webserver_splay.lua".

Inspecting the results shows something that I cannot really explain:
https://gist.github.com/vschiavoni/cadd01d42557ce87d612 (only the most
heavy entries show).

What is it happening that triggers that many '__memcpy_sse2_unaligned' ?
Can anyone shed some light on this ?

Best regards,
Valerio

1 - https://gist.github.com/vschiavoni/315af2d2ea91876506a2
2 - https://github.com/splay-project/splay/tree/lua52

Reply | Threaded
Open this post in threaded view
|

Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Roberto Ierusalimschy
> What is it happening that triggers that many '__memcpy_sse2_unaligned' ?
> Can anyone shed some light on this ?

If I understood the report correctly, there is no indication that there
are too many '__memcpy_sse2_unaligned'; it is big only in comparison
with the rest. If all your server does is to move data around (e.g.,
it reads it from somewhere, creates a Lua string with it, and then writes
it somewhere else), it seems reasonable that memcpy will have a big
share of the total time.

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Valerio
Hello Roberto,
thanks for your explanation.

On Tue, Dec 9, 2014 at 3:36 PM, Roberto Ierusalimschy
<[hidden email]> wrote:
>> What is it happening that triggers that many '__memcpy_sse2_unaligned' ?

> If I understood the report correctly, there is no indication that there
> are too many '__memcpy_sse2_unaligned'; it is big only in comparison
> with the rest. If all your server does is to move data around (e.g.,
> it reads it from somewhere, creates a Lua string with it, and then writes
> it somewhere else),

Well, in my test-case, this is all the server does:

local data = clientsocket:receive(payload_size)

https://gist.github.com/vschiavoni/315af2d2ea91876506a2#file-webserver_splay-lua-L18

As you see, the data is read/received from a (non-blocking) LuaSocket
and then simply ignored until the end of the function.

On a 1Gbs-network, this single call to receive takes an average of 5.3
seconds when the payload_size is big (128MB).
Should I think that it takes sometime for  LuaSocket binding to copy
the received data back into the stack (somewhere here
https://github.com/diegonehab/luasocket/blob/master/src/buffer.c#L136
) ?


best,
valerio

Reply | Threaded
Open this post in threaded view
|

Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Achromat
SSE optimized memcpy not faster

http://software.intel.com/en-us/forums/topic/475426


Greetings
Karsten

http://flexxvision.de/luascript.html



-----Urspr√ľngliche Nachricht-----
From: Valerio Schiavoni
Sent: Tuesday, December 09, 2014 4:36 PM
To: Lua mailing list
Subject: Re: Understanding 'perf report' result lua 5.2.3:
__memcpy_sse2_unaligned ?

Hello Roberto,
thanks for your explanation.

On Tue, Dec 9, 2014 at 3:36 PM, Roberto Ierusalimschy
<[hidden email]> wrote:
>> What is it happening that triggers that many '__memcpy_sse2_unaligned' ?

> If I understood the report correctly, there is no indication that there
> are too many '__memcpy_sse2_unaligned'; it is big only in comparison
> with the rest. If all your server does is to move data around (e.g.,
> it reads it from somewhere, creates a Lua string with it, and then writes
> it somewhere else),

Well, in my test-case, this is all the server does:

local data = clientsocket:receive(payload_size)

https://gist.github.com/vschiavoni/315af2d2ea91876506a2#file-webserver_splay-lua-L18

As you see, the data is read/received from a (non-blocking) LuaSocket
and then simply ignored until the end of the function.

On a 1Gbs-network, this single call to receive takes an average of 5.3
seconds when the payload_size is big (128MB).
Should I think that it takes sometime for  LuaSocket binding to copy
the received data back into the stack (somewhere here
https://github.com/diegonehab/luasocket/blob/master/src/buffer.c#L136
) ?


best,
valerio


Reply | Threaded
Open this post in threaded view
|

Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Roberto Ierusalimschy
In reply to this post by Valerio
> On a 1Gbs-network, this single call to receive takes an average of 5.3
> seconds when the payload_size is big (128MB).
> Should I think that it takes sometime for  LuaSocket binding to copy
> the received data back into the stack (somewhere here
> https://github.com/diegonehab/luasocket/blob/master/src/buffer.c#L136
> ) ?

I guess you are in a better position than me to answer that question.
(Have a look at [1] too.)

[1] https://github.com/diegonehab/luasocket/blob/master/src/buffer.c#L193

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Understanding 'perf report' result lua 5.2.3: __memcpy_sse2_unaligned ?

Hisham
In reply to this post by Valerio
On 9 December 2014 at 13:36, Valerio Schiavoni
<[hidden email]> wrote:

> Hello Roberto,
> thanks for your explanation.
>
> On Tue, Dec 9, 2014 at 3:36 PM, Roberto Ierusalimschy
> <[hidden email]> wrote:
>>> What is it happening that triggers that many '__memcpy_sse2_unaligned' ?
>
>> If I understood the report correctly, there is no indication that there
>> are too many '__memcpy_sse2_unaligned'; it is big only in comparison
>> with the rest. If all your server does is to move data around (e.g.,
>> it reads it from somewhere, creates a Lua string with it, and then writes
>> it somewhere else),
>
> Well, in my test-case, this is all the server does:
>
> local data = clientsocket:receive(payload_size)
>
> https://gist.github.com/vschiavoni/315af2d2ea91876506a2#file-webserver_splay-lua-L18
>
> As you see, the data is read/received from a (non-blocking) LuaSocket
> and then simply ignored until the end of the function.
> On a 1Gbs-network, this single call to receive takes an average of 5.3
> seconds when the payload_size is big (128MB).

My guess is that you should probably be using LuaSocket's
sources/sinks/filters features (aka "ltn12" module) to perform this
read in a chunk-by-chunk basis instead of making one big 128MB
operation. LuaSocket is probably growing and growing its internal
buffer to make you that big string when in reality you don't ever need
one huge 128MB string in your code.

-- Hisham