Rici Lake, your mail is broken [MAILER-DAEMON@zewt.org: Undelivered Mail Returned to Sender]

Rici Lake, your mail is broken [MAILER-DAEMON@zewt.org: Undelivered Mail Returned to Sender]

Glenn Maynard
Forwarded here because I can't mail Rici directly and I think the mail I
was replying to was sent privately unintentionally anyway.  Your mail is
broken; it's using a blacklist that denies legitimate mail (merely
because my mail server is not set up to forward pointlessly through an
unreliable third-party server).  Please use SpamAssassin or a similar
heuristic tool.

-- 
Glenn Maynard
--- Begin Message ---

  • Subject: Undelivered Mail Returned to Sender
  • From: MAILER-DAEMON@... (Mail Delivery System)
  • Date: Sun, 24 Sep 2006 23:05:20 -0400 (EDT)
This is the Postfix program at host c-65-96-98-57.hsd1.ma.comcast.net.

I'm sorry to have to inform you that your message could not
be delivered to one or more recipients. It's attached below.

For further assistance, please send mail to <postmaster>

If you do so, please include this problem report. You can
delete your own text from the attached returned message.

			The Postfix program

<[hidden email]>: host mail.ricilake.net[212.241.164.133] said: 554 5.7.1
    Rejected 65.96.98.57 found in dul.dnsbl.sorbs.net (in reply to RCPT TO
    command)
Reporting-MTA: dns; c-65-96-98-57.hsd1.ma.comcast.net
X-Postfix-Queue-ID: 791A41008EDB2
X-Postfix-Sender: rfc822; [hidden email]
Arrival-Date: Sun, 24 Sep 2006 23:05:09 -0400 (EDT)

Final-Recipient: rfc822; [hidden email]
Action: failed
Diagnostic-Code: X-Postfix; host mail.ricilake.net[212.241.164.133] said: 554
	5.7.1 Rejected 65.96.98.57 found in dul.dnsbl.sorbs.net (in
	reply to RCPT TO command)
--- Begin Message ---

  • Subject: Re: Lua 5.0 to 5.1 performance regression?
  • From: Glenn Maynard <glenn@...>
  • Date: Sun, 24 Sep 2006 23:05:09 -0400
(Did you mean to mail me privately?  If not, feel free to reply back to
the list.)

On Sun, Sep 24, 2006 at 09:20:02PM -0500, Rici Lake wrote:
> >>>   iFreeListRef = ref;
> >>>   if(ref == iMaxReference)
> >>>       --iMaxReference;
> >>
> >>I think that should be:
> >>
> >>    if (ref == iMaxReference)
> >>        --iMaxReference;
> >>    else
> >>        iFreeListRef = ref;
> >>
> >>Otherwise, you'll give out the same references twice.
> >
> >The logic is the same as in lauxlib; it's just forming a linked list
> >of indexes, where iFreeListRef is the head.
> 
> Yes, but lauxlib doesn't try to do that optimization (applicable if you
> are mostly using references as a stack.)

I think iMaxReference should actually never be decremented, since freed
reference indexes are never actually forgotten completely; they're always
either used or in the free list.  (I guess you turned it into an
optimization that does do that, which I suppose works too.)

(Better to just keep a native array of available indexes, anyway.)

> True, but it may be able to do better than the code generated by
> a cast. Now that I think about it, it probably only applies to 
> float->int
> coercions (so I should have said to use lua_tointeger).

True; lrintf() would be faster.  I think the difference is too small to
matter here (we're looking at per-loop timings in microseconds, not
nanoseconds).

> The only other thing I can think of is the pentium cache alignment 
> issue;
> I don't think that could be happening here because you're not doing
> any arithmetic, but in case it is, you might want to check by doing
> the test reffing k things before you start the loop, for k ranging
> from 0 to 5, and see if there are particular values of k which
> cause slowdowns. (There was a change to storage format of tables
> between 5.0 and 5.1, which causes the alignment problem to show
> up for different indices, although it always shows up every sixth
> element in a table or stack.)

Ick, that was it:

0.65user 0.00system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0.22user 0.00system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

That's pretty serious; it's a heisenbug generator, making code randomly
slow.  I assume there's no known good fix (or it'd be used); are there
any tradeoff fixes that will at least eliminate the unpredictability?  I
can live with a bit of memory waste and reduced cache efficiency to avoid
this (at least on x86 machines, which have the memory and large caches to
cope with it).

-- 
Glenn Maynard

--- End Message ---

--- End Message ---

pentium woes, was: broken mailers

Rici Lake-2

On 24-Sep-06, at 10:19 PM, Glenn Maynard wrote:

> Forwarded here because I can't mail Rici directly and I think the mail I
> was replying to was sent privately unintentionally anyway.  Your mail is
> broken; it's using a blacklist that denies legitimate mail (merely
> because my mail server is not set up to forward pointlessly through an
> unreliable third-party server).  Please use SpamAssassin or a similar
> heuristic tool.

Sorry. My email provider does use spamassassin, but with quite a
paranoid setting, I think. I wonder how much other mail I'm missing :(

> I think iMaxReference should actually never be decremented, since freed
> reference indexes are never actually forgotten completely; they're always
> either used or in the free list.  (I guess you turned it into an
> optimization that does do that, which I suppose works too.)
>
> (Better to just keep a native array of available indexes, anyway.)

I didn't look at the code all that carefully, it just looked wrong to me. :)

> > The only other thing I can think of is the pentium cache alignment
> > issue;
> > I don't think that could be happening here because you're not doing
> > any arithmetic, but in case it is, you might want to check by doing
> > the test reffing k things before you start the loop, for k ranging
> > from 0 to 5, and see if there are particular values of k which
> > cause slowdowns. (There was a change to storage format of tables
> > between 5.0 and 5.1, which causes the alignment problem to show
> > up for different indices, although it always shows up every sixth
> > element in a table or stack.)
>
> Ick, that was it:
>
> 0.65user 0.00system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0.22user 0.00system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k

Yeah, it's a pain. The best workaround is to use AMD chips instead of Intel.
(Next time you buy a computer :) ) -- I don't work for AMD or anything,
but it was enough to convince me to change my purchasing preference.

If you don't mind using up a bunch of ram, you can force Lua objects to
doubleword align. That will increase memory usage by up to one-third.
There's a configuration option in luaconf.h which uses "double" as an
alignment, but on x86, doubles are only 4-byte aligned so that doesn't
help. You can add a double-word aligned object to the union (depends on
your compiler how you do that.) You could also try making a lua number
an extended double, if your compiler supports that; I haven't tried
that but it should work on x86, where extended doubles are supposed
to occupy 12 bytes, even though they only use 10. Alternatively, you
can use floats, which reduces memory consumption, but it also loses
a lot of precision, particularly for integers.

On freebsd, where the garbage collector provides predictable alignment,
you can simply not use every sixth element of a table (but this doesn't
help with the stack.) However, you have to figure out which element not
to use, and it won't work on mallocs which are less power-of-2 oriented
anyway. Also, that doesn't help you with numbers stored on the stack.

The problem, in summary, is this: there is no penalty on a pentium for
doubles which are not aligned, unless the double spans
two cache-lines. In that case, there is an enormous penalty (which
depends on the chip.) There are some timing tests (quite old) here:
http://www.cygnus-software.com/papers/infinitytestresults.html
but I don't think things have changed that much.



Re: pentium woes, was: broken mailers

Glenn Maynard
On Sun, Sep 24, 2006 at 11:02:47PM -0500, Rici Lake wrote:
> Yeah, it's a pain. The best workaround is to use AMD chips instead of 
> Intel.
> (Next time you buy a computer :) ) -- I don't work for AMD or anything,
> but it was enough to convince me to change my purchasing preference.

That would be fine, if I only wrote programs to use myself.  :)

> If you don't mind using up a bunch of ram, you can force Lua objects to
> doubleword align. That will increase memory usage by up to one-third.
> There's a configuration option in luaconf.h which uses "double" as an
> alignment, but on x86, doubles are only 4-byte aligned so that doesn't
> help. You can add a double-word aligned object to the union (depends on
> your compiler how you do that.) You could also try making a lua number
> an extended double, if your compiler supports that; I haven't tried
> that but it should work on x86, where extended doubles are supposed
> to occupy 12 bytes, even though they only use 10. Alternatively, you
> can use floats, which reduces memory consumption, but it also loses
> a lot of precision, particularly for integers.

I tried __attribute__((aligned(16))) in various places around lua_TValue,
on lua_Number, and adding a dummy field to Value so TValue is 16 bytes,
but it didn't help.  Not sure what's wrong, I'm not familiar enough with
the code.

I'm also not sure if VC has an equivalent to attribute(aligned).

-- 
Glenn Maynard


Re: pentium woes, was: broken mailers

Rici Lake-2

On 24-Sep-06, at 11:29 PM, Glenn Maynard wrote:

> On Sun, Sep 24, 2006 at 11:02:47PM -0500, Rici Lake wrote:
> > Yeah, it's a pain. The best workaround is to use AMD chips instead of
> > Intel.
> > (Next time you buy a computer :) ) -- I don't work for AMD or anything,
> > but it was enough to convince me to change my purchasing preference.
>
> That would be fine, if I only wrote programs to use myself.  :)

Yeah, I know. I was just making a point, I guess.

> I'm also not sure if VC has an equivalent to attribute(aligned).


Me neither. I realized I'd forgotten a lot of the details of this problem.

Anyway, here's a test.

This just does the same loop, with different stack offsets. The slowdown mostly occurs in the for loop itself, so adding to the body of the for was probably unnecessary.

testpent.lua:

/usr/bin/time src/lua -e "for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af; for i = 1,1e7 do local j = i + 1 end"
/usr/bin/time src/lua -e "local a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0; for i = 1,1e7 do local j = i + 1 end"


Unmodified lua 5.1.1:

rlake@freeb:~/src/lua-5.1.1$ . testpent.lua
        0.53 real         0.52 user         0.00 sys
        0.26 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.26 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.25 real         0.24 user         0.01 sys
        0.25 real         0.25 user         0.00 sys
        0.26 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.25 real         0.25 user         0.00 sys
        0.40 real         0.40 user         0.00 sys
        0.53 real         0.53 user         0.00 sys
        0.32 real         0.31 user         0.00 sys
        0.35 real         0.34 user         0.00 sys
        0.52 real         0.51 user         0.00 sys

Edit: add at line 64, in the Value union:
  char dummy[12];

Retest:
rlake@freeb:~/src/lua-5.1.1$ . testpent.lua
        0.21 real         0.21 user         0.00 sys
        0.22 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.22 real         0.21 user         0.00 sys
        0.21 real         0.19 user         0.02 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.22 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys
        0.22 real         0.21 user         0.00 sys
        0.21 real         0.21 user         0.00 sys

The alignment configuration is for alignment of strings and (theoretically) userdata, except that on 32-bit architectures, the Udata type always ends up with a misaligned payload. So if you're putting doubles in a userdata on a 32-bit architecture, watch out. Unfortunately, there doesn't seem to be any way of forcing TValues to be aligned aside from declaring lua_Number to be a doubleword-aligned type, and I don't know how to do that on VC.



Re: pentium woes, was: broken mailers

Rici Lake-2
Oops, I forgot to say that edit was in lobject.h

On 25-Sep-06, at 12:10 AM, Rici Lake wrote:

> Edit: add at line 64, in the Value union:

By the way, this issue also shows up on Lua 5.0.2. The difference is that it shows up at different stack offsets. So you have to be really careful when you're comparing naive benchmarks between 5.0.2 and 5.1; the for loop itself (if that's what you're using) might be creating an artificial slow-down in one or the other case.



Re: pentium woes, was: broken mailers

Wim Couwenberg-2
In reply to this post by Glenn Maynard

> I'm also not sure if VC has an equivalent to attribute(aligned).

You can use __declspec(align(16)).  See:

    http://msdn2.microsoft.com/en-us/library/83ythb65.aspx

On a 32 bit architecture both Borland and Microsoft compilers use a default 8 byte alignment for doubles. (I think this differs from gcc based compilers that seem to use 4 byte alignment?) You can use the __alignof operator to check the alignment of a type.


    http://msdn2.microsoft.com/en-us/library/45t0s5f4.aspx

--
Wim

P.S.: Rici, I suspect you missed a mail or two from me. At least I didn't get a response. (I invited you to the workshop a/o.)


Re: pentium woes, was: broken mailers

Rici Lake-2
Sorry. I have been having lots of email problems, on top of everything else.

I'll whine at my email provider again next week.

I'm really sorry I missed the workshop. But I will come and visit sometime :)

R.

By the way, I don't think MS VC aligns doubles to doublewords by default, otherwise Glenn wouldn't have that problem. (Unless the fact that it's in a union somehow defeats the default, which would be really weird.) That page you pointed to helpfully says that it depends on the /Zp option, but doesn't seem to hint at what the default might be (aside from suggesting that you don't change it.) gcc indeed does use 4-byte alignment (although you can tell it to use 8 byte) because that's what the x86 ABI defines.

On 25-Sep-06, at 12:38 AM, Wim Couwenberg wrote:


> > I'm also not sure if VC has an equivalent to attribute(aligned).
>
> You can use __declspec(align(16)).  See:
>
>     http://msdn2.microsoft.com/en-us/library/83ythb65.aspx
>
> On a 32 bit architecture both Borland and Microsoft compilers use a default 8 byte alignment for doubles. (I think this differs from gcc based compilers that seem to use 4 byte alignment?) You can use the __alignof operator to check the alignment of a type.
>
>     http://msdn2.microsoft.com/en-us/library/45t0s5f4.aspx
>
> --
> Wim
>
> P.S.: Rici, I suspect you missed a mail or two from me. At least I didn't get a response. (I invited you to the workshop a/o.)



Re: pentium woes, was: broken mailers

Glenn Maynard
On Mon, Sep 25, 2006 at 12:52:14AM -0500, Rici Lake wrote:
> By the way, I don't think MS VC aligns doubles to doublewords by 
> default, otherwise Glenn wouldn't have that problem. (Unless the fact 
> that it's in a union somehow defeats the default, which would be really
> weird.) That page you pointed to helpfully says that it depends on the

I use both; I'm testing performance with gcc.

(Currently trying to find a way to get -fomit-frame-pointer when
compiling Lua as part of my larger Automake project, without turning
it on for everything, and grumbling about the mess that Lua 5.1 became
by dumping lua.c and other non-library source files right in src ...)

-- 
Glenn Maynard


Re: pentium woes, was: broken mailers

Mike Pall-65
Hi,

Rici Lake wrote:
> By the way, I don't think MS VC aligns doubles to doublewords by 
> default, otherwise Glenn wouldn't have that problem. (Unless the fact 
> that it's in a union somehow defeats the default, which would be really
> weird.) That page you pointed to helpfully says that it depends on the

The SYSV x86 ABI (on which all POSIX based x86 OS depend)
requires 4 byte alignment for doubles. The Windows x86 ABI (aka
WIN32) requires 8 byte alignment.

This is not an issue of the compiler. GCC with a WIN32 target
(MinGW or CygWin) aligns to 8 bytes just like MSVC/WIN32.

Glenn Maynard wrote:
> I tried __attribute__((aligned(16))) in various places around lua_TValue,
> on lua_Number, and adding a dummy field to Value so TValue is 16 bytes,
> but it didn't help.  Not sure what's wrong, I'm not familiar enough with
> the code.

See the patch below. This comes standard with LuaJIT because the
effect is much more pronounced there.

> I'm also not sure if VC has an equivalent to attribute(aligned).

Sort of. But you won't need it.

Bye,
     Mike
--- lua-5.1.1/src/luaconf.h	2006-04-10 20:27:23 +0200
+++ lua-5.1.1-aligned/src/luaconf.h	2006-09-25 12:00:00 +0200
@@ -581,6 +581,23 @@
 
 #endif
 
+
+/*
+@@ LUA_TVALUE_ALIGN specifies extra alignment constraints for the
+@@ tagged value structure to get better lua_Number alignment.
+** CHANGE it to an empty define if you want to save some space
+** at the cost of execution time. Note that this is only needed
+** for the x86 ABI on most POSIX systems, but not on Windows and
+** not for most other CPUs.
+*/
+
+#if defined(LUA_NUMBER_DOUBLE) && defined(__GNUC__) && \
+    (defined(__i386) || defined(__i386__)) && !defined(_WIN32)
+#define LUA_TVALUE_ALIGN	__attribute__ ((aligned(8)))
+#else
+#define LUA_TVALUE_ALIGN
+#endif
+
 /* }================================================================== */
 
 
--- lua-5.1.1/src/lobject.h	2006-01-18 12:37:34 +0100
+++ lua-5.1.1-aligned/src/lobject.h	2006-09-25 12:00:00 +0200
@@ -72,7 +72,7 @@
 
 typedef struct lua_TValue {
   TValuefields;
-} TValue;
+} LUA_TVALUE_ALIGN TValue;
 
 
 /* Macros to test type */

Re: Rici Lake, your mail is broken [MAILER-DAEMON@zewt.org: Undelivered Mail Returned to Sender]

Rob Kendrick
In reply to this post by Glenn Maynard
On Sun, 2006-09-24 at 23:19 -0400, Glenn Maynard wrote:
> Forwarded here because I can't mail Rici directly and I think the mail I
> was replying to was sent privately unintentionally anyway.  Your mail is
> broken; it's using a blacklist that denies legitimate mail (merely
> because my mail server is not set up to forward pointlessly through an
> unreliable third-party server).  Please use SpamAssassin or a similar
> heuristic tool.

It is hardly Rici's fault if your IP address is listed in real-time
black holes.  (See http://www.robtex.com/rbls/65.96.98.57.html )

I suspect our mail system would think twice about accepting mail from
you, too.  I suggest you lobby the RBLs in question to reduce the amount
of time dynamically allocated IPs stay on the list.

(Personally, I love the fact that DSL/Cable dynamic IPs are mostly
listed - it greatly reduces the amount of spam I receive from people who
don't know how to secure their Windows box.)

On another note, heuristic spam filters are becoming less and less
effective - in a way, they'll never win the battle, as the spammers are
always one step ahead.  Statistical systems are proving to be much more
effective now they've had a chance to develop.

(We use SpamAssassin, and apart from it using an enormous amount of CPU
time and bandwidth, it only catches about 90% of spams, which sounds
good but still lets hundreds through. :-/ )

B.



Re: pentium woes, was: broken mailers

Luiz Henrique de Figueiredo
In reply to this post by Glenn Maynard
> grumbling about the mess that Lua 5.1 became by dumping lua.c and
> other non-library source files right in src ...)

We thought it made things simpler... (It's also simpler to keep track of
dependencies.) On the other hand, src/Makefile tries to document all the
bits that make up the build. For instance, there's

CORE_O= lapi.o lcode.o ldebug.o ldo.o ldump.o lfunc.o lgc.o llex.o lmem.o \
        lobject.o lopcodes.o lparser.o lstate.o lstring.o ltable.o ltm.o  \
        lundump.o lvm.o lzio.o
LIB_O=  lauxlib.o lbaselib.o ldblib.o liolib.o lmathlib.o loslib.o ltablib.o \
        lstrlib.o loadlib.o linit.o

So, if you want to build only the core modules, you could do
	make a LIB_O=

(But you'll probably need to set PLAT or MYCFLAGS inside src/Makefile.)

Do you (or anyone else) have any suggestions on how to improve src/Makefile?
(Perhaps we could replace "all" in the platform-specific rules with $(ALL).)
--lhf


Re: pentium woes, was: broken mailers

Glenn Maynard
On Mon, Sep 25, 2006 at 08:05:58AM -0300, Luiz Henrique de Figueiredo wrote:
> > grumbling about the mess that Lua 5.1 became by dumping lua.c and
> > other non-library source files right in src ...)
> 
> We thought it made things simpler... (It's also simpler to keep track of
> dependencies.) On the other hand, src/Makefile tries to document all the
> bits that make up the build. For instance, there's

I added the Lua source to our CVS tree, and I had to tell people which files
to skip when adding them to their project files for each architecture.  This
is much simpler if library files aren't mixed in with standalone source files.

For most libraries this isn't an issue, since they're just built or installed
normally (so I don't care what their layout is), but Lua is unusual in
encouraging direct embedding, and it would be helpful for the directory
layout to facilitate that.

> Do you (or anyone else) have any suggestions on how to improve src/Makefile?
> (Perhaps we could replace "all" in the platform-specific rules with $(ALL).)

I'm not using the Makefile.  Not all platforms use make (Windows), and
for Unix targets, I'd have to jump through hoops for it to be run cleanly by our
Makefile.  (For example, making sure the "build outside of source tree"
scheme supported by automake works.)  I can minimize how exceptional Lua
is (which translates to the amount of effort it takes to bring up and
maintain archs) by adding the files directly to my projects and not trying
to treat it as a library.

(It's also nice for make to not recurse down into Lua's directory every
build.)

-- 
Glenn Maynard


Re: pentium woes, was: broken mailers

Rici Lake-2
In reply to this post by Mike Pall-65

On 25-Sep-06, at 5:17 AM, Mike Pall wrote:

> Hi,
>
> Rici Lake wrote:
> > By the way, I don't think MS VC aligns doubles to doublewords by
> > default, otherwise Glenn wouldn't have that problem. (Unless the fact
> > that it's in a union somehow defeats the default, which would be really
> > weird.) That page you pointed to helpfully says that it depends on the
>
> The SYSV x86 ABI (on which all POSIX based x86 OS depend)
> requires 4 byte alignment for doubles. The Windows x86 ABI (aka
> WIN32) requires 8 byte alignment.

I stand corrected. Good to know.

> See the patch below. This comes standard with LuaJIT because the
> effect is much more pronounced there.

It would probably be useful to have that patch in the standard distribution.



Re: Rici Lake, your mail is broken [MAILER-DAEMON@zewt.org: Undelivered Mail Returned to Sender]

David Jones-2
In reply to this post by Glenn Maynard

On 25 Sep 2006, at 04:19, Glenn Maynard wrote:

> > The only other thing I can think of is the pentium cache alignment
> > issue;
> > I don't think that could be happening here because you're not doing
> > any arithmetic, but in case it is, you might want to check by doing
> > the test reffing k things before you start the loop, for k ranging
> > from 0 to 5, and see if there are particular values of k which
> > cause slowdowns. (There was a change to storage format of tables
> > between 5.0 and 5.1, which causes the alignment problem to show
> > up for different indices, although it always shows up every sixth
> > element in a table or stack.)
>
> Ick, that was it:
>
> 0.65user 0.00system 0:00.66elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0.22user 0.00system 0:00.23elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
>
> That's pretty serious; it's a heisenbug generator, making code randomly
> slow.  I assume there's no known good fix (or it'd be used); are there
> any tradeoff fixes that will at least eliminate the unpredictability?  I
> can live with a bit of memory waste and reduced cache efficiency to avoid
> this (at least on x86 machines, which have the memory and large caches to
> cope with it).

Umm, ouch.

I remember making the reverse of this change, going from 8-byte alignment to 4-byte alignment of doubles on a PowerPC platform. The _architecture_ says 4-byte aligned doubles may not work, but they worked just fine on the silicon we were using with only a 1-cycle penalty when crossing a cache line boundary. We saved space by going to 4-byte alignment, and the (cache) space saved meant we went faster overall.

drj





Re: pentium woes, was: broken mailers

Pavel Kobel
In reply to this post by Glenn Maynard
Glenn Maynard <[hidden email]> writes:
>
> ... I'm not using the Makefile.  Not all platforms use make (Windows),
> and for Unix targets, I'd have to jump through hoops for it to be run cleanly
> by our Makefile.  (For example, making sure the "build outside of source
> tree" scheme supported by automake works.)

A good example of building outside the tree is SQLite3, which has a small,
tight source base.  It builds outside the tree by design when you ask.
SQLite is also made for deep embedding, as you say.

pk


Re: pentium woes, was: broken mailers

Glenn Maynard
In reply to this post by Mike Pall-65
On Mon, Sep 25, 2006 at 12:17:16PM +0200, Mike Pall wrote:
> See the patch below. This comes standard with LuaJIT because the
> effect is much more pronounced there.

With this patch, I'm still seeing a 4:1 change in the attached code,
depending on the setting of "fast".  Does it repro for you?

-- 
Glenn Maynard
extern "C"
{
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
}

int main()
{
	lua_State *L = lua_open();

	/* Store one string in the registry at integer key 1. */
	lua_pushstring( L, "testing" );
	int index = 1;
	lua_rawseti(L, LUA_REGISTRYINDEX, index++);

	/* Toggle this to compare timings; when true, key 2 is filled as well. */
	bool fast = true;
	if( fast )
	{
		lua_pushstring( L, "testing" );
		lua_rawseti(L, LUA_REGISTRYINDEX, index++);
	}

	/* Repeatedly set and clear a number at the next registry key. */
	for(int i = 0; i < 1000000; ++i)
	{
		lua_pushnumber( L, 1.5 );
		lua_rawseti(L, LUA_REGISTRYINDEX, index);
		lua_pushnil(L);
		lua_rawseti(L, LUA_REGISTRYINDEX, index);
	}

	return 0;
}


Re: pentium woes, was: broken mailers

Rici Lake-2

On 25-Sep-06, at 2:29 PM, Glenn Maynard wrote:

> On Mon, Sep 25, 2006 at 12:17:16PM +0200, Mike Pall wrote:
> > See the patch below. This comes standard with LuaJIT because the
> > effect is much more pronounced there.
>
> With this patch, I'm still seeing a 4:1 change in the attached code,
> depending on the setting of "fast".  Does it repro for you?

That's a bit of a chimera, I think. The registry is initially created with no array part at all and a hash part capable of containing 2 keys. Until the hash part overflows, the array part is not created. So your "fast" setting guarantees that the hash part will overflow on the first trip through the loop, at which point it will get an array part of size 4, which will speed up the remaining accesses in the loop.

Another reason why micro-benchmarks can be misleading, I guess.