Maybe bug - string representation of a literal mininteger

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Maybe bug - string representation of a literal mininteger

pocomane
It seems to me that there is an inconsistency in how lua handles the
min and the max integer literals. The following code ends without any
error both in Lua 5.3 and Lua 5.4.

```
-- Max integer, both literal and reference are ok
assert(tostring(math.maxinteger) == '9223372036854775807')
assert(tostring(9223372036854775807) == '9223372036854775807')
assert(math.type(9223372036854775807) == 'integer')

-- Min integer+1, literal is ok
assert(tostring(-9223372036854775807) == '-9223372036854775807')
assert(math.type(-9223372036854775807) == 'integer')

-- Min integer, reference is ok
assert(tostring(math.mininteger) == '-9223372036854775808')
assert(math.type(math.mininteger) == 'integer')

-- Min integer literal, BUG ???
assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
assert(math.type(-9223372036854775808) == 'float')
```

Is it a bug?

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Muh Muhten
On 11/27/18, pocomane <[hidden email]> wrote:
> -- Min integer literal, BUG ???
> assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
> assert(math.type(-9223372036854775808) == 'float')

The issue appears to be that the lexer sees '-' separately from the
numeral itself. As such, when reading the number, it must fit in a
non-negative integer, and is then constant-folded to the actual
negative. Incidentally, it appears that this corner case has already
been noticed and is dealt with by the %q format:

lstrlib.c
961:        const char *format = (n == LUA_MININTEGER)  /* corner case? */
962-                           ? "0x%" LUA_INTEGER_FRMLEN "x"  /* use hexa */
963-                           : LUA_INTEGER_FMT;  /* else use default format */

It seems to me that making this particular case work would be rather
involved, and essentially require negative numbers to be first-class
citizens in the grammar, rather than cobbled together through unary
minus and constant folding. I also don't see a satisfying solution
for, e.g. "- 9223372036854775808", "- --[[]]9223372036854775808",
"-(9223372036854775808)", though arguably those are *explicitly* the
negation of positive 9223372036854775808, which doesn't fit, and
really should be an integer.

In any case, the main uses I can see for having that number as a
constant and an integer are qua -2^63 and qua minimum integer. Perhaps
some variation on "-2^63|0", "-1<<63", "~(~0>>1)", or
"-0x8000000000000000" might be suitable? (Though that last one only
happens to work because 0x8000000000000000 == -0x8000000000000000 (64
bits) in 2's complement, and the unchecked overflow from casting to
signed after reading a hex literal may be undefined behavior.)

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

pocomane
On Tue, Nov 27, 2018 at 12:52 PM Muh Muhten <[hidden email]> wrote:
> It seems to me that making this particular case work would be rather
> involved, and essentially require negative numbers to be first-class
> citizens in the grammar, rather than cobbled together through unary
> minus and constant folding. I also don't see a satisfying solution
> for, e.g. "- 9223372036854775808", "- --[[]]9223372036854775808",
> "-(9223372036854775808)", though arguably those are *explicitly* the
> negation of positive 9223372036854775808, which doesn't fit, and
> really should be an integer.

I am just using - 9223372036854775807-1 in my code (in test suites,  I
prefer explicit number over math.mininteger). If it is complex to fix,
I think it could stay as it is. Maybe just a warning in the manual
could be useful.

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Coda Highland
On Tue, Nov 27, 2018 at 7:17 AM pocomane <[hidden email]> wrote:

>
> On Tue, Nov 27, 2018 at 12:52 PM Muh Muhten <[hidden email]> wrote:
> > It seems to me that making this particular case work would be rather
> > involved, and essentially require negative numbers to be first-class
> > citizens in the grammar, rather than cobbled together through unary
> > minus and constant folding. I also don't see a satisfying solution
> > for, e.g. "- 9223372036854775808", "- --[[]]9223372036854775808",
> > "-(9223372036854775808)", though arguably those are *explicitly* the
> > negation of positive 9223372036854775808, which doesn't fit, and
> > really should be an integer.
>
> I am just using - 9223372036854775807-1 in my code (in test suites,  I
> prefer explicit number over math.mininteger). If it is complex to fix,
> I think it could stay as it is. Maybe just a warning in the manual
> could be useful.
>

The way Lua does it is consistent with how a number of other
programming languages handle tokenization. I know C++ does it that
way, having implemented a C++ parser once upon a time.

There's also the complication of expressions like "2-1" -- is that a
2, a -, and a 1, or is that a 2 and a -1? Obviously the intent is the
former, but any language that generates a token stream from a lexer
that it uses to subsequently construct the syntax tree is going to
have this problem.

There IS a solution available that isn't confused by your first two
examples, but it's not particularly elegant: when a unary minus
expression is parsed, and the operand is a constant, the parser folds
them together in the syntax tree instead of emitting an actual
negation operation. (Your last example isn't just "arguably" the
explicit negation, the use of parentheses makes it DEFINITELY an
explicit negation because the parentheses mean you're demanding that
the expression inside is computed BEFORE you apply the unary minus.)

/s/ Adam

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Roberto Ierusalimschy
> There IS a solution available that isn't confused by your first two
> examples, but it's not particularly elegant: when a unary minus
> expression is parsed, and the operand is a constant, the parser folds
> them together in the syntax tree instead of emitting an actual
> negation operation. (Your last example isn't just "arguably" the
> explicit negation, the use of parentheses makes it DEFINITELY an
> explicit negation because the parentheses mean you're demanding that
> the expression inside is computed BEFORE you apply the unary minus.)

Java singles out this particular case :-)

    It is a compile-time error if a decimal literal of type int is
    larger than 2147483648 (2^31), or if the decimal literal 2147483648
    appears anywhere other than as the operand of the unary minus
    operator (§15.15.4).
    [...]
    It is a compile-time error if a decimal literal of type long is
    larger than 9223372036854775808L (2^63), or if the decimal literal
    9223372036854775808L appears anywhere other than as the operand of
    the unary minus operator (§15.15.4).

    https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html#jls-3.10.1

-- Roberto

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Dirk Laurie-2
In reply to this post by pocomane
Op Di. 27 Nov. 2018 om 12:32 het pocomane <[hidden email]> geskryf:

>
> It seems to me that there is an inconsistency in how lua handles the
> min and the max integer literals. The following code ends without any
> error both in Lua 5.3 and Lua 5.4.
>
> ```
> -- Max integer, both literal and reference are ok
> assert(tostring(math.maxinteger) == '9223372036854775807')
> assert(tostring(9223372036854775807) == '9223372036854775807')
> assert(math.type(9223372036854775807) == 'integer')
>
> -- Min integer+1, literal is ok
> assert(tostring(-9223372036854775807) == '-9223372036854775807')
> assert(math.type(-9223372036854775807) == 'integer')
>
> -- Min integer, reference is ok
> assert(tostring(math.mininteger) == '-9223372036854775808')
> assert(math.type(math.mininteger) == 'integer')
>
> -- Min integer literal, BUG ???
> assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
> assert(math.type(-9223372036854775808) == 'float')
> ```
> Is it a bug?

On my system (Ubuntu 18.04, amd64 processor), the second-last
assertion fails, but the last one succeeds.

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Andrew Gierth
>>>>> "Dirk" == Dirk Laurie <[hidden email]> writes:

 >> -- Min integer literal, BUG ???
 >> assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
 >> assert(math.type(-9223372036854775808) == 'float')

 Dirk> On my system (Ubuntu 18.04, amd64 processor), the second-last
 Dirk> assertion fails, but the last one succeeds.

The assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
is fragile in that it makes assumptions about the float output format
that may not be true - e.g. whether the exponent is given as e+018
or e+18 (which is what I get on freebsd) or e18.

--
Andrew.

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

pocomane
On Wed, Nov 28, 2018 at 1:00 AM Andrew Gierth
<[hidden email]> wrote:

>
> >>>>> "Dirk" == Dirk Laurie <[hidden email]> writes:
>
>  Dirk> On my system (Ubuntu 18.04, amd64 processor), the second-last
>  Dirk> assertion fails, but the last one succeeds.
>
> The assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
> is fragile in that it makes assumptions about the float output format
> that may not be true - e.g. whether the exponent is given as e+018
> or e+18 (which is what I get on freebsd) or e18.

Sorry, I think a better way to test it is:

```
assert(tostring(-9223372036854775808) ~= '-9223372036854775808')
```

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Philippe Verdy
In reply to this post by Muh Muhten
The parser can still initially parse the "-" token separately from the parsed value "9223372036854775808" which cannot be an integer and is then given a double value (hoping that the double will keep its precision, but even if a double has less bits, the truncated bits in this case are all zeroes.

So the double value 9223372036854775808 which is still a constant, can easily be tested when it appears next to the unary minur operator in the syntax tree which now as tow tokens: suring the "reduce" step of the parser: the unary minus and the double value which is still exactly exactly equal to 9223372036854775808.
This just requires an additional reduce rule in the syntaxic parser (not in the lexer) to resolve it as if it was a single precomputed negative integer constant for this specific case.

However this would mean that -922337203685477580.8e1 would turn to be parsed as an integer even if the source specified it explicitly as a double.

There are two ways to handle this:
- either the lexer does not parse numeric strings itself, and leaves the token as a string. The actual conversion to a datatype will be done in the parser.
- or the lexer parse the numeric strings and does not only return the token with attributes set to the numeric value, but also containing an "hint" indicator of which parser (integer or floating point) it used to reduce the numeric string to a resolved floating point constant. This approaches complicates only one parser rule:

unaryexpression ::= '-' INTCONSTANT
unaryexpression ::= '-' DOUBLECONSTANT
  { if (tokens[1].doublevalue == 9223372036854775808.0) {
       tokens[1].type = INTCONSTANT;
       tokens[1].intvalue = -9223372036854775808;
  }
unaryexpression ::= ....
unaryexpression ::= '-' (expression)

(here the "tokens[]" is some (C-like) array that access to properties of tokens returned by the lexer, and being reduced in the parsiing rules (whose number of tokens in the array is determined by the parsing rule, here there are 2 tokens) and stored and modified by the parser in its abstract syntax tree, assuming that tokens[i].type is one of the defined token type constants returned by the lexer which can also set tokens[i].intvalue or token[i].doublevalue, or token[i].stringvalue for NAME token types or for LITTERALSTRING token types).


Le mar. 27 nov. 2018 à 12:51, Muh Muhten <[hidden email]> a écrit :
On 11/27/18, pocomane <[hidden email]> wrote:
> -- Min integer literal, BUG ???
> assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
> assert(math.type(-9223372036854775808) == 'float')

The issue appears to be that the lexer sees '-' separately from the
numeral itself. As such, when reading the number, it must fit in a
non-negative integer, and is then constant-folded to the actual
negative. Incidentally, it appears that this corner case has already
been noticed and is dealt with by the %q format:

lstrlib.c
961:        const char *format = (n == LUA_MININTEGER)  /* corner case? */
962-                           ? "0x%" LUA_INTEGER_FRMLEN "x"  /* use hexa */
963-                           : LUA_INTEGER_FMT;  /* else use default format */

It seems to me that making this particular case work would be rather
involved, and essentially require negative numbers to be first-class
citizens in the grammar, rather than cobbled together through unary
minus and constant folding. I also don't see a satisfying solution
for, e.g. "- 9223372036854775808", "- --[[]]9223372036854775808",
"-(9223372036854775808)", though arguably those are *explicitly* the
negation of positive 9223372036854775808, which doesn't fit, and
really should be an integer.

In any case, the main uses I can see for having that number as a
constant and an integer are qua -2^63 and qua minimum integer. Perhaps
some variation on "-2^63|0", "-1<<63", "~(~0>>1)", or
"-0x8000000000000000" might be suitable? (Though that last one only
happens to work because 0x8000000000000000 == -0x8000000000000000 (64
bits) in 2's complement, and the unchecked overflow from casting to
signed after reading a hex literal may be undefined behavior.)

Reply | Threaded
Open this post in threaded view
|

Re: Maybe bug - string representation of a literal mininteger

Philippe Verdy
Better solution:

unaryexpression ::= '-' DOUBLECONSTANT
unaryexpression ::= '-' INTCONSTANT
  {
       //discard the AST generated in $0 for the reduction of $1 $2, replace it with a single token:
       $0.type = INTCONSTANT
       if ($1.intvalue < 0)
           $0.intvalue= $2.intvalue;
       else
          $0.intvalue = -$2.intvalue;
  }
unaryexpression ::= ....
unaryexpression ::= '-' (expression)

This way there's no risk of interpreting "-922337203685477580.8e1" as an integer, it will remain a double.

It is assumed that the lexer will return always positive values when parsing numeric strings and returning DOUBLECONSTANT and INTCONSTANT token types, except that when it will use the integer parser and will return an INTCONSTANT token type, it will set the token.intvalue directly, which will then turn to be negative for the specific MININT value. We can detect it easily during the reduction of the rule, to discard the unary minus, because the "intvalue" is already correct and does not need to be negated.

Le jeu. 29 nov. 2018 à 15:57, Philippe Verdy <[hidden email]> a écrit :
The parser can still initially parse the "-" token separately from the parsed value "9223372036854775808" which cannot be an integer and is then given a double value (hoping that the double will keep its precision, but even if a double has less bits, the truncated bits in this case are all zeroes.

So the double value 9223372036854775808 which is still a constant, can easily be tested when it appears next to the unary minur operator in the syntax tree which now as tow tokens: suring the "reduce" step of the parser: the unary minus and the double value which is still exactly exactly equal to 9223372036854775808.
This just requires an additional reduce rule in the syntaxic parser (not in the lexer) to resolve it as if it was a single precomputed negative integer constant for this specific case.

However this would mean that -922337203685477580.8e1 would turn to be parsed as an integer even if the source specified it explicitly as a double.

There are two ways to handle this:
- either the lexer does not parse numeric strings itself, and leaves the token as a string. The actual conversion to a datatype will be done in the parser.
- or the lexer parse the numeric strings and does not only return the token with attributes set to the numeric value, but also containing an "hint" indicator of which parser (integer or floating point) it used to reduce the numeric string to a resolved floating point constant. This approaches complicates only one parser rule:

unaryexpression ::= '-' INTCONSTANT
unaryexpression ::= '-' DOUBLECONSTANT
  { if (tokens[1].doublevalue == 9223372036854775808.0) {
       tokens[1].type = INTCONSTANT;
       tokens[1].intvalue = -9223372036854775808;
  }
unaryexpression ::= ....
unaryexpression ::= '-' (expression)

(here the "tokens[]" is some (C-like) array that access to properties of tokens returned by the lexer, and being reduced in the parsiing rules (whose number of tokens in the array is determined by the parsing rule, here there are 2 tokens) and stored and modified by the parser in its abstract syntax tree, assuming that tokens[i].type is one of the defined token type constants returned by the lexer which can also set tokens[i].intvalue or token[i].doublevalue, or token[i].stringvalue for NAME token types or for LITTERALSTRING token types).


Le mar. 27 nov. 2018 à 12:51, Muh Muhten <[hidden email]> a écrit :
On 11/27/18, pocomane <[hidden email]> wrote:
> -- Min integer literal, BUG ???
> assert(tostring(-9223372036854775808) == '-9.2233720368548e+018')
> assert(math.type(-9223372036854775808) == 'float')

The issue appears to be that the lexer sees '-' separately from the
numeral itself. As such, when reading the number, it must fit in a
non-negative integer, and is then constant-folded to the actual
negative. Incidentally, it appears that this corner case has already
been noticed and is dealt with by the %q format:

lstrlib.c
961:        const char *format = (n == LUA_MININTEGER)  /* corner case? */
962-                           ? "0x%" LUA_INTEGER_FRMLEN "x"  /* use hexa */
963-                           : LUA_INTEGER_FMT;  /* else use default format */

It seems to me that making this particular case work would be rather
involved, and essentially require negative numbers to be first-class
citizens in the grammar, rather than cobbled together through unary
minus and constant folding. I also don't see a satisfying solution
for, e.g. "- 9223372036854775808", "- --[[]]9223372036854775808",
"-(9223372036854775808)", though arguably those are *explicitly* the
negation of positive 9223372036854775808, which doesn't fit, and
really should be an integer.

In any case, the main uses I can see for having that number as a
constant and an integer are qua -2^63 and qua minimum integer. Perhaps
some variation on "-2^63|0", "-1<<63", "~(~0>>1)", or
"-0x8000000000000000" might be suitable? (Though that last one only
happens to work because 0x8000000000000000 == -0x8000000000000000 (64
bits) in 2's complement, and the unchecked overflow from casting to
signed after reading a hex literal may be undefined behavior.)