Pseudo-Complete Lua Syntax

Gavin Kistner-4
(tl;dr: I am complaining about the lack of a proper formal spec, and also asking for feedback on spots that I'm attempting to formalize from prose, and help on the last couple open items.)

I'm attempting to write a full parser for Lua for CodeRay. Section 9 of the Reference Manual claims to contain "the complete syntax of Lua in extended BNF", but it has a few holes:

* There is no definition of the Name production. Based on section 3.1 I understand that expressed in regex it would be
/[a-z_]\w*/i

* There is no definition of the formal syntax of a String in the EBNF, though there is in the prose of section 3.1. I suppose that this regex should match, though I've not tested extensively. Any refinements are welcome:
/(?:"(?:[^\\"]|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*")|(?:'(?:[^\\']|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*')|(?:--\[(=*)\[\.+?\]\1\])/m

* There is no definition of the formal syntax of a Number. Based on experimentation, it looks like this might be a valid regex for matching a Lua number:
/-?\d*\.?\d+(e[-+]?\d+)?/i
Anyone see anything wrong with that?

* Many spots attempt to express literal statements without proper quoting. For example, the "stat" production should probably look like this, with 25 terminal strings denoted:

stat ::=  ‘;’ |
     varlist ‘=’ explist |
     functioncall |
     label |
     ‘break’ |
     ‘goto’ Name |
     ‘do’ block ‘end’ |
     ‘while’ exp ‘do’ block ‘end’ |
     ‘repeat’ block ‘until’ exp |
     ‘if’ exp ‘then’ block {‘elseif’ exp ‘then’ block} [‘else’ block] ‘end’ |
     ‘for’ Name ‘=’ exp ‘,’ exp [‘,’ exp] ‘do’ block ‘end’ |
     ‘for’ namelist ‘in’ explist ‘do’ block ‘end’ |
     ‘function’ funcname funcbody |
     ‘local function’ Name funcbody |
     ‘local’ namelist [‘=’ explist]

* Section 9 does not cover whitespace at all. Section 3.1 simply says,
> "[Lua] ignores spaces (including new lines) and comments between lexical elements (tokens), except as delimiters between names and keywords."

a) I know that "spaces" above is not just ASCII x20, but includes at least \t. Is it all 26 Unicode whitespace characters defined in http://en.wikipedia.org/wiki/Whitespace_character ? If not, what characters are considered whitespace by Lua?

b) It sure would be nice to include in the formal syntax where whitespace is required vs. optional. Yes, it makes it uglier. It also makes it actually useful, as opposed to a rough sketch open to interpretation.

I would appreciate help as to where whitespace is required versus optional.



Re: Pseudo-Complete Lua Syntax

Luiz Henrique de Figueiredo
The syntax in section 9 of the manual is based on the tokens described
in section 3.1. The tokens in section 9 are either reserved words shown
in bold or literal strings shown quoted. The exception as you have noted
are Name, String, and Number. The words not in bold are non-terminals.

There is indeed no formal grammar for the tokens; the textual description
seems much more useful than regexes.

The whitespace characters are the ones in ASCII, \t, \n, \v, \f, \r, ' ',
i.e., those with code 9, 10, 11, 12, 13, and 32.

The program below can help.
--lhf

/*
* ctype.c
* dump lctype table
*/

#include <ctype.h>
#include <stdio.h>
#include "lctype.h"

int main(void)
{
 int c;
 for (c=0; c<256; c++)
 {
  printf("%d\t%c",c,isprint(c)?c:' ');
  printf("\t"); if (lislalpha(c)) printf("lalpha");
  printf("\t"); if (lislalnum(c)) printf("lalnum");
  printf("\t"); if (lisdigit(c)) printf("digit");
  printf("\t"); if (lisxdigit(c)) printf("xdigit");
  printf("\t"); if (lisprint(c)) printf("print");
  printf("\t"); if (lisspace(c)) printf("space");
  printf("\n");
 }
 return 0;
}
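
For readers without the Lua sources at hand, the whitespace set listed in this message can be sketched in Python. This is only an illustration: the names LUA_SPACE and is_lua_space are made up here, and the code points are copied from the list above rather than obtained by running the C program.

```python
# Code points treated as whitespace, copied from the list in this message.
LUA_SPACE = frozenset([9, 10, 11, 12, 13, 32])

def is_lua_space(ch):
    """Return True if the single character ch is in the ASCII set above."""
    return ord(ch) in LUA_SPACE

# Python's str.isspace() also accepts Unicode spaces such as U+00A0,
# so it is a wider test than the ASCII set above.
for ch in ("\t", "\v", "\b", "\u00a0"):
    print(repr(ch), is_lua_space(ch), ch.isspace())
```

A scanner written for another regex engine would want an explicit character class for these six codes rather than a Unicode-aware \s.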


Re: Pseudo-Complete Lua Syntax

Gavin Kistner-4
On May 28, 2013, at 6:12 PM, Luiz Henrique de Figueiredo <[hidden email]> wrote:
> The syntax in section 9 of the manual is based on the tokens described
> in section 3.1. The tokens in section 9 are either reserved words shown
> in bold or literal strings shown quoted. The exception as you have noted
> are Name, String, and Number. The words not in bold are non-terminals.

Thanks for the response (and the language :).

I never realized that the formatting of bold was intended to convey literal strings, so that's helpful. (But not EBNF, nor described in the manual. Why rely on formatting to convey something already covered by EBNF, and that you use for some of them?)


> There is indeed no formal grammar for the tokens; the textual description
> seems much more useful than regexes.

I agree that a textual description is more useful than regexes. However, an EBNF that isn't rigorous isn't much use (to me, anyhow). Wouldn't you agree that having the (excellent, informative) prose *and* a real, robust, correct grammar in the spec would be better than the current situation?


> The whitespace characters are the ones in ASCII, \t, \n, \v, \f, \r, ' ',
> i.e., those with code 9, 10, 11, 12, 13, and 32.

Perfect information about what is considered whitespace, thank you.


However, regarding when and where whitespace is required vs optional: I humbly (and more clearly) reiterate that I think this should be included in a grammar somewhere. I truly do not understand how to rigorously apply the prose:
"[Lua] ignores spaces (including new lines) and comments between lexical elements (tokens), except as delimiters between names and keywords."

I am left with having to carefully consider and test the presence of whitespace before and after each and every surrounding terminal in the grammar to make my own determination based on my understanding of the language and behavior of the Lua interpreter.
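
One way to make the "delimiters" prose mechanical, offered as a rough sketch rather than anything taken from the manual: two adjacent tokens need whitespace between them exactly when their concatenation would be scanned as something other than those two tokens. The token pattern below is a deliberately simplified assumption (no strings, no long comments, no hexadecimal floats), and the names TOKEN, tokenize and needs_separator are hypothetical.

```python
import re

# Simplified token pattern; alternatives are tried in order.
TOKEN = re.compile(r"""
    [A-Za-z_][A-Za-z0-9_]*                  # names and reserved words
  | 0[xX][0-9A-Fa-f]+                       # hexadecimal integers (simplified)
  | [0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?     # decimal numerals: 3  3.  3.5e2
  | \.[0-9]+(?:[eE][+-]?[0-9]+)?            # decimal numerals: .5
  | --.*                                    # short comment to end of line
  | \.\.\. | \.\. | == | ~= | <= | >= | ::  # multi-character operators
  | \S                                      # any other single character
""", re.VERBOSE)

def tokenize(text):
    """Split text into tokens, skipping whitespace between them."""
    return TOKEN.findall(text)

def needs_separator(left, right):
    """True if writing left and right with nothing between them changes the tokens."""
    return tokenize(left + right) != [left, right]

for pair in [("local", "x"), ("x", "="), ("3", ".."), ("-", "-"), ("f", "(")]:
    print(pair, needs_separator(*pair))
```

Under these assumptions, whitespace turns out to be required in more places than "between names and keywords": a numeral followed by `..` and two consecutive `-` tokens also need a separator.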


Perhaps I'm being whiny because what I'm looking for hasn't been handed to me on a silver platter. However, as currently written, section 9 seems to me to be of very little benefit. About the only thing it seems good for, IMHO, is to lie to new users and grammar generators and make them think that the syntax of Lua is far simpler than it is. As it stands, the statement "Here is the complete syntax of Lua in extended BNF" is a lie. It is not complete, and it is not EBNF.

(Yes, that final language is intentionally inflammatory. I'm hoping to prod you right in the honor, with the goal of having more-precise, more-correct specifications. ;)

Re: Pseudo-Complete Lua Syntax

Andres Perera
On Tue, May 28, 2013 at 9:38 PM, Gavin Kistner <[hidden email]> wrote:

> On May 28, 2013, at 6:12 PM, Luiz Henrique de Figueiredo <[hidden email]> wrote:
>> The syntax in section 9 of the manual is based on the tokens described
>> in section 3.1. The tokens in section 9 are either reserved words shown
>> in bold or literal strings shown quoted. The exception as you have noted
>> are Name, String, and Number. The words not in bold are non-terminals.
>
> Thanks for the response (and the language :).
>
> I never realized that the formatting of bold was intended to convey literal strings, so that's helpful. (But not EBNF, nor described in the manual. Why rely on formatting to convey something already covered by EBNF, and that you use for some of them?)
>
>
>> There is indeed no formal grammar for the tokens; the textual description
>> seems much more useful than regexes.
>
> I agree that a textual description is more useful than regexes. However, an EBNF that isn't rigorous isn't much use (to me, anyhow). Wouldn't you agree that having the (excellent, informative) prose *and* a real, robust, correct grammar in the spec would be better than the current situation?

I would say that defining something like Lua/C strings with EBNF, and
even more so with regex, is awkward. I'm not sure where this
expectation is coming from, anyway, since string constants usually
aren't described this way.

This is a grammar for short form string literals. Notice how verbose
it is in comparison to the explanation in §3.1 "Lexical
Conventions":

s = dq | sq;

dq = '"', dqb, '"';

sq = "'", sqb, "'";

dqb = { dnc | ec };

dnc = ?? regex [^"\] ??;

sqb = { snc | ec };

snc = ?? regex [^'\] ??;

ec = '\',
   ( ?? regex [abfnrtv\"'z] ??
   | Newline
   | 'x', hexdigit, hexdigit
   | decdigit, [ decdigit, [ decdigit ] ] );
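
For comparison, the same short-string rules can be written as a regular expression. The Python sketch below is an assumption-laden illustration, not a tested replacement for the Lua lexer: it follows the Lua 5.2 prose (two hex digits after \x, up to three decimal digits, no raw line break inside the quotes), it does not check that a decimal escape stays below 256, and the names ESCAPE, SHORT_STRING and LONG_STRING are made up here.

```python
import re

# One escape sequence: a single-character escape, an escaped line break,
# \xXX with exactly two hex digits, or \ddd with one to three decimal digits.
ESCAPE = r'\\(?:[abfnrtvz\\"\']|\r\n?|\n\r?|x[0-9A-Fa-f]{2}|[0-9]{1,3})'

# Short strings: no raw line break and no bare backslash between the quotes.
DQ = r'"(?:[^"\\\r\n]|' + ESCAPE + r')*"'
SQ = r"'(?:[^'\\\r\n]|" + ESCAPE + r")*'"
SHORT_STRING = re.compile(DQ + "|" + SQ)

# Long bracket strings: [[...]], [=[...]=], and so on, with matching levels.
LONG_STRING = re.compile(r"\[(=*)\[.*?\]\1\]", re.DOTALL)

for text in ['"a\\"b"', "'it\\'s'", '"a\nb"', '"\\x4"']:
    print(repr(text), SHORT_STRING.fullmatch(text) is not None)
```

The long form is kept as a separate pattern so that its backreference stays `\1` regardless of how the short-string pattern is grouped.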


Re: Pseudo-Complete Lua Syntax

Dirk Laurie-2
In reply to this post by Luiz Henrique de Figueiredo
2013/5/29 Luiz Henrique de Figueiredo <[hidden email]>:

> There is indeed no formal grammar for the tokens; the textual description
> seems much more useful than regexes.

Does the reference manual actually say that a Name does not depend on
locale, it depends on lctype.h?


Re: Pseudo-Complete Lua Syntax

Pierre-Yves Gérardy
In reply to this post by Gavin Kistner-4
Two more things that are not specified in the grammar, but detailed in the text:

* the break statement is only legal in scopes that descend from a loop.
* In Lua 5.1, you can't put the first parenthesis of a function call
on a new line (the parser complains that the syntax is ambiguous,
where it really isn't). The restriction has been lifted in Lua 5.2.


-- Pierre-Yves


Re: Pseudo-Complete Lua Syntax

Luiz Henrique de Figueiredo
In reply to this post by Dirk Laurie-2
> Does the reference manual actually say that a Name does not depend on
> locale, it depends on lctype.h?

The manual says that Names can be any string of letters, digits, and
underscores, not beginning with a digit. It is assumed that everyone
knows what a letter is :-)

lctype.c contains a no-surprise table. It's there for two purposes:
to avoid the locale-dependent ctype table from the C library and
(secondarily) to allow it to be changed in cases one really needs to.
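
The Name rule described here can be sketched as a locale-independent check. This is a minimal illustration that assumes the default table accepts only ASCII letters, digits and the underscore; the reserved-word list is the Lua 5.2 one, and NAME, RESERVED and is_name are hypothetical names.

```python
import re

# Assumption: ASCII only, so the result never depends on the C locale.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Reserved words of Lua 5.2; they match NAME but are not valid Names.
RESERVED = frozenset("""
    and break do else elseif end false for function goto if in
    local nil not or repeat return then true until while
""".split())

def is_name(text):
    """True if text is a letter or underscore followed by letters, digits, underscores."""
    return NAME.fullmatch(text) is not None and text not in RESERVED

for text in ["_x1", "1x", "goto", "caf\u00e9"]:
    print(repr(text), is_name(text))
```

Spelling out the character class avoids relying on a regex engine's \w, which may or may not be Unicode-aware depending on the engine and flags.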


Re: Pseudo-Complete Lua Syntax

Roberto Ierusalimschy
In reply to this post by Gavin Kistner-4
> * There is no definition of the formal syntax of a String in the EBNF,
> though there is in the prose of section 3.1. I suppose that this regex
> should match, though I've not tested extensively. Any refinements are
> welcome:
> /(?:"(?:[^\\"]|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*")|(?:'(?:[^\\']|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*')|(?:--\[(=*)\[\.+?\]\1\])/m

I am afraid I cannot read this very well. But I think the '[^\\"]'
in the beginning should be at least '[^\\"\n\r]' (or do classes, by
default, exclude these characters?).

Also, the last part should not include '--' in the beginning.


> * There is no definition of the formal syntax of a Number. Based on experimentation, it looks like this might be a valid regex for matching a Lua number:
> /-?\d*\.?\d+(e[-+]?\d+)?/i
> Anyone see anything wrong with that?

- A '-' is not part of a number (for the lexer). Otherwise, x-3 would
be read as 'x' followed by '-3'.

- The definition of a numeral in Lua follows C (which, in retrospect,
may not have been a very good idea). So, things like "3." are correct,
too.

- Lua accepts hexadecimal numerals. (In 5.2, that includes floating-point
hexas too.)

-- Roberto
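
Taking the three points above together, a numeral pattern might look like the following Python sketch. It is an approximation written from the description in this message, not a transcription of the Lua lexer: no leading sign, a C-style optional digit run on either side of the dot, and hexadecimal forms with an optional binary exponent as in 5.2. The names DEC, HEX and NUMERAL are made up for the example.

```python
import re

# Decimal: digits with an optional dot and fraction ("3", "3.", "3.14"),
# or a dot followed by digits (".5"), then an optional exponent.
DEC = r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?"

# Hexadecimal: same shape with hex digits and a binary exponent marked by p.
HEX = r"0[xX](?:[0-9A-Fa-f]+\.?[0-9A-Fa-f]*|\.[0-9A-Fa-f]+)(?:[pP][+-]?[0-9]+)?"

# HEX comes first so that "0x1F" is not cut short after the leading "0".
NUMERAL = re.compile(HEX + "|" + DEC)

for text in ["3", "3.", ".5", "3.e-2", "0x1F", "0xA.8p-3", "-3", "0x", "3e"]:
    print(repr(text), NUMERAL.fullmatch(text) is not None)
```

The minus sign is deliberately left out: it is handled as a separate operator token, which is what keeps `x-3` a subtraction.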


Re: Pseudo-Complete Lua Syntax

Miles Bader-2
Roberto Ierusalimschy <[hidden email]> writes:
> - The definition of a numeral in Lua follows C (which, in retrospect,
> may not have been a very good idea). So, things like "3." are correct,
> too.

In which cases is it not a good idea...?

-miles

--
Kilt, n. A costume sometimes worn by Scotchmen [sic] in America and Americans
in Scotland.


Re: Pseudo-Complete Lua Syntax

Roberto Ierusalimschy
> Roberto Ierusalimschy <[hidden email]> writes:
> > - The definition of a numeral in Lua follows C (which, in retrospect,
> > may not have been a very good idea). So, things like "3." are correct,
> > too.
>
> In which cases is it not a good idea...?

- It creates silly conflicts, such as x = 3..7

- It complicates the lexer

- It complicates any informal or formal description of numerals

- It complicates patterns that have to match numerals

Also, one can argue whether writing 3. is any better than 3.0. (It saves
typing one character, but it is an easy one to type.)

Maybe a simpler definition, demanding at least one digit before and after
the dot, would be more "Lua-like".

-- Roberto
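
The `3..7` conflict can be illustrated with two patterns. This is a sketch of how a longest-match scanner would behave under each rule, not a claim about what the Lua lexer itself reports; C_STYLE, STRICT and first_token are names invented for the example.

```python
import re

# C-style rule: digits may be missing on one side of the dot ("3." and ".7").
C_STYLE = re.compile(r"(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?")

# Stricter rule suggested above: at least one digit on both sides of the dot.
STRICT = re.compile(r"[0-9]+(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

def first_token(pattern, text):
    """Return the numeral that a greedy scanner would take from the start of text."""
    return pattern.match(text).group()

print(first_token(C_STYLE, "3..7"))  # the first dot is absorbed into the numeral
print(first_token(STRICT, "3..7"))   # the numeral stops before the dots
```

With the stricter rule the remaining text is `..7`, so the concatenation operator survives intact; with the C-style rule the scanner is left with `.7`, which is the conflict being described.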


Re: Pseudo-Complete Lua Syntax

Dirk Laurie-2
2013/5/30 Roberto Ierusalimschy <[hidden email]>:

>
> - It creates silly conflicts, such as x = 3..7
>

The conflict confuses a lexer-followed-by-parser, but a simultaneous
lexer-parser like LPeg has no difficulty: 3..7 can match only one way.


Re: Pseudo-Complete Lua Syntax

Miles Bader-2
In reply to this post by Roberto Ierusalimschy
Roberto Ierusalimschy <[hidden email]> writes:
> Also, one can argue whether writing 3. is any better than 3.0. (It saves
> typing one character, but it is an easy one to type.)

Hmm, I use the "3." / ".1" style a lot, just because I think it's
prettier than "3.0" / "0.1"...

Not the best argument, but ... :]

-miles

--
`Suppose Korea goes to the World Cup final against Japan and wins,' Moon said.
`All the past could be forgiven.'   [NYT]


Re: Pseudo-Complete Lua Syntax

Philippe Lhoste
On 30/05/2013 11:29, Miles Bader wrote:
> Roberto Ierusalimschy <[hidden email]> writes:
>> Also, one can argue whether writing 3. is any better than 3.0. (It saves
>> typing one character, but it is an easy one to type.)
>
> Hmm, I use the "3." / ".1" style a lot, just because I think it's
> prettier than "3.0" / "0.1"...

I never use the 3. / .1 style, because I think it is much uglier than using an explicit
zero, and it is easier to miss the dot with small fonts and bad eyes...
That, and the fact I am French, so we use a comma as decimal separator, and a 3, / ,1
would be even uglier! Perhaps that's why that style doesn't look natural for me... :-)

Anyway, it is the same kind of debate that tabs vs. spaces, placement of opening brace,
braces vs. do / end, and so on... Only personal taste, globally.

--
Philippe Lhoste
--  (near) Paris -- France
--  http://Phi.Lho.free.fr
--  --  --  --  --  --  --  --  --  --  --  --  --  --



Re: Pseudo-Complete Lua Syntax

Gavin Kistner-4
In reply to this post by Roberto Ierusalimschy
On May 29, 2013, at 9:32 AM, Roberto Ierusalimschy <[hidden email]> wrote:
>> * There is no definition of the formal syntax of a String in the EBNF,
>> though there is in the prose of section 3.1. I suppose that this regex
>> should match, though I've not tested extensively. Any refinements are
>> welcome:
>> /(?:"(?:[^\\"]|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*")|(?:'(?:[^\\']|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*')|(?:--\[(=*)\[\.+?\]\1\])/m
>
> I am afraid I cannot read this very well. But I think the '[^\\"]'
> in the beginning should be at least '[^\\"\n\r]' (or classes, by default,
> excludes these characters?).

Ah, right you are, thank you.

> Also, the last part should not include '--' in the beginning.

Oops! Copy/paste from the long comment. Thanks.

>> * There is no definition of the formal syntax of a Number. Based on experimentation, it looks like this might be a valid regex for matching a Lua number:
>> /-?\d*\.?\d+(e[-+]?\d+)?/i
>> Anyone see anything wrong with that?
>
> - A '-' is not part of a number (for the lexer). Otherwise, x-3 would
> be read as 'x' followed by '-3'.
>
> - The definition of a numeral in Lua follows C (which, in retrospect,
> may not have been a very good idea). So, things like "3." are correct,
> too.

Interesting and helpful.

> - Lua accepts hexadecimal numerals. (In 5.2, that includes floating-point
> hexas too.)

Ah, yes, I had those covered in a separate section.


FWIW, in the end I abandoned my recursive descent parser due to the presence of both direct and indirect left recursion in the grammar, and the amount of rewriting of the grammar I would have needed to do. (The more I rewrite the grammar, the less chance there is of it being correct.) In the end I ended up with one hella big regex for simple-but-effective syntax highlighting :)

https://github.com/Phrogz/coderay/blob/master/lib/coderay/scanners/lua.rb#L31
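
For anyone hitting the same wall: the left recursion sits mainly in the prefixexp, var and functioncall productions, and the textbook workaround is to parse one primary expression and then loop over suffixes. The Python sketch below shows only that technique on a toy subset (a Name followed by `.Name`, `[Name]` or `()`); it is not the approach used in the scanner linked above, and parse_suffixed and its tuple-based tree are hypothetical.

```python
def parse_suffixed(tokens):
    """Parse Name { '.' Name | '[' Name ']' | '(' ')' } from a list of token strings.

    Returns (tree, number_of_tokens_consumed). The loop replaces the left-recursive
    rules, so the tree still nests to the left even though no function recurses.
    """
    if not tokens:
        raise SyntaxError("expected a name")
    tree = ("name", tokens[0])
    pos = 1
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "." and pos + 1 < len(tokens):
            tree = ("field", tree, tokens[pos + 1])
            pos += 2
        elif tok == "[" and tokens[pos + 2:pos + 3] == ["]"]:
            tree = ("index", tree, ("name", tokens[pos + 1]))
            pos += 3
        elif tok == "(" and tokens[pos + 1:pos + 2] == [")"]:
            tree = ("call", tree)
            pos += 2
        else:
            break  # not a suffix; leave the rest to the caller
    return tree, pos

print(parse_suffixed(["a", ".", "b", "[", "i", "]", "(", ")"]))
```

A full parser would call an expression parser inside the brackets and parentheses, but the shape of the loop stays the same.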

Re: Pseudo-Complete Lua Syntax

Tim Hill
Interesting discussion. To my mind this shows why it's NOT always a good idea to make everything pure BNF. Sometimes plain old English does it better. I think the compromise in the Lua ref guide is pretty sensible.

--Tim

On Jun 1, 2013, at 9:19 AM, Gavin Kistner <[hidden email]> wrote:

> On May 29, 2013, at 9:32 AM, Roberto Ierusalimschy <[hidden email]> wrote:
>>> * There is no definition of the formal syntax of a String in the EBNF,
>>> though there is in the prose of section 3.1. I suppose that this regex
>>> should match, though I've not tested extensively. Any refinements are
>>> welcome:
>>> /(?:"(?:[^\\"]|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*")|(?:'(?:[^\\']|\\[abfnrtvz\\"']|\\\n|\\\d{1,3}|\\x[\da-fA-F]{2})*')|(?:--\[(=*)\[\.+?\]\1\])/m
>>
>> I am afraid I cannot read this very well. But I think the '[^\\"]'
>> in the beginning should be at least '[^\\"\n\r]' (or classes, by default,
>> excludes these characters?).
>
> Ah, right you are, thank you.
>
>> Also, the last part should not include '--' in the beginning.
>
> Oops! Copy/paste from the long comment. Thanks.
>
>>> * There is no definition of the formal syntax of a Number. Based on experimentation, it looks like this might be a valid regex for matching a Lua number:
>>> /-?\d*\.?\d+(e[-+]?\d+)?/i
>>> Anyone see anything wrong with that?
>>
>> - A '-' is not part of a number (for the lexer). Otherwise, x-3 would
>> be read as 'x' followed by '-3'.
>>
>> - The definition of a numeral in Lua follows C (which, in retrospect,
>> may not have been a very good idea). So, things like "3." are correct,
>> too.
>
> Interesting and helpful.
>
>> - Lua accepts hexadecimal numerals. (In 5.2, that includes floating-point
>> hexas too.)
>
> Ah, yes, I had those covered in a separate section.
>
>
> FWIW, in the end I abandoned my recursive descent parser due to the presence of both direct and indirect left recursion in the grammar, and the amount of rewriting of the grammar I would have needed to do. (The more I rewrite the grammar, the less chance there is of it being correct.) In the end I ended up with one hella big regex for simple-but-effective syntax highlighting :)
>
> https://github.com/Phrogz/coderay/blob/master/lib/coderay/scanners/lua.rb#L31



Re: Pseudo-Complete Lua Syntax

Dirk Laurie-2
2013/6/1 Tim Hill <[hidden email]>:
> To my mind this shows why it's NOT always a good idea to make
> everything pure BNF.

An interesting alternative is to use (preferably tested) LPeg.
It's at least as readable as BNF.


Re: Pseudo-Complete Lua Syntax

Gavin Kistner-4
In reply to this post by Tim Hill
On Jun 1, 2013, at 3:17 PM, Tim Hill <[hidden email]> wrote:
> Interesting discussion. To my mind this shows why it's NOT always a good idea to make everything pure BNF. Sometimes plain old English does it better. I think the compromise in the Lua ref guide is pretty sensible.

I'm surprised that you came away with that conclusion from my experiences. The summary in my mind of what happened is:
* I tried to use the EBNF
* I found that I could not because it was not at all complete
* LHF and Roberto very graciously clarified some rules, some of which I overlooked in the prose, some of which are not in the Reference.
* I gave up in part because the EBNF was useless as-is for computer-based parsing, and had to settle for something far less rigorous, and far more error prone.

As before, I wholly agree that the prose is preferable for humans.
But I still maintain that a "complete" grammar that is not complete is not helpful for computers, and is not helpful for humans.

A proper, rigorous, exact (E)BNF or LPeg grammar is still desirable, IMHO.

Re: Pseudo-Complete Lua Syntax

Tim Hill

On Jun 1, 2013, at 1:36 PM, Gavin Kistner <[hidden email]> wrote:

> As before, I wholly agree that the prose is preferable for humans.
> But I still maintain that a "complete" grammar that is not complete is not helpful for computers, and is not helpful for humans.

Actually I feel the existing BNF *is* helpful for humans (it is for me, anyway, and I've always assumed I'm human), and my guess is that was the intent in the Lua BNF. It's been a few years, but the last time I read the C standard the grammar as "formally" presented there was also essentially there to clarify the text *for humans*.

One problem with BNF (and other meta-grammars) imho is they stumble when trying to formalize some things that are relatively easy to express in prose. Things like whitespace rules that, for Lua, are expressed in prose and not the BNF, presumably for that reason. (I'm not saying such things cannot be represented in BNF, just that they tend to swell the BNF significantly. While this is OK for a computer it reduces the utility of the BNF for humans.)

--Tim