Combining two tiny scripts

Combining two tiny scripts

Peter Matulis
Hi, I am very new to Lua but I do have two tiny Pandoc Lua filters that work. They are extremely similar and I would like to combine them. I'm just not sure how to introduce the conditional aspect.

1. This script simply exchanges the '.md' file extension for the '.html' extension:


function Link (link)
  link.target = link.target:gsub('(.+)%.md', '%1.html')
  return link
end



2. This script does the same except it handles the case where the link contains an anchor:


function Link (link)
  link.target = link.target:gsub('(.+)%.md%#(.+)', '%1.html#%2')
  return link
end



/pm

Re: Combining two tiny scripts

Philippe Verdy-2
Note that Lua patterns do not support alternatives with '|', but they support '?' for 0 or 1 occurrence (not just '*' and '+' for 0+ and 1+ occurrences).
Note also that you want to match the dot for the extension literally, not arbitrary characters, so you must use '%.', not '.'. The '#' does not need to be escaped by a '%' in the search pattern (however it still works if you do).
Just create a capture group for the optional anchor (*including* the leading '#').

function Link (link)
  link.target = link.target:gsub('(.+)%.md(#.+)?', '%1.html%2')
  return link
end


Re: Combining two tiny scripts

Albert Krewinkel
In reply to this post by Peter Matulis
Hello Peter,

Peter Matulis <[hidden email]> writes:

> Hi, I am very new to Lua

Welcome :)

> but I do have two tiny Pandoc Lua filters that
> work. They are extremely similar and I would like to combine them. I'm just
> not sure how to introduce the conditional aspect.
>
> 1. This script simply exchanges the '.md' file extension for the '.html'
> extension:
>
>
> function Link (link)
>   link.target = link.target:gsub('(.+)%.md', '%1.html')
>   return link
> end
>
>
> 2. This script does the same except it handles the case where the link
> contains an anchor:
>
>
> function Link (link)
>   link.target = link.target:gsub('(.+)%.md%#(.+)', '%1.html#%2')
>   return link
> end

As far as I can tell, the first version should also work for the second
use case. A pattern may match anywhere in the string (unless "start of
string" or "end of string" are explicitly matched in the pattern).
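
A quick way to check this outside Pandoc, as a minimal sketch: the helper name `to_html` and the sample targets below are made up for illustration and are not part of the original filters.

```lua
-- Hypothetical helper wrapping the substitution from the first filter.
function to_html(target)
  -- gsub returns the new string and a match count; the parentheses keep only the string.
  return (target:gsub('(.+)%.md', '%1.html'))
end

print(to_html('page.md'))        -- page.html
print(to_html('page.md#intro'))  -- page.html#intro
```

The pattern is not anchored, so the '#intro' part is simply left untouched after the match.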

BTW, we are also happy to answer pandoc-related issues over at the
pandoc-discuss mailing list:
https://groups.google.com/forum/#!forum/pandoc-discuss

Cheers,
Albert


--
Albert Krewinkel
GPG: 8eed e3e2 e8c5 6f18 81fe  e836 388d c0b2 1f63 1124

Re: Combining two tiny scripts

Egor Skriptunoff-2
In reply to this post by Peter Matulis
On Thu, Jan 21, 2021 at 7:38 PM Peter Matulis wrote:
1. This script simply exchanges the '.md' file extension for the '.html' extension:

function Link (link)
  link.target = link.target:gsub('(.+)%.md', '%1.html')
  return link
end

To avoid replacing ".md" in the middle of a link (as in "www.md.com"), the "$" anchor is needed.
link.target = link.target:gsub('^(.+)%.md$', '%1.html')

The combination of the two scripts may look like the following:
link.target = link.target:gsub('^(.+)%.md%f[%z#]', '%1.html')
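
The three variants can be compared side by side. This is only a sketch: the helper names and sample targets are invented here, and '\0' is written in place of '%z' on the assumption of Lua 5.2 or later (in Lua 5.1 the class would be written '%z').

```lua
-- Hypothetical helpers, one per pattern discussed above.
function unanchored(t)   return (t:gsub('(.+)%.md', '%1.html')) end
function end_anchored(t) return (t:gsub('^(.+)%.md$', '%1.html')) end
-- %f[#\0] is a frontier: it matches the empty string just before a '#'
-- or at the end of the subject (which is treated as '\0').
function combined(t)     return (t:gsub('^(.+)%.md%f[#\0]', '%1.html')) end

print(unanchored('www.md.com'))       -- www.html.com (the unwanted rewrite)
print(end_anchored('www.md.com'))     -- www.md.com
print(end_anchored('page.md#intro'))  -- page.md#intro (anchor case not handled)
print(combined('page.md'))            -- page.html
print(combined('page.md#intro'))      -- page.html#intro
print(combined('www.md.com'))         -- www.md.com
```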


Re: Combining two tiny scripts

Philippe Verdy-2
In reply to this post by Albert Krewinkel
Ideally the pattern should be anchored at the start with '^' and at the end with '$', to avoid matching in the middle. Take "https://example.org/some.md/something.md#section": the first ".md" is not the expected match. The same problem applies to the link anchor, which always starts at the first occurrence of '#'. If a parent folder in the path has '.md' in its name, I don't think such a transform is valid: you would transform the parent folder but not the extension at the end of the path.

function Link (link)
  link.target = link.target:gsub('^(.+)%.md%#(.+)$', '%1.html#%2')
  return link
end 

In that case, the first and second alternatives given are not equivalent (once '^...$' is added), and you still need the '?' quantifier after the capturing group for the link anchor (including the '#').

In Lua, it is always best to anchor patterns, especially for technical things that may have security implications. Anchoring a pattern at the start is also faster at rejecting false positives: it must match at the first character, otherwise the attempt fails immediately and scanning stops. Anchoring the end avoids premature matches when you want to be sure the full text is covered. That case is slower, but Lua's matcher works directly on the string, moving forward, with no internal buffering: it only keeps a stack of candidate states for quantifiers like '?', '+' or '*', and the depth of that stack is limited by the number of quantifiers in the pattern.



Re: Combining two tiny scripts

Gé Weijers
In reply to this post by Egor Skriptunoff-2
On Thu, Jan 21, 2021 at 9:10 AM Egor Skriptunoff
<[hidden email]> wrote:

> The combination of the two scripts may look like the following:
> link.target = link.target:gsub('^(.+)%.md%f[%z#]', '%1.html')
>

The %z pattern is no longer documented in Lua 5.3 and 5.4 (I did not
check 5.2 or earlier), I assume it's deprecated. I use \0, it's
perfectly legal to put null characters in the middle of Lua strings.


Re: Combining two tiny scripts

Philippe Verdy-2
And here using %z (really deprecated?) or \000 would be incorrect: you don't want to match a NUL character in the source text (which would be invalid anyway anywhere in a JavaScript source, a URL, or HTML using a single-byte character encoding). To match the end of the string, use the '$' anchor only.

Yes, Lua strings can contain NUL bytes. Lua is agnostic about the text encoding and character set used: a Lua string could just as well contain UTF-16 or UTF-32, with many NUL bytes, but usually it holds UTF-8 or a 7-bit or 8-bit encoding such as US-ASCII, ISO-8859-*, or a Windows code page like Win1252. The latter has almost replaced ISO-8859-1 in HTML5, where US-ASCII and ISO-8859-1 became compatibility aliases for Win1252, so that its encoded graphic characters are used instead of the C1 controls that ISO-8859-* reserved. Nobody seems to use C1 controls today, and 7-bit-only charsets, with the many national variants of ISO 646, are deprecated in HTML as well. Other 8-bit charsets are still widely used, notably for Cyrillic, Arabic and Thai, as are the multibyte charsets of East Asia.

If you need to handle and convert legacy data (from various public databases or old interfaces, legacy filesystems like FAT or MacOS without their UTF-16 extensions, old FTP sites, some old telnet services, emails and MIME, or old *.dbf files), you still need some of these charsets. Lua has no native support for these conversions in its string library, so you have to include a conversion library in your project.


Re: Combining two tiny scripts

Gé Weijers
On Thu, Jan 21, 2021 at 10:04 AM Philippe Verdy <[hidden email]> wrote:
>
> and here using %z (really deprecated?) or \000 would be incorrect: you don't want to match a nul character in the source text (which would be invalid anyway anywhere in a javascript source, or URL, or HTML using a single-byte character encoding). To match the end of string use the '$' anchor only.

The frontier pattern uses the null character to match the begin or end
of a string, this is documented.


Re: Combining two tiny scripts

Philippe Verdy-2
Yes, but a string may *embed* NUL bytes that you don't want to match. The fact that strings are also surrounded by NUL bytes just makes matching a bit faster (but the presence of a NUL byte is not necessarily the start or end of a string); in fact the start of the string does not even need to be scanned, since patterns are matched only in the forward direction. So those NUL bytes are just guards that make it easier to pass strings as classic null-terminated C strings to some APIs without always requiring extra buffers. (Those APIs won't work correctly if the strings embed NUL bytes, as they would handle only a truncated value and ignore the rest.)
In general you should avoid code that depends on NUL bytes: patterns using the standard '$' anchor are recommended.

Also note that there are other packages extending the string package to support UTF-8, other multibyte UTFs (which need embedded NUL bytes), or other encodings (notably JIS- and CJK-based ones). Lua strings are just arrays of bytes with little or no interpretation, so they are also usable for storing binary-encoded objects (which is much more efficient and compact than arrays of numbers holding byte values), as Lua offers no other native support for "arbitrary byte buffers". That's why the Lua string package is so limited in functionality (and patterns were also deliberately limited: for PCRE regexps you need an extra package, but it can safely be written using Lua strings as its internal storage).
Finally, "%z" in patterns is just an alias for "\000" (a literal notation that has no meta-meaning by itself in patterns; '$' does have such a meaning and matches 0 characters, whereas "%z" or "\000" matches one byte of actual string content, which will be present in the returned match or capture groups).

So I see no reason to use "%z" or "\000" for this case; it would be an unstable solution.
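
The difference between the two approaches can be observed directly. This is a small sketch with invented sample strings; it writes '\0' inside the frontier set, which assumes Lua 5.2 or later, and it shows how a NUL behaves inside `%f[...]` specifically.

```lua
-- On an ordinary string, the frontier and '$' report the same match span:
-- the frontier itself adds no characters to the match.
print(('page.md'):find('%.md%f[\0]'))  -- 5   7
print(('page.md'):find('%.md$'))       -- 5   7

-- With an embedded NUL the two differ: the frontier also matches just
-- before the NUL byte, while '$' only matches at the true end of the string.
local t = 'page.md\0tail'
print(t:find('%.md%f[\0]'))            -- 5   7
print(t:find('%.md$'))                 -- nil
```

For link targets, which should not contain NUL bytes, both forms behave the same; the embedded-NUL case is where they part ways.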


Re: Combining two tiny scripts

Peter Matulis
In reply to this post by Philippe Verdy-2


On Thu, 21 Jan 2021 at 11:53, Philippe Verdy <[hidden email]> wrote:
Note that Lua patterns do not support alternatives with '|', but they support '?' for 0 or 1 occurrence (not just '*' and '+' for 0+ and 1+ occurrences).
Note also that you want to match the dot for the extension literally, not arbitrary characters, so you must use '%.', not '.'. The '#' does not need to be escaped by a '%' in the search pattern (however it still works if you do).
Just create a capture group for the optional anchor (*including* the leading '#').

function Link (link)
  link.target = link.target:gsub('(.+)%.md(#.+)?', '%1.html%2')
  return link
end


This does not work for me. Neither of the alternatives are acted upon. How should I debug this?

/pm

Re: Combining two tiny scripts

Peter Matulis
It looks like the '?' quantifier can only apply to a single character or class of characters, not a matching pattern like '#.+'
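
One way to debug this is to try the pattern on a few strings with string.match. The sketch below uses invented sample targets; it suggests that in core Lua the '?' after ')' is matched as a literal question mark rather than making the group optional.

```lua
local pattern = '(.+)%.md(#.+)?'

print(('page.md'):match(pattern))         -- nil
print(('page.md#intro'):match(pattern))   -- nil
-- Only a target that ends in a literal '?' matches:
print(('page.md#intro?'):match(pattern))  -- page    #intro

-- '?' does work when it follows a single character or character class:
print(('page.md'):match('%.md#?'))        -- .md
print(('page.md#'):match('%.md#?'))       -- .md#
```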


Re: Combining two tiny scripts

Philippe Verdy-2
Ow, my Lua supports '?' quantifiers on capture groups. Looks like core Lua doesn't.

Alternative (note: the patterns are anchored now at start and end, to match whole string before making any substitution):

link.target = link.target
  :gsub('^(.+)%.md#(.*)$', '%1.html#%2')
  :gsub('^(.+)%.md$', '%1.html')
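
Put back into the filter, the chained substitution might look like the sketch below. The plain tables passed to `Link` are stand-ins for Pandoc's Link elements, used only so the function can be tried outside Pandoc; the sample targets are invented.

```lua
function Link (link)
  link.target = link.target
    :gsub('^(.+)%.md#(.*)$', '%1.html#%2')  -- target with an anchor
    :gsub('^(.+)%.md$', '%1.html')          -- plain target
  return link
end

-- Stand-in for a Pandoc Link element: a table with a target field.
print(Link({target = 'page.md'}).target)        -- page.html
print(Link({target = 'page.md#intro'}).target)  -- page.html#intro
print(Link({target = 'www.md.com'}).target)     -- www.md.com
```

Each gsub returns a string and a count; the method-call chain and the assignment both keep only the string.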
 
 


Re: Combining two tiny scripts

Griffin Rock
In reply to this post by Peter Matulis
target:gsub("%.md%f[%z#]", ".html")
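
Dropped into the original filter, this one-liner could be used as sketched below. Two assumptions are made here: '\0' replaces '%z' (Lua 5.2 or later), and plain tables stand in for Pandoc's Link elements so the function can be exercised on its own.

```lua
function Link (link)
  -- Rewrite '.md' only when it is followed by '#' or by the end of the target.
  link.target = link.target:gsub('%.md%f[#\0]', '.html')
  return link
end

print(Link({target = 'page.md'}).target)                 -- page.html
print(Link({target = 'dir.md/page.md#intro'}).target)    -- dir.md/page.html#intro
print(Link({target = 'www.md.com'}).target)              -- www.md.com
```

Because the pattern has no capture and no '^' anchor, gsub rewrites every '.md' that sits right before a '#' or the end of the string, and leaves any other '.md' alone.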

