Feature request: "u" option to file:read

Dirk Laurie-2
"u": reads one or more bytes forming one UTF-8 character, and returns
that character as a string. Returns nil if the file at the current
position does not start with a valid UTF-8 sequence.


Re: Feature request: "u" option to file:read

Luiz Henrique de Figueiredo
> "u": reads one or more bytes forming one UTF-8 character, and returns
> that character as a string. Returns nil if the file at the current
> position does not start with a valid UTF-8 sequence.

Can this be done without having to unget more than one byte from the stream?


Re: Feature request: "u" option to file:read

Charles Heywood
No, because a valid UTF-8 sequence can be invalidated multiple bytes in.
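
A small illustration of that point, using the standard utf8 library (Lua 5.3 or later). The byte strings are made-up examples, not taken from any real file: the first two bytes look fine, and the sequence only turns out to be invalid at the third byte.

```lua
-- U+20AC (the euro sign) is encoded in UTF-8 as the bytes E2 82 AC.
local complete  = "\xE2\x82\xAC"  -- lead byte plus two continuation bytes
local truncated = "\xE2\x82"      -- nothing wrong yet, but one byte is missing
local broken    = "\xE2\x82A"     -- third byte is ASCII "A", not a continuation

-- utf8.len returns the character count, or a false value plus the
-- position where the first invalid sequence starts.
print(utf8.len(complete))
print(utf8.len(truncated))
print(utf8.len(broken))
```

A reader that has already taken E2 and 82 from the stream would need to give back two or three bytes to restore the original position, which is more than a single unget provides.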

On Fri, Feb 23, 2018 at 8:46 AM Luiz Henrique de Figueiredo <[hidden email]> wrote:
> > "u": reads one or more bytes forming one UTF-8 character, and returns
> > that character as a string. Returns nil if the file at the current
> > position does not start with a valid UTF-8 sequence.
>
> Can this be done without having to unget more than one byte from the stream?

--
--
Ryan | Charles <[hidden email]>
Software Developer / System Administrator

Re: Feature request: "u" option to file:read

Coda Highland
If we guarantee that "u" always consumes the character, and it just
returns nil if that character happens to be invalid, then it works. If
the current UTF-8 sequence is invalid (and therefore the read returns
nil), then the byte that you thought was going to be a continuation
byte but in fact was not is the only one that needs to be pushed back.

If we have to be able to put the read pointer back where it was before
the read in case the character is invalid (e.g. so a different
function could read out the raw bytes) then that couldn't be done with
a single unget.

/s/ Adam
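
The "always consume" semantics described above can be sketched in plain Lua. This is only an illustration under assumptions: `utf8reader` is a hypothetical helper, not part of the io library, and since Lua exposes no ungetc it keeps its own one-byte pushback slot. It also returns a second value ("eof" or "invalid") so the two nil cases can be told apart, which the original proposal does not specify.

```lua
-- Hypothetical helper: wraps a file handle and returns a function that
-- reads one UTF-8 character per call, using at most one byte of pushback.
function utf8reader(f)
  local pushed                 -- at most one byte waiting to be re-read

  local function getbyte()
    local c = pushed
    pushed = nil
    return c or f:read(1)
  end

  return function()
    local lead = getbyte()
    if not lead then return nil, "eof" end
    local b = lead:byte()
    local need                 -- number of continuation bytes expected
    if b < 0x80 then
      return lead              -- ASCII, complete on its own
    elseif b >= 0xC2 and b <= 0xDF then need = 1
    elseif b >= 0xE0 and b <= 0xEF then need = 2
    elseif b >= 0xF0 and b <= 0xF4 then need = 3
    else
      return nil, "invalid"    -- stray continuation byte or bad lead byte
    end
    local buf = lead
    for _ = 1, need do
      local c = getbyte()
      if not c then return nil, "invalid" end   -- truncated at end of file
      local cb = c:byte()
      if cb < 0x80 or cb > 0xBF then
        pushed = c             -- the single unget: only the offending byte
        return nil, "invalid"  -- bytes read before it stay consumed
      end
      buf = buf .. c
    end
    -- Let utf8.len reject what the byte ranges above do not catch,
    -- such as overlong encodings.
    if utf8.len(buf) == 1 then return buf end
    return nil, "invalid"
  end
end
```

Because the reader never seeks, it would also work on a stream that cannot be repositioned. The cost is that the caller cannot re-read the discarded bytes after a failure.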

On Fri, Feb 23, 2018 at 8:48 AM, Charles Heywood <[hidden email]> wrote:

> No, because a valid UTF-8 sequence can be invalidated multiple bytes in.
>
> On Fri, Feb 23, 2018 at 8:46 AM Luiz Henrique de Figueiredo
> <[hidden email]> wrote:
>>
>> > "u": reads one or more bytes forming one UTF-8 character, and returns
>> > that character as a string. Returns nil if the file at the current
>> > position does not start with a valid UTF-8 sequence.
>>
>> Can this be done without having to unget more than one byte from the
>> stream?
>>
> --
> --
> Ryan | Charles <[hidden email]>
> Software Developer / System Administrator
> https://hashbang.sh


Re: Feature request: "u" option to file:read

Roberto Ierusalimschy
> If we have to be able to put the read pointer back where it was before
> the read in case the character is invalid (e.g. so a different
> function could read out the raw bytes) then that couldn't be done with
> a single unget.

The reading of numbers already poses a similar problem, and Lua solves
it the simple way. From the manual:

  "n": reads a numeral and returns it as a float or an integer,
  following the lexical conventions of Lua. (The numeral may have leading
  spaces and a sign.)  This format always reads the longest input sequence
  that is a valid prefix for a numeral; if that prefix does not form a
  valid numeral (e.g., an empty string, "0x", or "3.4e-"), it is discarded
  and the function returns nil.

-- Roberto
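
The behaviour quoted from the manual can be observed with a short script. This is a sketch assuming Lua 5.3 or later and a platform where io.tmpfile works; the input string is an arbitrary example.

```lua
local f = assert(io.tmpfile())
f:write("3.4e-abc")
f:seek("set", 0)

local n = f:read("n")      -- "3.4e-" is a valid prefix but not a numeral
local rest = f:read("a")   -- whatever remains after the discarded prefix

print(n, rest)
f:close()
```

Here the "n" read should fail and the following read should see only "abc": the prefix "3.4e-" has been consumed and is not returned to the stream.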


Re: Feature request: "u" option to file:read

Dirk Laurie-2
In reply to this post by Luiz Henrique de Figueiredo
2018-02-23 16:46 GMT+02:00 Luiz Henrique de Figueiredo <[hidden email]>:
>> "u": reads one or more bytes forming one UTF-8 character, and returns
>> that character as a string. Returns nil if the file at the current
>> position does not start with a valid UTF-8 sequence.
>
> Can this be done without having to unget more than one byte from the stream?

My rough workaround in Lua is this:

function readutf8(f)
  local pos = f:seek()
  local len = math.min(4,f:seek"end"-pos)
  f:seek("set",pos)
  local buf = f:read(len)
  while utf8.len(buf)~=1 and #buf>1 do buf = buf:sub(1,-2) end
  if utf8.len(buf)==1 then
    f:seek("set",pos+#buf)
    return buf
  end
  f:seek("set",pos)
end

This has only one unget, at the expense of always reading four bytes
except when the file is shorter.


Re: Feature request: "u" option to file:read

Dirk Laurie-2
2018-02-23 18:00 GMT+02:00 Dirk Laurie <[hidden email]>:

> 2018-02-23 16:46 GMT+02:00 Luiz Henrique de Figueiredo <[hidden email]>:
>>> "u": reads one or more bytes forming one UTF-8 character, and returns
>>> that character as a string. Returns nil if the file at the current
>>> position does not start with a valid UTF-8 sequence.
>>
>> Can this be done without having to unget more than one byte from the stream?
>
> My rough workaround in Lua is this:
>
> function readutf8(f)
>   local pos = f:seek()
>   local len = math.min(4,f:seek"end"-pos)
>   f:seek("set",pos)
>   local buf = f:read(len)
>   while utf8.len(buf)~=1 and #buf>1 do buf = buf:sub(1,-2) end
>   if utf8.len(buf)==1 then
>     f:seek("set",pos+#buf)
>     return buf
>   end
>   f:seek("set",pos)
> end
>
> This has only one unget, at the expense of always reading four bytes
> except when the file is shorter.

I have just read "man getchar" and saw that one can't unget the result
of "fgets". You have to do it one byte at a time with ungetc, and only
one byte of pushback is guaranteed.

Sorry for the noise.


Re: Feature request: "u" option to file:read

Dirk Laurie-2
In reply to this post by Roberto Ierusalimschy
2018-02-23 17:54 GMT+02:00 Roberto Ierusalimschy <[hidden email]>:

>> If we have to be able to put the read pointer back where it was before
>> the read in case the character is invalid (e.g. so a different
>> function could read out the raw bytes) then that couldn't be done with
>> a single unget.
>
> The reading of numbers already poses a similar problem, and Lua solves
> it the simple way. From the manual:
>
>   "n": reads a numeral and returns it as a float or an integer,
>   following the lexical conventions of Lua. (The numeral may have leading
>   spaces and a sign.)  This format always reads the longest input sequence
>   that is a valid prefix for a numeral; if that prefix does not form a
>   valid numeral (e.g., an empty string, "0x", or "3.4e-"), it is discarded
>   and the function returns nil.

I'll be happy with something similar.


Re: Feature request: "u" option to file:read

Sean Conner
In reply to this post by Dirk Laurie-2
It was thus said that the Great Dirk Laurie once stated:

> 2018-02-23 16:46 GMT+02:00 Luiz Henrique de Figueiredo <[hidden email]>:
> >> "u": reads one or more bytes forming one UTF-8 character, and returns
> >> that character as a string. Returns nil if the file at the current
> >> position does not start with a valid UTF-8 sequence.
> >
> > Can this be done without having to unget more than one byte from the stream?
>
> My rough workaround in Lua is this:
>
> function readutf8(f)
>   local pos = f:seek()
>   local len = math.min(4,f:seek"end"-pos)
>   f:seek("set",pos)
>   local buf = f:read(len)
>   while utf8.len(buf)~=1 and #buf>1 do buf = buf:sub(1,-2) end
>   if utf8.len(buf)==1 then
>     f:seek("set",pos+#buf)
>     return buf
>   end
>   f:seek("set",pos)
> end

  Just a note---you cannot seek on non-files like stdin (that has not been
redirected) or a network connection.  This function can therefore lose data
on such files.

  -spc
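
One way to guard against that case is to probe the handle before relying on seek. A minimal sketch, where `is_seekable` is a hypothetical name and the exact behaviour of seek on terminals and sockets is platform-dependent:

```lua
-- Hypothetical helper: reports whether f:seek appears to work on this handle.
function is_seekable(f)
  -- Seeking 0 bytes from the current position leaves the position unchanged.
  -- On success f:seek returns the position; on failure it returns nil
  -- plus an error message.
  return f:seek("cur", 0) ~= nil
end
```

A seek-based reader could call this first and fall back to byte-at-a-time reading when it returns false.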


Re: Feature request: "u" option to file:read

Dirk Laurie-2
2018-02-23 23:33 GMT+02:00 Sean Conner <[hidden email]>:

> It was thus said that the Great Dirk Laurie once stated:
>> 2018-02-23 16:46 GMT+02:00 Luiz Henrique de Figueiredo <[hidden email]>:
>> >> "u": reads one or more bytes forming one UTF-8 character, and returns
>> >> that character as a string. Returns nil if the file at the current
>> >> position does not start with a valid UTF-8 sequence.
>> >
>> > Can this be done without having to unget more than one byte from the stream?
>>
>> My rough workaround in Lua is this:
>>
>> function readutf8(f)
>>   local pos = f:seek()
>>   local len = math.min(4,f:seek"end"-pos)
>>   f:seek("set",pos)
>>   local buf = f:read(len)
>>   while utf8.len(buf)~=1 and #buf>1 do buf = buf:sub(1,-2) end
>>   if utf8.len(buf)==1 then
>>     f:seek("set",pos+#buf)
>>     return buf
>>   end
>>   f:seek("set",pos)
>> end
>
>   Just a note---you cannot seek on non-files like stdin (that has not been
> redirected) or a network connection.  This function can therefore lose data
> on such files.

"rough workaround"