utf8.len and BOM

utf8.len and BOM

Aapo Talvensaari
Is it by design that utf8.len counts the BOM toward the length?

For example, utf8.len("\xEF\xBB\xBFa") returns 2 instead of 1.

Regards
Aapo
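
A minimal sketch of the behavior in question, assuming Lua 5.3 or later (where the utf8 library exists). The BOM is the code point U+FEFF encoded as three bytes, so utf8.len counts it like any other character; passing a start position of 4 is one way to count from the first byte after it.

```lua
-- Assumes Lua 5.3+ (utf8 library and "\x" escapes).
local s = "\xEF\xBB\xBFa"         -- UTF-8 BOM (U+FEFF) followed by "a"

local bytes  = #s                 -- 4 bytes: EF BB BF 61
local chars  = utf8.len(s)        -- 2 code points: U+FEFF and U+0061
local first  = utf8.codepoint(s)  -- 65279, i.e. 0xFEFF
local no_bom = utf8.len(s, 4)     -- 1: counting starts at byte 4, after the BOM

print(bytes, chars, first, no_bom)
```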

Re: utf8.len and BOM

Aapo Talvensaari
Well, forget about it. Counting the BOM is the correct way to do it.

Re: utf8.len and BOM

Rob Kendrick-2
In reply to this post by Aapo Talvensaari
On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
> Is it by design that utf.len count the BOM to length?
>
> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?

Given UTF8 has only one valid "byte order", it makes no sense to ever
include a byte order marker in a UTF8 document.

B.

Re: utf8.len and BOM

Coda Highland
On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <[hidden email]> wrote:
> On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
>> Is it by design that utf.len count the BOM to length?
>>
>> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
>
> Given UTF8 has only one valid "byte order", it makes no sense to ever
> include a byte order marker in a UTF8 document.
>

Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
Microsoft) as a magic number to identify the contents of the file as
UTF-8 text. The XML spec even explicitly supports this (although many
XML parsers do not).

/s/ Adam
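
A rough sketch of the "magic number" use described above, assuming Lua 5.3 or later. The helper split_bom is hypothetical (it is not part of any standard library): it reports whether the data starts with the three-byte UTF-8 signature and returns the data with that signature removed.

```lua
-- Assumes Lua 5.3+ (utf8 library and "\x" escapes).
local BOM = "\xEF\xBB\xBF"  -- U+FEFF encoded as UTF-8

-- Hypothetical helper: returns (had_bom, data_without_bom).
local function split_bom(data)
  if data:sub(1, #BOM) == BOM then
    return true, data:sub(#BOM + 1)
  end
  return false, data
end

local had_bom, text = split_bom("\xEF\xBB\xBFhello")
print(had_bom, text, utf8.len(text))  -- had_bom is true, text is "hello", length is 5
```

Note that the absence of the signature says nothing either way: the data may still be UTF-8.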

Re: utf8.len and BOM

Rob Kendrick-2
On Fri, Jan 16, 2015 at 09:17:08AM -0800, Coda Highland wrote:

> On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <[hidden email]> wrote:
> > On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
> >> Is it by design that utf.len count the BOM to length?
> >>
> >> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
> >
> > Given UTF8 has only one valid "byte order", it makes no sense to ever
> > include a byte order marker in a UTF8 document.
> >
>
> Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
> Microsoft) as a magic number to identify the contents of the file as
> UTF-8 text.

Lots of things aggressively promoted by Microsoft are mistakes.

No BOM -> content is UTF8-encoded.


> The XML spec even explicitly supports this (although many
> XML parsers do not).

Probably because they have to deal with people using Microsoft text
editors, not because it's a recommendation.

B.

Re: utf8.len and BOM

Marco Mastropaolo
In reply to this post by Coda Highland
From the Unicode standard:

>> The serialized order of the bytes must not depart from the order defined by the UTF-8
>> encoding form. Use of a BOM is neither required nor recommended for UTF-8, but may
>> be encountered in contexts where UTF-8 data is converted from other encoding forms that
>> use a BOM or where the BOM is used as a UTF-8 signature.

But, in a nutshell: having a BOM breaks Unix utilities, while not having it might break Windows ones.

Re: utf8.len and BOM

Coda Highland
In reply to this post by Rob Kendrick-2
On Fri, Jan 16, 2015 at 9:21 AM, Rob Kendrick <[hidden email]> wrote:

> On Fri, Jan 16, 2015 at 09:17:08AM -0800, Coda Highland wrote:
>> On Fri, Jan 16, 2015 at 4:53 AM, Rob Kendrick <[hidden email]> wrote:
>> > On Fri, Jan 16, 2015 at 12:11:41PM +0000, Aapo Talvensaari wrote:
>> >> Is it by design that utf.len count the BOM to length?
>> >>
>> >> Say utf8.len("\xEF\xBB\xBFa") will return 2 instead of 1?
>> >
>> > Given UTF8 has only one valid "byte order", it makes no sense to ever
>> > include a byte order marker in a UTF8 document.
>> >
>>
>> Sure it does -- the UTF-8 BOM is used (and aggressively promoted by
>> Microsoft) as a magic number to identify the contents of the file as
>> UTF-8 text.
>
> Lots of things aggressively promoted by Microsoft are mistakes.
>
> No BOM -> content is UTF8-encoded.
>

I didn't say it was a good idea. ;) Only that there were circumstances
under which it could, theoretically, make sense.

That said, it still pisses me off how many common XML parsers bomb out
on something supported by the spec.

/s/ Adam

Re: utf8.len and BOM

Luiz Henrique de Figueiredo
In reply to this post by Rob Kendrick-2
> Probably because they have to deal with people using Microsoft text
> editors, not because it's a recommendation.

That's why Lua also skips the BOM when loading text files.
But it is a practical concession.
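
A small sketch of that concession, assuming Lua 5.2 or later and a platform where os.tmpname returns a writable path. It writes a chunk that begins with the UTF-8 BOM to a temporary file and loads it with loadfile, which skips the signature before compiling.

```lua
-- Assumes Lua 5.2+ and a writable temporary path from os.tmpname().
local path = os.tmpname()

-- Write a source file that starts with the UTF-8 BOM.
local f = assert(io.open(path, "wb"))
f:write("\xEF\xBB\xBFreturn 42\n")
f:close()

-- loadfile skips the leading BOM, so the chunk compiles normally.
local chunk = assert(loadfile(path))
local result = chunk()
os.remove(path)

print(result)  -- 42
```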