On Tue, 2013-03-19 at 12:24 +0100, Nicolas Mailhot wrote:
> On Tue, 19 March 2013 at 11:38, Ian Malone wrote:
> > and holding up the release for what is basically a triviality seems a
> > bit silly.
> The perception that correct UTF-8 handling is a triviality that should
> be worked on at some later date is the reason we have this breakage now.
No. As I understand it, this bug would have happened even if we were
still in the 20th century and using the legacy 8-bit encodings.
We have an 'is it text?' function which arbitrarily allows at most 2% of
bytes to be >= 0x80. Which means that even in ISO8859-1, a file
containing just the words "Schrödinger's Cat" (17 bytes, one of them
>= 0x80, so about 6%) wouldn't be considered to be text.
It's just broken; it's not even UTF-8 specific. In fact, UTF-8 makes
things *easier* because you can check for valid UTF-8 byte sequences
instead of just bytes >= 0x80.
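
To make the difference concrete, here is a rough Python sketch of the two
approaches. It is not the actual code in question: looks_like_text_ratio()
is a hypothetical stand-in for the 2% heuristic as I described it, and
looks_like_text_utf8() is just one way the sequence check could look.

```python
def looks_like_text_ratio(data: bytes, max_high_fraction: float = 0.02) -> bool:
    """Hypothetical stand-in for the heuristic: text if <= 2% of bytes are >= 0x80."""
    if not data:
        return True
    high = sum(1 for b in data if b >= 0x80)
    return high / len(data) <= max_high_fraction


def looks_like_text_utf8(data: bytes) -> bool:
    """Assumed alternative: text if the bytes form valid UTF-8 sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "Schrödinger's Cat"
latin1 = sample.encode("iso-8859-1")  # 17 bytes, 1 byte >= 0x80
utf8 = sample.encode("utf-8")         # 18 bytes, 2 bytes >= 0x80

print(looks_like_text_ratio(latin1))  # ratio check on the ISO8859-1 file
print(looks_like_text_ratio(utf8))    # ratio check on the UTF-8 file
print(looks_like_text_utf8(utf8))     # sequence check on the UTF-8 file
```

The ratio check rejects both encodings of the sample, while the sequence
check accepts the UTF-8 one. Note that the sequence check on its own would
still reject the ISO8859-1 bytes, so it only helps where files really are
UTF-8.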
--
dwmw2