Ingo Schwarze
2016-02-05 14:53:44 UTC
Hi,
here is a very wide-spread bug, affecting all operating systems i
was able to check.
What happens if an incomplete character is passed to mbtowc(3)?
POSIX is quite clear:
If s is not a null pointer, mbtowc() shall either return 0 (if s
points to the null byte), or return the number of bytes that
constitute the converted character (if the next n or fewer bytes
form a valid character), or return -1 [CX Option Start] and
shall set errno to indicate the error [Option End] (if they do
not form a valid character).
To rephrase the last parenthetic remark: If the next n or fewer
bytes do not form a valid character, errno is set. That clearly
includes the case of an incomplete character (for example, the byte
sequence 0xc3 0x00) and even the case that the final argument is
too small (for example, the byte sequence 0xc3 0xa9, representing
e acute, but with n=1).
Our manual agrees with that. It says:
RETURN VALUES
[...]
-1 s points to an invalid or an incomplete multibyte
character. errno is set to indicate the error.
ERRORS
mbtowc() will set errno in the following cases:
[EILSEQ] s points to an invalid or incomplete multibyte
character.
CAVEATS
On error, callers of mbtowc() cannot tell whether the multibyte
character was invalid or incomplete. To treat incomplete data
differently from invalid data the mbrtowc(3) function can be
used instead.
The NetBSD manual agrees with the OpenBSD manual and POSIX. The
Solaris 11 manual agrees with POSIX 2008, but not with POSIX 2013.
The difference is that XSH/TC1/D5/0373 changed "may set errno" to
"shall set errno", so until three years ago, setting errno was
optional in all cases, even for invalid byte sequences. The FreeBSD
manual is unclear. The Linux manual is completely unclear and
doesn't mention errno at all.
Now obviously, the standard makes a lot of sense. Having a function
that can fail and that sets errno in some failure modes but does
not set errno in other failure modes would be a terrible idea. Such
a function would be ridiculously complicated to use. To detect the
reason for failure, you would have to:
- save errno
- reset errno to zero
- call the function
- inspect the return value to detect failure
- inspect errno to decide about the reason for failure
- if errno is zero, restore the saved errno
That is completely unreasonable, in particular for a seemingly innocous
function like mbtowc(3). Next to no programmer would get that right in
any real-world program.
Now, the problem is that i could find no implementation that actually
implements the standard correctly. All implementations i read or
tested (OpenBSD, NetBSD, FreeBSD, Solaris 11, Linux) behave as follows:
* They fail and set errno when finding an invalid byte sequence.
* They fail but do NOT set errno when finding an incomplete byte sequence.
The following patch fixes this. I worry that by subtly changing
behaviour, even if for the better, it may break ports at run-time.
There may even be interactions with middle-layer libraries. For
example, i'm currently investigating libedit, and it doesn't seem
to use mbtowc(3) quite correctly in this respect. So maybe i should
commit this patch after unlock such that we have more time to mop
up potential fallout.
Thoughts?
Yours,
Ingo
Index: mbtowc.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbtowc.c,v
retrieving revision 1.2
diff -u -p -r1.2 mbtowc.c
--- mbtowc.c 5 Dec 2012 23:20:00 -0000 1.2
+++ mbtowc.c 5 Feb 2016 13:26:17 -0000
@@ -44,7 +44,14 @@ mbtowc(wchar_t * __restrict pwc, const c
return (0);
}
rval = mbrtowc(pwc, s, n, &mbs);
- if (rval == (size_t)-1 || rval == (size_t)-2)
- return (-1);
- return ((int)rval);
+
+ switch (rval) {
+ case (size_t)-2:
+ errno = EILSEQ;
+ /* FALLTHROUGH */
+ case (size_t)-1:
+ return -1;
+ default:
+ return (int)rval;
+ }
}
here is a very wide-spread bug, affecting all operating systems i
was able to check.
What happens if an incomplete character is passed to mbtowc(3)?
POSIX is quite clear:
If s is not a null pointer, mbtowc() shall either return 0 (if s
points to the null byte), or return the number of bytes that
constitute the converted character (if the next n or fewer bytes
form a valid character), or return -1 [CX Option Start] and
shall set errno to indicate the error [Option End] (if they do
not form a valid character).
To rephrase the last parenthetic remark: If the next n or fewer
bytes do not form a valid character, errno is set. That clearly
includes the case of an incomplete character (for example, the byte
sequence 0xc3 0x00) and even the case that the final argument is
too small (for example, the byte sequence 0xc3 0xa9, representing
e acute, but with n=1).
Our manual agrees with that. It says:
RETURN VALUES
[...]
-1 s points to an invalid or an incomplete multibyte
character. errno is set to indicate the error.
ERRORS
mbtowc() will set errno in the following cases:
[EILSEQ] s points to an invalid or incomplete multibyte
character.
CAVEATS
On error, callers of mbtowc() cannot tell whether the multibyte
character was invalid or incomplete. To treat incomplete data
differently from invalid data the mbrtowc(3) function can be
used instead.
The NetBSD manual agrees with the OpenBSD manual and POSIX. The
Solaris 11 manual agrees with POSIX 2008, but not with POSIX 2013.
The difference is that XSH/TC1/D5/0373 changed "may set errno" to
"shall set errno", so until three years ago, setting errno was
optional in all cases, even for invalid byte sequences. The FreeBSD
manual is unclear. The Linux manual is completely unclear and
doesn't mention errno at all.
Now obviously, the standard makes a lot of sense. Having a function
that can fail and that sets errno in some failure modes but does
not set errno in other failure modes would be a terrible idea. Such
a function would be ridiculously complicated to use. To detect the
reason for failure, you would have to:
- save errno
- reset errno to zero
- call the function
- inspect the return value to detect failure
- inspect errno to decide about the reason for failure
- if errno is zero, restore the saved errno
That is completely unreasonable, in particular for a seemingly innocous
function like mbtowc(3). Next to no programmer would get that right in
any real-world program.
Now, the problem is that i could find no implementation that actually
implements the standard correctly. All implementations i read or
tested (OpenBSD, NetBSD, FreeBSD, Solaris 11, Linux) behave as follows:
* They fail and set errno when finding an invalid byte sequence.
* They fail but do NOT set errno when finding an incomplete byte sequence.
The following patch fixes this. I worry that by subtly changing
behaviour, even if for the better, it may break ports at run-time.
There may even be interactions with middle-layer libraries. For
example, i'm currently investigating libedit, and it doesn't seem
to use mbtowc(3) quite correctly in this respect. So maybe i should
commit this patch after unlock such that we have more time to mop
up potential fallout.
Thoughts?
Yours,
Ingo
Index: mbtowc.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbtowc.c,v
retrieving revision 1.2
diff -u -p -r1.2 mbtowc.c
--- mbtowc.c 5 Dec 2012 23:20:00 -0000 1.2
+++ mbtowc.c 5 Feb 2016 13:26:17 -0000
@@ -44,7 +44,14 @@ mbtowc(wchar_t * __restrict pwc, const c
return (0);
}
rval = mbrtowc(pwc, s, n, &mbs);
- if (rval == (size_t)-1 || rval == (size_t)-2)
- return (-1);
- return ((int)rval);
+
+ switch (rval) {
+ case (size_t)-2:
+ errno = EILSEQ;
+ /* FALLTHROUGH */
+ case (size_t)-1:
+ return -1;
+ default:
+ return (int)rval;
+ }
}