mbtowc(3) errno for incomplete character

Discussion:

Ingo Schwarze

2016-02-05 14:53:44 UTC

Hi,

here is a very wide-spread bug, affecting all operating systems i
was able to check.

What happens if an incomplete character is passed to mbtowc(3)?

POSIX is quite clear:

If s is not a null pointer, mbtowc() shall either return 0 (if s
points to the null byte), or return the number of bytes that
constitute the converted character (if the next n or fewer bytes
form a valid character), or return -1 [CX Option Start] and
shall set errno to indicate the error [Option End] (if they do
not form a valid character).

To rephrase the last parenthetic remark: If the next n or fewer
bytes do not form a valid character, errno is set. That clearly
includes the case of an incomplete character (for example, the byte
sequence 0xc3 0x00) and even the case that the final argument is
too small (for example, the byte sequence 0xc3 0xa9, representing
e acute, but with n=1).

Our manual agrees with that. It says:

RETURN VALUES
[...]
-1 s points to an invalid or an incomplete multibyte
character. errno is set to indicate the error.

ERRORS
mbtowc() will set errno in the following cases:

[EILSEQ] s points to an invalid or incomplete multibyte
character.

CAVEATS
On error, callers of mbtowc() cannot tell whether the multibyte
character was invalid or incomplete. To treat incomplete data
differently from invalid data the mbrtowc(3) function can be
used instead.

The NetBSD manual agrees with the OpenBSD manual and POSIX. The
Solaris 11 manual agrees with POSIX 2008, but not with POSIX 2013.
The difference is that XSH/TC1/D5/0373 changed "may set errno" to
"shall set errno", so until three years ago, setting errno was
optional in all cases, even for invalid byte sequences. The FreeBSD
manual is unclear. The Linux manual is completely unclear and
doesn't mention errno at all.

Now obviously, the standard makes a lot of sense. Having a function
that can fail and that sets errno in some failure modes but does
not set errno in other failure modes would be a terrible idea. Such
a function would be ridiculously complicated to use. To detect the
reason for failure, you would have to:

- save errno
- reset errno to zero
- call the function
- inspect the return value to detect failure
- inspect errno to decide about the reason for failure
- if errno is zero, restore the saved errno

That is completely unreasonable, in particular for a seemingly innocous
function like mbtowc(3). Next to no programmer would get that right in
any real-world program.

Now, the problem is that i could find no implementation that actually
implements the standard correctly. All implementations i read or
tested (OpenBSD, NetBSD, FreeBSD, Solaris 11, Linux) behave as follows:

* They fail and set errno when finding an invalid byte sequence.
* They fail but do NOT set errno when finding an incomplete byte sequence.

The following patch fixes this. I worry that by subtly changing
behaviour, even if for the better, it may break ports at run-time.
There may even be interactions with middle-layer libraries. For
example, i'm currently investigating libedit, and it doesn't seem
to use mbtowc(3) quite correctly in this respect. So maybe i should
commit this patch after unlock such that we have more time to mop
up potential fallout.

Thoughts?

Yours,
Ingo

Index: mbtowc.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbtowc.c,v
retrieving revision 1.2
diff -u -p -r1.2 mbtowc.c
--- mbtowc.c 5 Dec 2012 23:20:00 -0000 1.2
+++ mbtowc.c 5 Feb 2016 13:26:17 -0000
@@ -44,7 +44,14 @@ mbtowc(wchar_t * __restrict pwc, const c
return (0);
}
rval = mbrtowc(pwc, s, n, &mbs);
- if (rval == (size_t)-1 || rval == (size_t)-2)
- return (-1);
- return ((int)rval);
+
+ switch (rval) {
+ case (size_t)-2:
+ errno = EILSEQ;
+ /* FALLTHROUGH */
+ case (size_t)-1:
+ return -1;
+ default:
+ return (int)rval;
+ }
}

Stefan Sperling

2016-02-05 15:06:50 UTC

Permalink

Post by Ingo Schwarze
Index: mbtowc.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbtowc.c,v
retrieving revision 1.2
diff -u -p -r1.2 mbtowc.c
--- mbtowc.c 5 Dec 2012 23:20:00 -0000 1.2
+++ mbtowc.c 5 Feb 2016 13:26:17 -0000
@@ -44,7 +44,14 @@ mbtowc(wchar_t * __restrict pwc, const c
return (0);
}
rval = mbrtowc(pwc, s, n, &mbs);
- if (rval == (size_t)-1 || rval == (size_t)-2)
- return (-1);
- return ((int)rval);
+
+ switch (rval) {
+ errno = EILSEQ;

Doesn't mbrtowc(3) already set errno to EILSEQ when it returns (size_t)-2?
At least our man page says it would:

(size_t)-2 s points to an incomplete byte sequence of length n which
has been consumed and contains part of a valid multibyte
character. mbrtowc() sets errno to EILSEQ. The character

Though citrus_utf8.c doesn't seem to do this... hmmm.
Perhaps you should fix mbrtowc() instead?

Post by Ingo Schwarze
+ /* FALLTHROUGH */
+ return -1;
+ return (int)rval;
+ }
}

Ingo Schwarze

2016-02-05 16:03:09 UTC

Permalink

Hi Stefan,

Post by Stefan Sperling

Doesn't mbrtowc(3) already set errno to EILSEQ when it returns (size_t)-2?

No, it doesn't.

Post by Stefan Sperling
(size_t)-2 s points to an incomplete byte sequence of length n which
has been consumed and contains part of a valid multibyte
character. mbrtowc() sets errno to EILSEQ. The character

That's clearly a documentation bug. Patch below.

Yes, our multibyte code and documentation is still full of bugs,
to a degree quite unusual for OpenBSD. That's why we are working
on it. :-/

Post by Stefan Sperling
Though citrus_utf8.c doesn't seem to do this... hmmm.
Perhaps you should fix mbrtowc() instead?

No. POSIX is quite clear that mbrtowc(3) must not set errno
when returning -2 and that returning -2 does not indicate an error:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/mbrtowc.html

Also, setting errno for incomplete characters in mbrtowc(3) wouldn't
make sense because the typical idiom using mbrtowc(3) is to read
byte by byte and try mbrtowc(3) after each byte until it succeeds.
So the typical use case would always clobber errno, which would be
bad. Even internally in our libc, mbrtowc(3) is used like that
at a few places, for example in vfscanf(3).

Yours,
Ingo

Index: mbrtowc.3
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbrtowc.3,v
retrieving revision 1.4
diff -u -p -r1.4 mbrtowc.3
--- mbrtowc.3 5 Jun 2013 03:39:22 -0000 1.4
+++ mbrtowc.3 5 Feb 2016 15:44:33 -0000
@@ -210,10 +210,6 @@ truncated input.
points to an incomplete byte sequence of length
.Fa n
which has been consumed and contains part of a valid multibyte character.
-.Fn mbrtowc
-sets
-.Va errno
-to EILSEQ.
The character may be completed by calling
.Fn mbrtowc
again with
@@ -230,7 +226,7 @@ function may cause an error in the follo
.Bl -tag -width Er
.It Bq Er EILSEQ
.Fa s
-points to an invalid or incomplete multibyte character.
+points to an invalid multibyte character.
.It Bq Er EINVAL
.Fa mbs
points to an invalid or uninitialized mbstate_t object.

Todd C. Miller

2016-02-06 22:33:08 UTC

Permalink

This makes sense to me. It is possible that the author mistakenly
believed the mbrtowc(3) manual.

- todd

Ingo Schwarze

2016-02-06 23:08:05 UTC

Permalink

Hi,

Post by Todd C. Miller
This makes sense to me. It is possible that the author mistakenly
believed the mbrtowc(3) manual.

Absolutely possible. That's why documentation is so important.
Bugs in documenation usually breed bugs in code.

So i guess i'll eventually put this in, unless somebody comes up
with good reasons not to.

But given that naddy@ already declared ports slowdown: Do you ports
people agree with my suspicion that this is unlikely to be responsible
for major issues in current ports, but that it carries a certain risk
of regressions in poorly written software? Do you agree that it
should wait until after unlock, or do you want it right now?

Yours,
Ingo

For easier reference:

Patch to make sure that mbtowc(3) always sets EILSEQ on error,
in particular for incomplete characters:

Index: mbtowc.c
===================================================================
RCS file: /cvs/src/lib/libc/locale/mbtowc.c,v
retrieving revision 1.2
diff -u -p -r1.2 mbtowc.c
--- mbtowc.c 5 Dec 2012 23:20:00 -0000 1.2
+++ mbtowc.c 6 Feb 2016 23:03:57 -0000
@@ -44,7 +44,14 @@ mbtowc(wchar_t * __restrict pwc, const c
return (0);
}
rval = mbrtowc(pwc, s, n, &mbs);
- if (rval == (size_t)-1 || rval == (size_t)-2)
- return (-1);
- return ((int)rval);
+
+ switch (rval) {
+ case (size_t)-2:
+ errno = EILSEQ;
+ /* FALLTHROUGH */
+ case (size_t)-1:
+ return -1;
+ default:
+ return (int)rval;
+ }
}

Christian Weisgerber

2016-02-07 20:05:56 UTC

Permalink

Post by Ingo Schwarze
people agree with my suspicion that this is unlikely to be responsible
for major issues in current ports, but that it carries a certain risk
of regressions in poorly written software? Do you agree that it
should wait until after unlock, or do you want it right now?

It should wait.

--
Christian "naddy" Weisgerber ***@mips.inka.de