System.Text.Json.Utf8JsonWriter - how to prevent breaking Unicode characters into escape sequences?

I have a JsonNode to write to a file. The JSON contains a string with a special character in it: "🐕".

It's written as "\uD83D\uDC15", which is not what I want. JSON files support UTF-8, and "🐕" is a valid Unicode code point (U+1F415) that UTF-8 encodes in 4 bytes: 0xF0 0x9F 0x90 0x95.

Instead, it gets expanded to 12 bytes, just in case I wanted to edit it on an old terminal from the 80s. I'm not interested; I edit the file in Visual Studio Code.

How can I force the writer to skip such unwanted translations?

BTW, the file is deserialized correctly: the deserialized string contains the valid Unicode code point. So the application basically works, but I'm very curious how to change the serialization behavior.

In case someone's too curious about the code, here it is:

public virtual void Save(JsonNode node, Stream stream) {
    if (node is null) return;
    using var writer = new Utf8JsonWriter(stream, WriterOptions);
    node.WriteTo(writer, SerializerOptions);
}

...where WriterOptions:

public JsonWriterOptions WriterOptions { get; } = new() { Indented = true, SkipValidation = true };

...and SerializerOptions:

public JsonSerializerOptions SerializerOptions { get; } = new() { WriteIndented = true };

Here's an example project showing the issue: https://github.com/HTD/JsonNodeUtfIssue/blob/master/JsonNodeUtfIssue/Program.cs

https://dotnetfiddle.net/73RxAd

1 answer

  • answered 2022-01-26 12:16 Harry

    Here's my workaround: a decoding stream.

    When Utf8JsonWriter writes UTF-8 bytes to the stream, my Utf8DecodeStream searches for Unicode escape sequences, decodes them to UTF-8 bytes, and writes those bytes instead of the original escape sequences.

    It's relatively fast, because it avoids regular expressions, string search and replacement, string-to-number conversions, and so on.

    It operates directly on the binary data (the buffer passed to Write). It may fail to replace a sequence that is split between two writes. In that case the file is not damaged; that one sequence is just left unchanged.

    OK, there is one special case: if a block boundary splits a surrogate pair (two consecutive "\uXXXX" escapes that together encode one code point above U+FFFF) into its two 16-bit halves, the result is an invalid decoding, because two 16-bit chars converted to UTF-8 separately and then concatenated do not produce the valid 4-byte UTF-8 encoding of the code point.
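
    To make the special case concrete, here is a minimal sketch. The SurrogateDemo class and its method names are hypothetical, made up for this illustration. It encodes the two 16-bit halves of "🐕" once as a pair and once separately; Encoding.UTF8 substitutes U+FFFD for a lone surrogate, so the separate path does not reproduce the 4-byte form.

    ```csharp
    using System;
    using System.Text;

    public static class SurrogateDemo {

        // Encodes both UTF-16 halves of U+1F415 together, as one surrogate pair.
        public static byte[] EncodeTogether() =>
            Encoding.UTF8.GetBytes(new[] { '\uD83D', '\uDC15' });

        // Encodes each half on its own and concatenates the results,
        // which mimics a pair that was split between two writes.
        public static byte[] EncodeSeparately() {
            byte[] high = Encoding.UTF8.GetBytes(new[] { '\uD83D' });
            byte[] low = Encoding.UTF8.GetBytes(new[] { '\uDC15' });
            byte[] joined = new byte[high.Length + low.Length];
            high.CopyTo(joined, 0);
            low.CopyTo(joined, high.Length);
            return joined;
        }
    }
    ```

    Printing both arrays with BitConverter.ToString makes the difference visible: the first is the 4-byte encoding of the dog, the second is two replacement characters.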

    BTW, I don't know whether Utf8JsonWriter ever breaks writes in the middle of a string. It might write whole lines, or at least whole JSON tokens, so the problems mentioned above might never occur.
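
    If the writer did split strings between writes, one way to cope would be to hold back a possibly incomplete tail and prepend it to the next write. The helper below is only a sketch under that assumption: EscapeTail and HeldBackLength are hypothetical names, not part of the class further down, and it assumes the uppercase "\uXXXX" format. It reports how many trailing bytes look like a cut-off escape, plus a complete high surrogate that is still waiting for its partner.

    ```csharp
    using System;

    public static class EscapeTail {

        // Returns how many bytes at the end of 'data' should be held back
        // and prepended to the next write.
        public static int HeldBackLength(ReadOnlySpan<byte> data) {
            int held = 0;
            // A cut-off escape is at most 5 bytes long: "\", "\u", ... "\uXXX".
            for (int back = 1; back <= 5 && back <= data.Length; back++) {
                int start = data.Length - back;
                if (data[start] != '\\') continue;
                if (IsEscapePrefix(data.Slice(start))) held = back;
                break; // Only the last backslash can start a cut-off escape.
            }
            // A complete high surrogate right before the cut still needs its low half.
            int pairStart = data.Length - held - 6;
            if (pairStart >= 0 && IsHighSurrogateEscape(data.Slice(pairStart, 6))) held += 6;
            return held;
        }

        // True if 'tail' (starting with a backslash) could be the beginning of "\uXXXX".
        private static bool IsEscapePrefix(ReadOnlySpan<byte> tail) {
            if (tail.Length > 1 && tail[1] != 'u') return false;
            for (int i = 2; i < tail.Length; i++)
                if (!IsUpperHex(tail[i])) return false;
            return true;
        }

        // True for "\uD800" through "\uDBFF", the high surrogate range.
        private static bool IsHighSurrogateEscape(ReadOnlySpan<byte> six) =>
            six[0] == '\\' && six[1] == 'u' && six[2] == 'D'
            && (six[3] == '8' || six[3] == '9' || six[3] == 'A' || six[3] == 'B')
            && IsUpperHex(six[4]) && IsUpperHex(six[5]);

        private static bool IsUpperHex(byte b) =>
            (b >= '0' && b <= '9') || (b >= 'A' && b <= 'F');
    }
    ```

    A Write override could forward everything except the held-back bytes and keep the rest in a small field until the next call or Flush. Holding back a few bytes that turn out to be something else is harmless; it only delays them.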

    It's worth noting that the class relies on the escape format generated by Utf8JsonWriter, so, for speed, it doesn't decode sequences starting with "\U" or containing lowercase hexadecimal digits. Support for other formats can easily be added.

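    CAUTION: Utf8JsonWriter escapes these characters for a reason, and that reason is security. Do not decode them if that might make the application vulnerable. Be aware that the default encoder also writes some ASCII characters as "\uXXXX" escapes, including the double quote ("\u0022"); turning those back into raw characters produces invalid JSON.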

    /// <summary>
    /// Decodes the Unicode escape sequences while writing UTF-8 stream.
    /// </summary>
    /// <remarks>
    /// This is a workaround for a <see cref="Utf8JsonWriter"/> not doing it on its own.
    /// </remarks>
    public class Utf8DecodeStream : Stream {
    
        /// <summary>
        /// Creates a Unicode escape sequence decoding stream over a writeable stream.
        /// </summary>
        /// <param name="stream">A writeable stream.</param>
        public Utf8DecodeStream(Stream stream) => InnerStream = stream;
    
    #pragma warning disable CS1591
    
        public override bool CanRead => InnerStream.CanRead;
    
        public override bool CanSeek => InnerStream.CanSeek;
    
        public override bool CanWrite => InnerStream.CanWrite;
    
        public override long Length => InnerStream.Length;
    
        public override long Position { get => InnerStream.Position; set => InnerStream.Position = value; }
    
        public override void Flush() => InnerStream.Flush();
    
        public override int Read(byte[] buffer, int offset, int count) => InnerStream.Read(buffer, offset, count);
    
        public override long Seek(long offset, SeekOrigin origin) => InnerStream.Seek(offset, origin);
    
        public override void SetLength(long value) => InnerStream.SetLength(value);
    
    #pragma warning restore CS1591
    
        /// <summary>
        /// Writes the buffer with the Unicode sequences decoded.
        /// </summary>
        /// <param name="buffer">Buffer to write.</param>
        /// <param name="offset">Position in the buffer to start.</param>
        /// <param name="count">Number of bytes to write.</param>
        public override void Write(byte[] buffer, int offset, int count) {
            int end = offset + count; // Exclusive end of the valid data; the array itself may be longer.
            int start = offset;       // First byte not yet written to the inner stream.
            int i = offset;
            while (i < end) {
                // Only a backslash can start an escape sequence, so skip everything else cheaply.
                if (buffer[i] != '\\') { i++; continue; }
                if (DecodeUtf8Sequence(buffer, i, end, out var sequence, out var bytesConsumed)) {
                    InnerStream.Write(buffer, start, i - start);
                    InnerStream.Write(sequence);
                    i += bytesConsumed;
                    start = i;
                }
                // Skip the backslash together with the character it escapes,
                // so that the second half of "\\" is never mistaken for the start of "\uXXXX".
                else i += 2;
            }
            InnerStream.Write(buffer, start, end - start);
        }
    
        /// <summary>
        /// Tries to decode one or more subsequent Unicode escape sequences into UTF-8 bytes.
        /// </summary>
        /// <param name="buffer">A buffer to decode.</param>
        /// <param name="index">An index to start decoding from.</param>
        /// <param name="end">The exclusive end index of the valid data in the buffer.</param>
        /// <param name="result">An array containing UTF-8 representation of the sequence.</param>
        /// <param name="bytesConsumed">The length of the matched escape sequence.</param>
        /// <returns>True if one or more subsequent Unicode escape sequences is found.</returns>
        private static bool DecodeUtf8Sequence(byte[] buffer, int index, int end, out byte[] result, out int bytesConsumed) {
            bytesConsumed = 0;
            result = Array.Empty<byte>();
            List<char> parts = new(2);
            while (DecodeChar(buffer, index, end, out var part)) {
                parts.Add(part);
                index += 6;
                bytesConsumed += 6;
            }
            if (parts.Count < 1) return false;
            result = Encoding.UTF8.GetBytes(parts.ToArray());
            return true;
        }
    
        /// <summary>
        /// Tries to decode a single Unicode escape sequence.
        /// </summary>
        /// <remarks>
        /// "\uXXXX" format is assumed for <see cref="Utf8JsonWriter"/> output.
        /// Escapes of ASCII characters are deliberately left untouched.
        /// </remarks>
        /// <param name="buffer">A buffer to decode.</param>
        /// <param name="index">An index to start decoding from.</param>
        /// <param name="end">The exclusive end index of the valid data in the buffer.</param>
        /// <param name="result">Decoded character.</param>
        /// <returns>True if a single non-ASCII Unicode sequence is found at specified index.</returns>
        private static bool DecodeChar(byte[] buffer, int index, int end, out char result) {
            result = (char)0;
            // The whole 6-byte "\uXXXX" sequence must fit inside the valid data.
            if (index + 6 > end || buffer[index] != '\\' || buffer[index + 1] != 'u') return false;
            int charCode = 0;
            for (int i = 0; i < 4; i++)
                if (!DecodeDigit(i, buffer, index + 2, ref charCode)) return false;
            // ASCII escapes such as \u0022 (double quote) must stay escaped to keep the JSON valid.
            if (charCode < 0x80) return false;
            result = (char)charCode;
            return true;
        }
    
        /// <summary>
        /// Tries to decode a single hexadecimal digit from a buffer.
        /// </summary>
        /// <remarks>
        /// Upper case is assumed for <see cref="Utf8JsonWriter"/> output.
        /// </remarks>
        /// <param name="n">A zero-based digit index.</param>
        /// <param name="buffer">Buffer to decode.</param>
        /// <param name="index">Sequence index.</param>
        /// <param name="charCode">Character code reference.</param>
        /// <returns>True if the buffer contains a hexadecimal digit at <paramref name="index"/> + <paramref name="n"/>.</returns>
        private static bool DecodeDigit(int n, byte[] buffer, int index, ref int charCode) {
            var value = buffer[index + n];
            var shift = 12 - (n << 2);
            if (value is >= 48 and <= 57) charCode += (value - 48) << shift;
            else if (value is >= 65 and <= 70) charCode += (value - 55) << shift;
            else return false;
            return true;
        }
    
        /// <summary>
        /// Target stream.
        /// </summary>
        private readonly Stream InnerStream;
    
    }
    

    Usage:

    using var writer = new Utf8JsonWriter(new Utf8DecodeStream(stream));
    

    Example xUnit test:

    [Fact]
    public void Utf8DecodeStreamTest() {
        var test = Encoding.UTF8.GetBytes(@"\uD83D\uDC15 \uD83D\uDC15 \uD83D\uDC15");
        using var stream = new MemoryStream();
        var decoding = new Utf8DecodeStream(stream);
        decoding.Write(test);
        decoding.Flush();
        var result = Encoding.UTF8.GetString(stream.ToArray());
        Assert.Equal("🐕 🐕 🐕", result);
    }
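
    For comparison, here is a sketch of a different route that needs no post-processing. JsonWriterOptions has an Encoder property, and System.Text.Encodings.Web ships JavaScriptEncoder.UnsafeRelaxedJsonEscaping, which is documented as escaping less than the default encoder. The assumption here is that it lets characters like "🐕" through unescaped; check this on your target runtime. RelaxedJsonWriter is a hypothetical wrapper name, and the security caution above applies to this encoder as well.

    ```csharp
    using System.IO;
    using System.Text;
    using System.Text.Encodings.Web;
    using System.Text.Json;
    using System.Text.Json.Nodes;

    public static class RelaxedJsonWriter {

        // Writer options that swap the default encoder for the relaxed built-in one.
        public static JsonWriterOptions WriterOptions { get; } = new() {
            Indented = true,
            Encoder = JavaScriptEncoder.UnsafeRelaxedJsonEscaping
        };

        // Writes the node with the relaxed encoder and returns the JSON text.
        public static string Serialize(JsonNode node) {
            using var stream = new MemoryStream();
            // The writer is disposed (and flushed) before the stream is read back.
            using (var writer = new Utf8JsonWriter(stream, WriterOptions)) {
                node.WriteTo(writer);
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }
    ```

    Usage would look like RelaxedJsonWriter.Serialize(new JsonObject { ["dog"] = "🐕" }). The same Encoder setting also exists on JsonSerializerOptions for code that goes through JsonSerializer.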
    
