Until version 1.0 the wxJSON reader and writer had some issues, mostly related to speed. The problem was that both the reader and the writer processed every single character read from or written to streams or strings. Worse, for every character read from or written to a stream, the wxJSON implementation performed a character conversion from UTF-8 to wchar_t and vice versa. Note also that such a conversion is, for most characters, not needed at all because those characters are in the US-ASCII charset (0x00..0x7F). The speed issues only affected input and output from streams: if the JSON text is read from or written to wxString objects, no conversion is needed because the returned character is already in the correct encoding.
In version 2.9 of the GUI framework, the developers introduced a radical change to Unicode support, and the internal organisation of the wxString class changed completely. In particular, the wxString class now stores strings in UTF-16 encoding on Windows and in UTF-8 on Unix systems. The drawback is that on *nix systems the usual character access by subscript, such as s[n], is VERY inefficient: UTF-8 is a variable-length encoding, so the position of the n-th character cannot be computed directly from n. The consequence is that wxJSON also has a speed issue when the JSON text input comes from a wxString, and not only when it comes from streams.
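To make the cost of subscript access on UTF-8 data concrete, here is a minimal sketch in standard C++. It is not wxString's actual implementation; the function name utf8_offset_of is hypothetical, and the input is assumed to be well-formed UTF-8. It shows that finding the n-th character requires scanning the bytes from the start of the string.

```cpp
#include <cstddef>
#include <string>

// Returns the byte offset of the n-th code point (0-based) in a UTF-8
// string, or std::string::npos if the string has fewer code points.
// Hypothetical helper for illustration; assumes well-formed UTF-8.
std::size_t utf8_offset_of(const std::string& s, std::size_t n)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes have the bit pattern 10xxxxxx; every other
        // byte starts a new code point.
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) {
            if (count == n) return i;
            ++count;
        }
    }
    return std::string::npos;
}
```

A loop that reads every character of a string by index through such a lookup does work proportional to the square of the string length, which is why a parser that walks the text character by character suffers.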
In order to find the best organization for the reader and the writer, I first have to point out the goals of this new release of wxJSON:
- compatibility with wxWidgets 2.8 and 2.9
- full compatibility with wxJSON version 1.0: I do not want to break compatibility with version 1.0, otherwise I will have to change the major version number
- speed improvements: I will try to speed up both the reader and the writer. The conversion of each character is very slow; there are better solutions, as pointed out by Piotr Likus in his e-mail of November 2008
- simplicity: JSON format is very easy to read and write for humans but it is also easy for machines to parse and generate. The wxJSON library has to be simple in the processing of JSON text.
The wxJSON library allows you to write / read JSON text to / from two different types of objects:
- a string of type wxString
- a stream of type wxInput/OutputStream
These two kinds of I/O classes are very different because of the internal representation of the JSON text: in particular, wxString uses UTF-16 on Windows and UTF-32 on *nix systems up to wxWidgets 2.8. UTF-8 is used on *nix systems in wxWidgets 2.9. For streams the encoding is always UTF-8. A further encoding is used in ANSI mode: locale-dependent one-byte characters.
[Figure: Encoding formats in the different wxWidgets modes and versions]
These encoding differences greatly complicate the organization of the writer and the parser, because each character read from or written to JSON text has to be converted to a single type for processing. At present, each character is converted to a wchar_t type, and this occurs in ANSI mode, too. This conversion slows down the processing very much. A further complication is that wxWidgets 2.9 no longer returns a char or wchar_t type when accessing string objects but a helper class, wxUniChar, which has its own encoding format, so it has to be further converted to wchar_t.
The solution is to use only one encoding format for all types of I/O, build modes and wxWidgets versions: UTF-8 is the only one applicable to all these cases. Using UTF-8 as the unique I/O format has several advantages:
- converting a string to UTF-8 (for the reader) or a UTF-8 stream to a string (for the writer) is very easy and does not take much time; furthermore, in wxWidgets 2.9 the wxString object uses UTF-8 as the internal encoding on some platforms, so the conversion costs nothing at all.
- UTF-8 does not have endianness or byte order issues
- the processing of characters is byte-oriented, so there is no need to deal with wchar_t or wxUniChar: special JSON characters, literals and numbers lie in the US-ASCII character set (one UTF-8 byte).
- the processing of strings is easy because, when reading a string, the parser just stores all UTF-8 bytes up to the closing double quote in a temporary buffer. When the string has been read, it is assigned to a wxJSONValue of type string, which contains a wxString object. The whole temporary buffer is converted to a string using wxString::FromUTF8.
- writing a wxJSONValue to a UTF-8 stream is easy because all special JSON characters, literals and numbers are written as one-byte characters. Strings are written by converting the whole wxString to UTF-8 using wxString::ToUTF8.
- from the point of view of the processing, there is no difference between ANSI and Unicode because the processing is byte-oriented.
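The byte-oriented reading of strings described in the list above can be sketched as follows. This is an illustration in standard C++, not the wxJSON source; the function name read_raw_string is hypothetical. It relies on a property of UTF-8: every byte of a multi-byte sequence is 0x80 or greater, so the ASCII bytes for the double quote and the backslash can never appear inside one.

```cpp
#include <cstddef>
#include <string>

// Given UTF-8 JSON text and the index of an opening double quote, copies
// the raw bytes of the string value into 'out' (escape sequences are kept
// as they are, not decoded). Returns the index just past the closing
// quote, or std::string::npos if the string is not terminated.
// Hypothetical helper for illustration only.
std::size_t read_raw_string(const std::string& json, std::size_t open,
                            std::string& out)
{
    out.clear();
    for (std::size_t i = open + 1; i < json.size(); ++i) {
        const char c = json[i];
        if (c == '"') return i + 1;      // unescaped quote ends the string
        out.push_back(c);
        if (c == '\\' && i + 1 < json.size()) {
            out.push_back(json[++i]);    // keep the escaped byte, even '"'
        }
    }
    return std::string::npos;
}
```

In the design described here, the collected buffer would then be handed to wxString::FromUTF8 in a single call instead of converting each character separately.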
Note that UTF-8 is already the normal encoding format for I/O on streams, but what about wxString input and output? The answer is very simple: the parser will convert a wxString JSON text input to a temporary UTF-8 buffer, which is then processed as a wxMemoryInputStream. On the other hand, the wxJSON writer will only write to a temporary wxMemoryOutputStream in UTF-8 format: the wxString JSON text output is retrieved by converting the temporary buffer to a string object.
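The idea of routing string input through the stream code path can be sketched with standard C++ types standing in for the wxWidgets ones. Here std::u32string plays the role of wxString, std::istringstream plays the role of wxMemoryInputStream, and to_utf8 plays the role of wxString::ToUTF8. All function names are hypothetical, and count_stream_bytes is only a placeholder for a real stream parser.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Encodes UTF-32 text as UTF-8. Assumes valid code points.
std::string to_utf8(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Placeholder for the single stream-based parser: it only counts the
// bytes it reads, where a real parser would build the JSON value.
std::size_t count_stream_bytes(std::istream& is)
{
    std::size_t n = 0;
    char c;
    while (is.get(c)) ++n;
    return n;
}

// String entry point: convert once to a temporary UTF-8 buffer, then
// reuse the stream code path unchanged.
std::size_t parse_string(const std::u32string& text)
{
    std::istringstream buffer(to_utf8(text));
    return count_stream_bytes(buffer);
}
```

The point of the design is that only one parsing routine exists; string input pays for a single whole-buffer conversion up front.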
In version 1.0 the wxJSON library gives you limited Unicode support in ANSI mode when reading UTF-8 streams. For example, suppose we have a UTF-8 file that contains the following data:
{
"us-ascii" : "abcABC",
"latin1" : "àèì©®",
"greek" : "αβγδ",
"cyrillic" : "ФХЦЧ"
}
We read the file in a wxWidgets application built in ANSI mode and localized for Western Europe, thus using the ISO-8859-1 (Latin1) character set. Because the Latin1 charset does not support Greek and Cyrillic characters, the reader cannot store such values in the wxJSONValue object, because it contains a wxString object which can only store one-byte, locale-dependent characters.
In order to keep the original meaning of the data, the wxJSON library converted each character that cannot be represented in the current locale into a Unicode escape sequence. Below is a representation of the content of the wxJSONValue when the file is read:
{
"us-ascii" : "abcABC",
"latin1" : "àèì©®",
"greek" : "\u03B1\u03B2\u03B3\u03B4",
"cyrillic" : "\u0424\u0425\u0426\u0427"
}
I thought that this would be an elegant solution for reading UTF-8 streams in ANSI mode and that data could be exchanged safely from ANSI to Unicode and vice versa, but there are some drawbacks to this solution:
- because of the use of only four hexadecimal digits, only Unicode characters in the first plane (the so-called BMP) can be represented
- writing the wxJSONValue back to its JSON text representation does not revert to UTF-8 encoding: characters are written as Unicode escape sequences, but I am not sure that this is valid JSON text, although wxJSON handles it correctly.
- in order to get the Unicode escape sequence of unrepresentable characters, the wxJSON reader has to convert the string character by character, which is what I want to avoid in this new version because such a conversion slows things down drastically.
Because in the new organization the reader and the writer only process UTF-8 streams, there is a problem when a string contains UTF-8 characters that cannot be represented in the current locale. Note that this only happens in the parser class and only when the JSON text input actually comes from a stream: it does not happen if the processed stream is a temporary UTF-8 buffer obtained by converting the wxString input text.
The solution suggested by Piotr Likus in his e-mail was pretty simple and very fast: who cares about the internal encoding of wxString? When a double-quote character is encountered, just copy the whole stream up to the next unescaped double-quote character; only process escape sequences. The wxString object will, therefore, contain UTF-8 octets in ANY mode and on ANY platform.
Although this would be a very fast solution, one problem still remains: what if the stored strings have to be used / processed / displayed by the application? They surely need to be converted to the native internal encoding, which is platform- and mode-dependent.
So, I decided to do the conversion in the wxJSON reader and writer: string values are always stored in the native format so that they can be immediately processed by the application. For speed purposes, the conversion is done for the whole string, in one step. But this is not applicable in the parser when ANSI mode is used. In the above example, the conversion of the Greek and Cyrillic strings simply fails and an empty string is returned by the wxString::FromUTF8 function.
In order to keep the compatibility with the past, the wxJSON parser will still process UTF-8 streams in ANSI mode one character at a time but only if the conversion of the whole string fails. In this way, if the UTF-8 stream only contains characters that can be represented in the current locale (for example because the UTF-8 stream itself is only a temporary buffer, created by converting the wxString object used as input) no speed loss will occur.
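The "whole string first, character by character only on failure" strategy can be sketched in standard C++ as below. This is a sketch of the control flow under stated assumptions, not the wxJSON implementation: the locale is assumed to be ISO-8859-1, the input is assumed to be well-formed UTF-8, all function names are hypothetical, and emitting \uFFFD for code points beyond the BMP is a choice made only for this example. In wxJSON the first attempt would be a single bulk call such as wxString::FromUTF8, which is where the speed gain comes from; here both paths are plain loops.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Decodes one code point starting at s[i] and advances i past it.
// No validation: assumes well-formed UTF-8.
char32_t next_code_point(const std::string& s, std::size_t& i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80) return b0;
    int extra = (b0 >= 0xF0) ? 3 : (b0 >= 0xE0) ? 2 : 1;
    char32_t cp = b0 & (0x3F >> extra);      // payload bits of the lead byte
    while (extra-- > 0 && i < s.size())
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

// First attempt: convert the whole string to ISO-8859-1. Fails, leaving
// 'out' empty, as soon as one character is not representable.
bool convert_whole(const std::string& utf8, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < utf8.size(); ) {
        const char32_t cp = next_code_point(utf8, i);
        if (cp > 0xFF) { out.clear(); return false; }
        out.push_back(static_cast<char>(cp));
    }
    return true;
}

// Fallback: character by character, writing unrepresentable characters
// as \uXXXX escapes (BMP only; anything above becomes \uFFFD here).
std::string convert_with_escapes(const std::string& utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ) {
        const char32_t cp = next_code_point(utf8, i);
        if (cp <= 0xFF) {
            out.push_back(static_cast<char>(cp));
        } else if (cp <= 0xFFFF) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04X",
                          static_cast<unsigned>(cp));
            out += buf;
        } else {
            out += "\\uFFFD";
        }
    }
    return out;
}

// The slow path runs only when the whole-string attempt fails.
std::string store_string(const std::string& utf8)
{
    std::string out;
    if (convert_whole(utf8, out)) return out;
    return convert_with_escapes(utf8);
}
```

With the sample file shown earlier, the "us-ascii" and "latin1" values would take the first path, while the "greek" and "cyrillic" values would fall back to the escaping path.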
Furthermore, the user may define a special symbol in the include/wx/json_defs.h header file which causes the wxJSON parser to store UTF-8 octets in the wxString object in ANSI mode, without trying any conversion. Note that the conversion is not tried at all, even if the UTF-8 stream can be converted to locale-dependent characters.