Some questions and answers about the 'in_str_encoding_hint' value and its use in the implementation of an *external* convert_to_utf8() function.
**********
Q1 - How is the 'in_str_encoding_hint' value that is passed as an argument to the convert_to_utf8() function set?
The MME or io-media sets this value according to the type of metadata field the string was taken from. The value is then passed to the *internal* convert_to_utf8() function, which in turn calls the *external* convert_to_utf8() function written by the user of the MME software.
**********
Q2 - What does an 'in_str_encoding_hint' of NULL mean?
When the 'in_str_encoding_hint' value is set to NULL, either the metadata field from which the string was taken does not specify one of the types defined in 'mm/charconvert.h', or no guess can be made for this string.
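As an illustration of how an external converter might treat a NULL hint, the following is a minimal sketch that sorts the hint into a small set of cases. The enum and the classify_hint() name are hypothetical and are not part of the MME API; the hint strings are the ones mentioned in this article, and the real set is whatever 'mm/charconvert.h' defines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of the hint; not part of the MME API. */
typedef enum {
    HINT_UNKNOWN,    /* hint is NULL or not recognized: encoding must be guessed */
    HINT_ISO8859_1,
    HINT_UTF16LE,
    HINT_UTF16BE,
    HINT_UTF8
} hint_kind_t;

/* Map the hint string to a hint_kind_t. A NULL hint and any string this
 * sketch does not know about are both treated as "unknown". */
hint_kind_t classify_hint(const char *in_str_encoding_hint)
{
    if (in_str_encoding_hint == NULL) {
        return HINT_UNKNOWN;
    }
    if (strcmp(in_str_encoding_hint, "iso8859-1") == 0) {
        return HINT_ISO8859_1;
    }
    if (strcmp(in_str_encoding_hint, "utf16le") == 0) {
        return HINT_UTF16LE;
    }
    if (strcmp(in_str_encoding_hint, "utf16be") == 0) {
        return HINT_UTF16BE;
    }
    if (strcmp(in_str_encoding_hint, "utf8") == 0) {
        return HINT_UTF8;
    }
    return HINT_UNKNOWN;
}
```

With this shape, the HINT_UNKNOWN branch is where any encoding detection of your own would go.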
**********
Q3 - What are the possible settings of 'in_str_encoding_hint' for text strings extracted from ID3v1 metadata?
ID3v1 tags will have their 'in_str_encoding_hint' set to NULL. The hint is NULL rather than "iso8859-1" because ID3v1 text is assumed to use the MS Windows code page, which encodes special characters such as the euro sign. This encoding is very similar, but not identical, to iso8859-1.
In practice, ID3v1 tags often use an arbitrary encoding that depends on the country in question. This is where the external convert_to_utf8() function needs to detect the encoding correctly and convert it to utf8.
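To make the difference from iso8859-1 concrete, here is a minimal sketch of a converter for the Windows code page case. It assumes that the "MS Windows code page" is Windows-1252, which this article does not state explicitly. The function names are hypothetical and not part of the MME API. Windows-1252 differs from iso8859-1 only in bytes 0x80 to 0x9F, so only those bytes need a lookup table.

```c
#include <stddef.h>
#include <stdint.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
 * 0 marks the five bytes that Windows-1252 leaves undefined. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Encode one code point up to U+FFFF as UTF-8; returns bytes written (1..3). */
static size_t put_utf8(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}

/* Hypothetical helper: convert in_len bytes of assumed Windows-1252 text to
 * NUL-terminated UTF-8. Returns the number of bytes written (excluding the
 * NUL), or (size_t)-1 if the output buffer is too small. */
size_t cp1252_to_utf8(const unsigned char *in, size_t in_len,
                      unsigned char *out, size_t out_size)
{
    size_t o = 0;

    if (out_size == 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < in_len; i++) {
        uint32_t cp = in[i];
        if (cp >= 0x80 && cp <= 0x9F) {
            cp = cp1252_high[cp - 0x80];
            if (cp == 0) {
                cp = 0xFFFD;   /* undefined byte: use the replacement character */
            }
        }
        /* Conservatively require room for the longest sequence (3 bytes)
         * plus the terminating NUL before writing anything. */
        if (out_size - o < 4) {
            return (size_t)-1;
        }
        o += put_utf8(cp, out + o);
    }
    out[o] = '\0';
    return o;
}
```

For example, byte 0x80 is the euro sign in Windows-1252 and becomes the three UTF-8 bytes E2 82 AC, whereas in iso8859-1 the same byte is a control character.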
**********
Q4 - What are the possible settings of 'in_str_encoding_hint' for text strings extracted from ID3v2 metadata?
For ID3v2 metadata strings, 'in_str_encoding_hint' will only be NULL, 'iso8859-1', 'utf16le', 'utf16be', or 'utf8'. This is because every textual frame in an ID3v2 tag should begin with an encoding byte, specified as:
-----
$00 – ISO-8859-1 (ASCII).
$01 – UCS-2 in ID3v2.2 and ID3v2.3; UTF-16 encoded Unicode with BOM in ID3v2.4.
$02 – UTF-16BE encoded Unicode without BOM in ID3v2.4 only.
$03 – UTF-8 encoded Unicode in ID3v2.4 only.
-----
Only very basic rule checking is done. If the encoding byte is not one of these values, 'in_str_encoding_hint' is set to NULL and the following warning message is logged on the system:
"mpega_parser: invalid text encoding (0x%02X)", etype
where etype is the value of the encoding byte read from the frame.
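The mapping from encoding byte to hint can be sketched roughly as follows. This is not the actual mpega_parser code: the function name, its parameters, and the use of the BOM to choose between 'utf16le' and 'utf16be' for $01 are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: return the hint string for an ID3v2 text frame, or
 * NULL when the encoding byte is not valid for the given tag version.
 * etype         - the encoding byte at the start of the frame
 * minor_version - 2, 3 or 4 (for ID3v2.2, ID3v2.3, ID3v2.4)
 * text, len     - the frame text that follows the encoding byte */
const char *id3v2_encoding_hint(unsigned char etype, int minor_version,
                                const unsigned char *text, size_t len)
{
    switch (etype) {
    case 0x00:
        return "iso8859-1";
    case 0x01:
        /* Assumption: the byte order is taken from the BOM. */
        if (text != NULL && len >= 2) {
            if (text[0] == 0xFF && text[1] == 0xFE) {
                return "utf16le";
            }
            if (text[0] == 0xFE && text[1] == 0xFF) {
                return "utf16be";
            }
        }
        return NULL;              /* no BOM: byte order unknown */
    case 0x02:
        return (minor_version == 4) ? "utf16be" : NULL;
    case 0x03:
        return (minor_version == 4) ? "utf8" : NULL;
    default:
        return NULL;              /* invalid text encoding byte */
    }
}
```

The NULL results correspond to the cases where the external convert_to_utf8() function receives no usable hint.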
**********
Q5 - What are the possible settings of 'in_str_encoding_hint' for text strings extracted from WMA (ASF) metadata?
ASF metadata text within a WMA file is encoded as 'utf16le', and that will be the value of 'in_str_encoding_hint'.
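For a 'utf16le' hint, the conversion itself is mechanical. The following is a minimal sketch of such a conversion, including surrogate pairs. The function name and buffer conventions are hypothetical and do not reflect the signature of the external convert_to_utf8() function. The sketch does not strip a BOM, and it ignores a trailing odd byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: convert in_len bytes of UTF-16LE to NUL-terminated
 * UTF-8. Returns the number of bytes written (excluding the NUL), or
 * (size_t)-1 if the output buffer is too small. Unpaired surrogates are
 * replaced with U+FFFD. */
size_t utf16le_to_utf8(const unsigned char *in, size_t in_len,
                       unsigned char *out, size_t out_size)
{
    size_t o = 0;

    if (out_size == 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; i + 1 < in_len; i += 2) {
        uint32_t cp = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            uint32_t lo = 0;
            if (i + 3 < in_len) {
                lo = (uint32_t)in[i + 2] | ((uint32_t)in[i + 3] << 8);
            }
            if (lo >= 0xDC00 && lo <= 0xDFFF) {      /* valid pair */
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;                              /* consume the low unit */
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   /* stray low surrogate */
            cp = 0xFFFD;
        }

        /* Conservatively require room for the longest sequence (4 bytes)
         * plus the terminating NUL before writing anything. */
        if (out_size - o < 5) {
            return (size_t)-1;
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```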
**********
Q6 - Can the value of 'in_str_encoding_hint' be useful in determining the correct encoding type to be used so that conversion to utf8 can occur?
The encoding hint is probably not going to help you much for this purpose. The *external* convert_to_utf8() function exists precisely because there are cases where the encoding cannot be determined, or where the encoding used is contrary to the specification for the metadata format. The documentation lists these reasons specifically:
-----
* Not all file format specifications adequately define how to convert the encoded character strings in media files into a human readable format.
* Some media format specifications, such as some versions of ID3, which require that all character strings be encoded as ISO 8859-1 strings, do not adequately define character encodings for non-Western European alphabets and characters (Polish, Russian, Korean, etc.). Media publishers ignore the specification and use another specification to encode their character strings.
* The work-arounds implemented result in media files whose character strings are encoded according to character encoding specifications other than those required by the media format specifications. For example, an MP3 media file, which according to the ID3 media format specification should have its characters encoded as ISO 8859-1 strings, may contain character strings encoded as ISO 8859-2 to correctly encode Polish.
-----
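Because the hint cannot be fully trusted, an external converter usually has to inspect the bytes themselves. One common heuristic, which is an assumption here and does not come from the QNX documentation, is to check whether the string is already well-formed UTF-8 before falling back to a legacy code page. The is_valid_utf8() name below is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: return 1 if the len bytes at s are well-formed UTF-8
 * (no truncated sequences, overlong forms, surrogates, or values above
 * U+10FFFF), otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t n;              /* number of continuation bytes expected */
        unsigned long cp;      /* decoded code point */
        unsigned long min;     /* smallest code point allowed for this length */

        if (c < 0x80) {        /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;          /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= n) {
            return 0;          /* sequence is truncated */
        }
        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) {
                return 0;      /* not a continuation byte */
            }
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            return 0;          /* overlong, out of range, or surrogate */
        }
        i += n + 1;
    }
    return 1;
}
```

A string that passes this check and contains non-ASCII bytes is very likely UTF-8 already. A string that fails it needs a code page guess, for example ISO 8859-2 for Polish as in the case described above. Pure ASCII strings pass as well, and for those the choice of encoding does not matter.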
NOTE:
This entry has been validated against the SDP version listed above. Use
caution when considering this advice for any other SDP version. For
supported releases, please reach out to QNX Technical Support if you have any questions or concerns.