NIPO ODIN Version 5.17

Multi-byte Character Fields (MBCSFields Setting)

Note:
This section is only relevant for NIPO Software users who require multi-byte characters in non-Unicode data files.

With the MBCSFields setting in the SURVEY.INI file you can configure fields that should contain multi-byte characters, such as Chinese, Japanese, Korean, Hebrew or Arabic, without having to save files in Unicode. This way you can keep using your non-Unicode analysis tool and still store text from multi-byte languages in data files.

The NIPO CATI / Web Master receives U- and O-file data from the NIPO CATI Clients and/or Web Client in Unicode format. Unless the configuration is set to store Unicode files, the NIPO CATI / Web Master converts these files to a non-Unicode (ASCII) format using the configured code page.

Some (mostly Asian) code pages use multi-byte characters. A single character may be stored in one, two or three bytes. Text files store these characters with variable length. This means that two lines with an equal number of characters may need a different number of actual bytes (positions) to store the text.
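The effect can be illustrated with a minimal Python sketch. It assumes Shift-JIS as the configured code page, which is only an example. The sample strings and the helper name byte_length are hypothetical and not part of NIPO software.

```python
# Illustration only: Shift-JIS stands in for the configured code page.
def byte_length(text: str, encoding: str = "shift_jis") -> int:
    """Return the number of bytes (positions) needed to store text."""
    return len(text.encode(encoding))


latin = "abc"       # three single-byte characters
kanji = "日本語"     # three double-byte characters

# Both strings have the same number of characters...
assert len(latin) == len(kanji) == 3

# ...but they need a different number of bytes in the code page.
print(byte_length(latin))  # 3
print(byte_length(kanji))  # 6
```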

During conversion to the configured code page, multi-byte character encoding in the U-file can cause data to be ‘horizontally shifted’. This happens only in *ALPHA data fields into which multi-byte characters have been entered. If these fields contain characters of varying byte length, subsequent data fields are no longer aligned to the data positions specified in the Q-file. The problem does not occur with characters from a single-byte encoding (the known ‘western’ ASCII range).

The setting MBCSFields in the SURVEY.INI file solves this problem.

Consider the following script:

*Q 10 *ALPHA 61L15
What is your name?

*Q 20 *NUMBER 76L3 *MIN 16
What is your age?

*Q 30 *CODES 80L99 *MULTI
Multiple coded question

In a single-byte character set (SBCS), as used for West-European languages, the interviewer may enter a text of up to 15 characters. After conversion from Unicode to the configured code page, the U-file record may look like this:

   pos 61             76  80
       |              |   |
..00000Martin Rijks   021000000000000...

When the same questionnaire is run in the NIPO Fieldwork System using a multi-byte character set (MBCS) such as Japanese, the interviewer is still allowed to enter a text of up to 15 characters. For storage, however, each character may need more than one position, and the exact length depends on the number of bytes required per character. The code page conversion is unaware of the data length. The positions may end up like this in the U-file (hypothetical example):

pos 61 76 86 89
| | | |
..00000???? ?????021000000000000...

While the text ???? ????? consists of only 10 characters, most of these characters need more than 1 byte in an MBCS. The resulting length of the data varies depending on which characters are used, so the length of the converted text is not the same as the length of the text that the interviewer entered.

In short: the length of each U-record in bytes is determined by the characters it consists of. Thus, the content determines the length. In a worst-case scenario, no two records have the same length. The position of any data that follows an *ALPHA field will therefore vary.
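The shift can be reproduced with a short Python sketch based on the example script (name at 61L15, age at 76L3). It is an illustration under assumptions, not the actual conversion code: Shift-JIS stands in for the configured code page, and the Japanese name and the helper age_position are hypothetical.

```python
NAME_START = 61   # start position of the *ALPHA field (from *Q 10)
NAME_CHARS = 15   # field length in characters (from *Q 10)


def age_position(name: str, encoding: str = "shift_jis") -> int:
    """Return the position where the age field starts after conversion.

    The name is padded to 15 characters, as in the Unicode record, and
    then converted to the code page without regard to the field length.
    """
    padded = name.ljust(NAME_CHARS)
    return NAME_START + len(padded.encode(encoding))


# Single-byte text: the age field starts at position 76, as the Q-file says.
print(age_position("Martin Rijks"))  # 76

# Four double-byte characters plus 11 spaces take 19 bytes: the age field
# has shifted four positions to the right.
print(age_position("山田太郎"))       # 80
```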

Simply storing data in Unicode format would solve this issue, because in UTF-16 each character (outside the rarely used supplementary range) occupies the same number of bytes: a 15-character field is always converted to a 30-byte text. However, some customers need to stick to the non-Unicode format because their data analysis products do not support Unicode. Customers who use multi-byte code pages need a way to store the *ALPHA input properly without causing the data to shift positions.
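As a quick check of the 30-byte figure, the following Python sketch pads a text to 15 characters and encodes it as UTF-16. The helper name utf16_length and the sample names are hypothetical. Characters outside the Basic Multilingual Plane would take four bytes each and are not covered here.

```python
def utf16_length(text: str, width: int = 15) -> int:
    """Return the byte length of text padded to width characters in UTF-16."""
    padded = text.ljust(width)
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(padded.encode("utf-16-le"))


# Western and Japanese text occupy the same number of bytes.
print(utf16_length("Martin Rijks"))  # 30
print(utf16_length("山田太郎"))       # 30
```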

The solution to this is two-fold. First, the NIPO ODIN script writer needs to reserve dummy positions in the Q-file, directly after the *ALPHA field, into which the NIPO CATI / Web Master may extend the MBCS text in the U-file. Second, the NIPO CATI / Web Master needs to be told explicitly which positions may contain multi-byte characters, so that these fields can be properly formatted and spaced during code page conversion. This is done by specifying the positions in the survey configuration file:

Survey configuration file setting for MBCS fields

[Config]
MBCSFields=posLlen[,posLlen, ...]

This setting changes how the NIPO CATI / Web Master stores U-file data, in particular the conversion from the internally used Unicode to the MBCS file format.
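The following Python sketch shows one way to read the posLlen syntax and what fixed-width formatting of an MBCS field could look like. It is an assumption-based illustration, not the NIPO CATI / Web Master implementation: the function names, the value 61L30,100L20, the choice of Shift-JIS, and the truncate-and-pad behaviour are all hypothetical.

```python
import re


def parse_mbcs_fields(value: str) -> list[tuple[int, int]]:
    """Parse a value such as '61L30,100L20' into (position, length) pairs."""
    fields = []
    for part in value.split(","):
        match = re.fullmatch(r"\s*(\d+)L(\d+)\s*", part)
        if match is None:
            raise ValueError(f"not a posLlen entry: {part!r}")
        fields.append((int(match.group(1)), int(match.group(2))))
    return fields


def fit_field(text: str, byte_len: int, encoding: str = "shift_jis") -> bytes:
    """Encode text into exactly byte_len bytes (assumed behaviour).

    Characters are added whole, so a multi-byte character is never cut in
    half. Unused positions are filled with spaces.
    """
    out = b""
    for ch in text:
        encoded = ch.encode(encoding)
        if len(out) + len(encoded) > byte_len:
            break
        out += encoded
    return out.ljust(byte_len, b" ")


# Hypothetical setting: the name field at position 61 reserves 30 bytes.
for pos, length in parse_mbcs_fields("61L30,100L20"):
    print(pos, length)

# Every record now has the same length for this field, whatever the text.
print(len(fit_field("Martin Rijks", 30)))  # 30
print(len(fit_field("山田太郎", 30)))       # 30
```

With a reserved width of 30 bytes, a 15-character text of double-byte characters always fits, which is why the script writer reserves the extra dummy positions after the *ALPHA field.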
