Kbase P179451: UTF-8 characters are not stored correctly in the database if entered via ODBC
Author   Progress Software Corporation - Progress
Access   Public
Published   1/20/2011
Status: Unverified

SYMPTOM(s):

UTF-8 characters are not stored correctly in the database if entered via ODBC

UTF-8 data inserted via ODBC is not displayed correctly in ABL

UTF-8 data is inserted via ODBC from a C++ client

Data appears to have gone through multiple conversions (e.g. UTF-8 -> 1252 -> UTF-8)

é is displayed as Ã©

é is stored in the database as Ã?Â©

Multi-byte (UTF-8) hex value of é is C3,A9

Single-byte (1252) hex values for characters displayed instead of é are:

Ã = C3
© = A9

Multi-byte (UTF-8) hex values for characters displayed instead of é are:

Ã = C3,83
© = C2,A9

Single-byte (1252) hex values for characters stored in the database are:

Ã = C3
? = 83
Â = C2
© = A9

C++ application is not Unicode enabled

C++ application uses ANSI ODBC types, e.g. SQL_C_CHAR

C++ application converts Unicode characters to UTF-8 so they will fit in the SQL_C_CHAR type

Setting SQL_CLIENT_CHARSET has no effect

FACT(s) (Environment):

OpenEdge database uses code page UTF-8
Problem does not occur when inserting/displaying data via ABL client (-cpinternal UTF-8 -cpstream UTF-8)
Problem does not occur when inserting/displaying data via JDBC
Problem did not occur in OpenEdge 10.0B
Problem is not reproducible using a standard ODBC client (e.g. WinSQL, SQLCON32, ODBC Test)
Data is displayed correctly using a SQL client that supports UNICODE (e.g. Crystal Reports, Microsoft Excel)
OpenEdge 10.1x
OpenEdge 10.2x
Linux
Windows

CAUSE:

Additional code page conversions occur between the C++ client and the ODBC driver because of the way the data is sent to the driver. UTF-8 data is multi-byte, but the application stores it in an ANSI data type (SQL_C_CHAR). The driver receives UTF-8 data but assumes that it is single-byte text in the Active Code Page of the client operating system (e.g. 1252 on Windows configured for USA / Western Europe) and converts it accordingly. As a result, a multi-byte UTF-8 character (for example C3,A9) is interpreted as single-byte and split into its component bytes (C3 and A9). When the data reaches the UTF-8 OpenEdge database, each of these bytes is treated as a separate character and stored as its own UTF-8 sequence (C3,83 and C2,A9).
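
The double conversion described above can be reproduced without a database. The following is a minimal sketch, not the driver's actual code: the helper names encode_bytes_as_utf8 and to_hex are hypothetical, and the single-byte code page is approximated by Latin-1, which agrees with 1252 for the two bytes discussed here (C3 and A9) but not for the range 80-9F.

```cpp
#include <cassert>
#include <string>

// Treat every input byte as one single-byte character and re-encode it
// as UTF-8. Assumption: Latin-1 mapping (byte value == code point), which
// matches code page 1252 for C3 and A9 but not for bytes 80-9F.
std::string encode_bytes_as_utf8(const std::string& bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            out.push_back(static_cast<char>(b));                  // ASCII is unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (b >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (b & 0x3F)));  // continuation byte
        }
    }
    return out;
}

// Format bytes as comma-separated upper-case hex, e.g. "C3,A9".
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char b : bytes) {
        if (!out.empty()) out.push_back(',');
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0F]);
    }
    return out;
}

// The UTF-8 bytes of é (C3,A9), misread as two single-byte characters,
// become the four bytes reported in the symptoms.
void check_double_encoding() {
    const std::string e_acute_utf8 = "\xC3\xA9";
    assert(to_hex(encode_bytes_as_utf8(e_acute_utf8)) == "C3,83,C2,A9");
}
```

Calling check_double_encoding() passes silently, which matches the stored byte sequence listed under SYMPTOM(s).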

FIX:

If working with Unicode data, use the wide ("W") ODBC function calls designed for Unicode (e.g. SQLExecDirectW) together with the wide C data type SQL_C_WCHAR instead of SQL_C_CHAR.
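
For an application that already holds its text as UTF-8, one possible way to follow this advice is to convert the text to UTF-16 code units before binding it as SQL_C_WCHAR. The sketch below is an illustration under stated assumptions, not code from Progress: utf8_to_utf16 is a hypothetical helper, it assumes well-formed UTF-8 input and performs no validation, and it assumes the driver manager's SQLWCHAR is a 16-bit type (as on Windows).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode UTF-8 into UTF-16 code units.
// Assumption: the input is well-formed UTF-8; malformed input is not detected.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t len = 1;
        if (b0 < 0x80)      { cp = b0;        len = 1; }  // 1-byte sequence (ASCII)
        else if (b0 < 0xE0) { cp = b0 & 0x1F; len = 2; }  // 2-byte sequence
        else if (b0 < 0xF0) { cp = b0 & 0x0F; len = 3; }  // 3-byte sequence
        else                { cp = b0 & 0x07; len = 4; }  // 4-byte sequence
        // Fold in the 6 payload bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {
            // Code points above the BMP need a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// é arrives as the UTF-8 bytes C3,A9 and becomes the single unit U+00E9.
void check_wide_conversion() {
    const std::u16string wide = utf8_to_utf16("\xC3\xA9");
    assert(wide.size() == 1);
    assert(wide[0] == 0x00E9);
}
```

The resulting buffer could then be passed to SQLBindParameter with SQL_C_WCHAR as the C type, with the buffer length given in bytes rather than characters. Because the driver is told the data is wide, it no longer has to guess a single-byte code page.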