Kbase P115168: How to convert Unicode characters to UTF-8?
Author: Progress Software Corporation - Progress
Access: Public
Published: 04/12/2008
Status: Unverified
GOAL:
How to convert Unicode characters to UTF-8?
GOAL:
Does Progress Support Unicode?
FACT(s) (Environment):
OpenEdge 10.x
FIX:
Progress supports the UTF-8 encoding of Unicode.
UTF-8 is a variable-length encoding that represents each Unicode code point in 1 to 4 bytes. For example, the Euro symbol in Unicode is represented by the integer value 8364, or the two-byte hexadecimal value 20AC. In UTF-8 the same character is represented by the three-byte hexadecimal value E282AC, which read as a single integer is 14844588. In OpenEdge 10.x Progress introduced the UTF-8 client, which fully supports UTF-8 with -cpinternal and -cpstream.
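The three UTF-8 bytes quoted above can be derived from the code point by hand. The following is a minimal 4GL sketch of that arithmetic for the Euro symbol only. It assumes a code point in the three-byte range (U+0800 to U+FFFF) and uses integer division and MODULO in place of bit shifts. The variable names are illustrative and do not come from any Progress library.

```abl
/* Sketch: derive the three UTF-8 bytes for a code point in the range
   U+0800..U+FFFF using integer arithmetic in place of bit shifts. */
DEFINE VARIABLE iCodePoint AS INTEGER NO-UNDO INITIAL 8364. /* U+20AC, Euro */
DEFINE VARIABLE iByte1     AS INTEGER NO-UNDO.
DEFINE VARIABLE iByte2     AS INTEGER NO-UNDO.
DEFINE VARIABLE iByte3     AS INTEGER NO-UNDO.
DEFINE VARIABLE iPacked    AS INTEGER NO-UNDO.

/* Lead byte 1110xxxx: marker 224 plus the top 4 bits of the code point. */
iByte1 = 224 + INTEGER(TRUNCATE(iCodePoint / 4096, 0)).

/* Continuation byte 10xxxxxx: marker 128 plus the middle 6 bits. */
iByte2 = 128 + (INTEGER(TRUNCATE(iCodePoint / 64, 0)) MODULO 64).

/* Continuation byte 10xxxxxx: marker 128 plus the low 6 bits. */
iByte3 = 128 + (iCodePoint MODULO 64).

/* The three bytes read as one integer, as quoted in the text. */
iPacked = iByte1 * 65536 + iByte2 * 256 + iByte3.

MESSAGE "UTF-8 bytes (decimal): " iByte1 iByte2 iByte3 SKIP
        "As a single integer  : " iPacked
    VIEW-AS ALERT-BOX INFO BUTTONS OK.
```

For 8364 this gives the bytes 226, 130 and 172 (hexadecimal E2, 82, AC) and the packed integer 14844588. Code points outside the three-byte range need the one-, two- or four-byte forms of the same scheme, which this sketch does not handle.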
Progress does not support Unicode directly (that is, as a -cpinternal or -cpstream value), but it does support conversions to and from UCS-2 (Unicode version 1.1) and UTF-16 (Unicode 2.0). Using 4GL it is therefore possible to convert Unicode values to UTF-8 and use them within Progress. Note that some platforms order the bytes of multi-byte characters differently (endian byte ordering), which can cause problems with such conversions. That aside, the following code shows how to convert a Euro '€' character from Western European 1252 to UTF-8, then to UCS-2 and UTF-16, and back to 1252:
DEFINE VARIABLE c1252Euro AS CHARACTER NO-UNDO.
DEFINE VARIABLE cUTF-8Euro AS CHARACTER NO-UNDO.
DEFINE VARIABLE cUCS2Euro AS CHARACTER NO-UNDO.
DEFINE VARIABLE c1252EuroBack AS CHARACTER NO-UNDO.
DEFINE VARIABLE cUTF-16Euro AS CHARACTER NO-UNDO.
DEFINE VARIABLE c1252EuroBack2 AS CHARACTER NO-UNDO.
c1252Euro = CHR(128,"1252","1252"). /* Euro is 128(Int) 80(Hex) in 1252 */
cUTF-8Euro = CODEPAGE-CONVERT(c1252Euro, "UTF-8", "1252"). /* Euro is 14844588(Int) e2,82,ac(Hex) in UTF-8 */
cUCS2Euro = CODEPAGE-CONVERT(cUTF-8Euro, "UCS2", "UTF-8"). /* Euro is 8364(Int) 20,AC(Hex) in Unicode (UCS2) */
c1252EuroBack = CODEPAGE-CONVERT(cUCS2Euro, "1252", "UCS2"). /* Euro is 128(Int) 80(Hex) in 1252 */
cUTF-16Euro = CODEPAGE-CONVERT(cUTF-8Euro, "UTF-16", "UTF-8"). /* Euro is 8364(Int) 20,AC(Hex) in Unicode (UTF-16) */
c1252EuroBack2 = CODEPAGE-CONVERT(cUTF-16Euro, "1252", "UTF-16"). /* Euro is 128(Int) 80(Hex) in 1252 */
MESSAGE "1252 Euro : " c1252Euro ASC(c1252Euro, "1252", "1252") SKIP
"UTF-8 Euro : " cUTF-8Euro ASC(cUTF-8Euro,"UTF-8","UTF-8") SKIP
"UCS2 Euro : " cUCS2Euro ASC(cUCS2Euro,"UCS2","UCS2") SKIP
"UTF-16 Euro : " cUTF-16Euro ASC(cUTF-16Euro, "UTF-16", "UTF-16") SKIP(1)
"1252 Euro from UCS2 : " c1252EuroBack ASC(TRIM(c1252EuroBack), "1252", "1252") SKIP
"1252 Euro From UTF-16 : " c1252EuroBack2 ASC(TRIM(c1252EuroBack2), "1252", "1252") SKIP
VIEW-AS ALERT-BOX INFO BUTTONS OK.
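The byte-ordering caveat mentioned before the example can be made concrete with a MEMPTR. The sketch below writes the 16-bit value 8364 (hexadecimal 20AC) into a two-byte buffer under each byte order and reads the individual bytes back. It assumes the standard 4GL MEMPTR statements and functions SET-BYTE-ORDER, PUT-UNSIGNED-SHORT and GET-BYTE. The variable names are illustrative, and the sketch only shows the memory layout; it says nothing about how CODEPAGE-CONVERT behaves on a particular platform.

```abl
/* Sketch: show the byte layout of the 16-bit value 20AC (hex)
   under big-endian and little-endian ordering. */
DEFINE VARIABLE mValue   AS MEMPTR  NO-UNDO.
DEFINE VARIABLE iBig1    AS INTEGER NO-UNDO.
DEFINE VARIABLE iBig2    AS INTEGER NO-UNDO.
DEFINE VARIABLE iLittle1 AS INTEGER NO-UNDO.
DEFINE VARIABLE iLittle2 AS INTEGER NO-UNDO.

SET-SIZE(mValue) = 2.

/* Big-endian: most significant byte first. */
SET-BYTE-ORDER(mValue) = BIG-ENDIAN.
PUT-UNSIGNED-SHORT(mValue, 1) = 8364.
ASSIGN
    iBig1 = GET-BYTE(mValue, 1)
    iBig2 = GET-BYTE(mValue, 2).

/* Little-endian: least significant byte first. */
SET-BYTE-ORDER(mValue) = LITTLE-ENDIAN.
PUT-UNSIGNED-SHORT(mValue, 1) = 8364.
ASSIGN
    iLittle1 = GET-BYTE(mValue, 1)
    iLittle2 = GET-BYTE(mValue, 2).

/* Release the memory allocated for the MEMPTR. */
SET-SIZE(mValue) = 0.

MESSAGE "Big-endian bytes    : " iBig1 iBig2 SKIP
        "Little-endian bytes : " iLittle1 iLittle2
    VIEW-AS ALERT-BOX INFO BUTTONS OK.
```

The big-endian pair should be 32 and 172 (hexadecimal 20, AC) and the little-endian pair the reverse. Data written in one order and read in the other is interpreted as a different character, which is the kind of problem the caveat describes.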
Alternative solutions to this could be:
1. Modify the process that created the Unicode data to create a UTF-8 encoded output file.
2. Use 4GL to convert the Unicode characters to UTF-8 via binary. This can be done by converting the hexadecimal value of each Unicode character to its binary bitmap, applying the UTF-8 transformation to the bitmap, and then converting the resulting bitmap back to hexadecimal. This requires that the characters are delimited, so that when reading the hexadecimal values Progress knows where each character begins and ends.
3. Use a third-party editor to re-save the file in UTF-8 format. Some editors (e.g. UltraEdit) have this functionality, so the file can be re-saved in another code page format. Once the file is saved as UTF-8, it can be read by the IMPORT statement provided -cpstream is UTF-8 or INPUT FROM with CONVERT SOURCE UTF-8 is used. Alternatively, if the editor cannot save a file in UTF-8 format, save it in another code page format that Progress can use. Provided -cpstream is set accordingly, it should be possible to import the file.
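As an illustration of the third alternative, the following minimal sketch reads a file that has already been saved as UTF-8, converting each line from UTF-8 to the session's internal code page as it is read. The file name data-utf8.txt is hypothetical, and the sketch assumes that every character in the file exists in the -cpinternal code page.

```abl
/* Sketch: read a UTF-8 file line by line, converting on input.
   "data-utf8.txt" is a hypothetical file name used for illustration. */
DEFINE VARIABLE cLine AS CHARACTER NO-UNDO.

/* SOURCE names the code page of the file. With no TARGET given,
   the data is converted to the -cpinternal code page. */
INPUT FROM VALUE("data-utf8.txt") CONVERT SOURCE "UTF-8".

REPEAT:
    /* IMPORT raises ENDKEY at end of file, which ends the REPEAT block. */
    IMPORT UNFORMATTED cLine.
    DISPLAY cLine FORMAT "x(70)".
END.

INPUT CLOSE.
```

If the session is started with -cpstream UTF-8, the CONVERT phrase can be left out because the stream code page already matches the file.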