Consultor Eletrônico

Status: Unverified

SYMPTOM(s):

Word indexes on Thai text do not return the expected records in UTF-8 databases.

Searches for single characters return all records where the character exists, even within words.

Searches for a string of characters return all records where those characters exist regardless of the order, even within words.

FACT(s) (Environment):

The database is UTF-8 with basic collation.
The client is the UTF-8 client.
All Supported Operating Systems
OpenEdge 10.1x
OpenEdge 10.2x

CAUSE:

This is expected behavior and is documented in the Internationalizing Applications Manual.

The default word-break behavior of characters in multi-byte code pages depends on the code page in question and on the number of bytes that make up each character.

With a UTF-8 database, double-byte characters by default behave according to the USE_IT word delimiter attribute. Characters of three or more bytes by default behave as separate words. Thai characters are encoded as three bytes in UTF-8, so each Thai character is indexed as a word of its own, which produces the symptoms described above.

Although the default word-break behavior of single-byte characters in UTF-8 databases can be changed using a custom word-break input file, it is not possible to change the default word-break behavior of double-byte or longer multi-byte characters.

FIX:

One way to work around this limitation is to combine CONTAINS with MATCHES, for example:
FOR EACH customer NO-LOCK WHERE
    customer.comments CONTAINS cthaichars AND
    customer.comments MATCHES "*" + cthaichars + "*":
    DISPLAY customer.comments.
END.

This should ensure that the query finds only records in which the characters are contiguous and in the correct order. However, it will still return records where the search string is a substring of a longer word. To avoid this, further ABL logic (BEGINS, SUBSTRING, =) can reduce the result set, although this would probably be done after the FOR EACH or QUERY has returned the records matched by CONTAINS. In effect, this means using a two-step query:

1) Get all the records returned by CONTAINS.
2) Further filter those records with ABL.
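
The two steps above might be sketched as follows. This is only an illustration under stated assumptions: it assumes that words in customer.comments are separated by single spaces, that the session runs with a UTF-8 internal code page, and that the search string itself contains no spaces. The sample search value is hypothetical. It uses LOOKUP with a space delimiter rather than the BEGINS, SUBSTRING, or = operators mentioned above, because LOOKUP compares the search string against each whole space-delimited entry.

```abl
/* Sketch only: assumes space-delimited words in customer.comments. */
DEFINE VARIABLE cThaiChars AS CHARACTER NO-UNDO.

cThaiChars = "ไทย". /* hypothetical sample search string */

/* Step 1: CONTAINS uses the word index to fetch candidate records. */
FOR EACH customer NO-LOCK
    WHERE customer.comments CONTAINS cThaiChars:

    /* Step 2: keep the record only if the search string is a whole */
    /* space-delimited entry, not a substring of a longer word.     */
    IF LOOKUP(cThaiChars, customer.comments, " ") > 0 THEN
        DISPLAY customer.comments.
END.
```

If the comments field uses other separators (punctuation, line breaks), the step 2 test would need to be adapted accordingly.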