Unicode sort order

David Snider · 11-17-1998, 09:28 AM

Does anyone have any experience handling mixed-character data in Unicode, especially in how to handle sorting it?

For example, say I have a customer database with records from all over the world, with customer names in Chinese, Russian, Swedish, Arabic & Hebrew, and I want to generate a list of all customers who have not upgraded to the latest version yet. The list will be sorted in the sort order of the locale I defined at DB installation, say, in this case, US English. How do the Chinese, Arabic, Swedish, etc. characters get sorted?

I'm interested to know if this is something that people are grappling with out there, or whether it's a non-issue. I think what's probably needed is a special Unicode mixed-character sort order, encompassing all the Unicode code points and sorting them in an order that's as usable as possible for the general user (e.g. Latin characters case-sensitive, including accented characters, then Asian characters in stroke order, etc.). Or maybe several different mixed-character sort orders would be the thing... Any feedback would be appreciated.
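To make the idea of a mixed-character sort order concrete, here is a minimal Python sketch of one possible ordering. Everything in it is an assumption for illustration: the script ranking in `SCRIPT_ORDER`, the helper names, and the sample customer names are all hypothetical, and it is not what any database actually ships. It groups names by script, compares base letters ignoring case and accents within a group, and breaks ties by code point. Stroke order for Asian characters is not attempted, since the standard library has no stroke data; those names simply fall back to code point order.

```python
import unicodedata

# Hypothetical script ranking; the order itself is an arbitrary choice.
SCRIPT_ORDER = ["LATIN", "CYRILLIC", "HEBREW", "ARABIC", "CJK"]

def script_rank(text):
    """Rank a string by the script of its first letter, read from the
    Unicode character name (e.g. 'CYRILLIC CAPITAL LETTER I')."""
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            for rank, script in enumerate(SCRIPT_ORDER):
                if name.startswith(script):
                    return rank
            break
    return len(SCRIPT_ORDER)  # unknown scripts sort last

def strip_accents(text):
    """Drop combining marks so that 'Å' compares like 'A'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def mixed_sort_key(text):
    # Level 1: script group.
    # Level 2: base letters, ignoring case and accents.
    # Level 3: the original string, so exact ties break by code point.
    return (script_rank(text), strip_accents(text).casefold(), text)

# Made-up sample data.
customers = [
    "李明",
    "Zoë",
    "Иван",
    "apple",
    "Åsa",
    "שרה",
    "محمد",
]

ordered = sorted(customers, key=mixed_sort_key)
for name in ordered:
    print(name)
```

Note the trade-off this exposes: treating Å like A suits a US English reader, but the Swedish alphabet places Å after Z, so a single mixed order cannot satisfy every locale. That is an argument for the "several different mixed-character sort orders" option.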

Chris Dart · 11-23-1998, 09:49 PM

David -- the way SQL 7 functions with Unicode has been something I have been asking of every Microsoft person I have encountered since last Spring when I first heard about the multilingual capabilities of 7 (have also plowed through the web site, sent emails, and read everything I could find). To this point I have had no luck at getting any real information but I have some educated guesses as to what I think it will do and I am happy to share. If I ever get the time to play with the product, or hear from Microsoft, I will let you know what I learn. I do Japanese/English and bilingual applications have been an interest of mine for years.

I don't know what you know about Unicode, but as a very brief explanation, all apps used to use code pages to map the byte values stored in the computer to the characters the user saw on the screen. That worked not too badly when the whole world was English, but when apps were added in double-byte languages (Japanese, Chinese, Korean, et al.) plus all the variations of single-byte languages (as in Europe, the Middle East, and the Americas), there were problems. Thus Unicode, which is essentially a double-byte "code page" that has space for approx. 65,000 codes for languages from all over the world. Each language "group" has a section and each character/letter has a code number.

I am guessing that because sort order varies so much from language to language, the Unicode data type sorts by the underlying code number, much the same way that numbers stored as text sort by their ASCII values rather than by their numeric value. The only other way to do it would be to pick one language and sort by that language's "rules". That could be a real mess. For example, in Japanese, one could sort by a Kanji's reading (Chinese reading or Japanese reading), or by the Kanji's radical. If hiragana is involved, one sorts by a, ka, sa, ta, na (the basic "alphabet"). Chinese, being a character-based language, would have a similar complexity.

It sounds as if your database has a real mix. If you get a book on Unicode, you could gain a better understanding of how the "table" is structured and, I would imagine, sorted. I would think that the code "sort order" would provide the closest thing to a "mixed character" sort order. Another option would be to have an additional field with the English translation, and sort on that.
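As a rough illustration of what sorting "by the underlying code number" looks like, here is a small Python sketch. It only shows code point order in general; whether SQL 7 actually sorts Unicode columns this way remains a guess, and the sample names are made up. Python compares strings code point by code point, so a plain `sorted()` call is enough.

```python
# Python compares str values code point by code point, which mirrors
# the "sort by the underlying code number" guess. Sample data is made up.
names = ["Zoë", "apple", "Åsa", "Иван", "李"]

by_code_point = sorted(names)

# Numbers stored as text behave the same way: "10" sorts before "9"
# because the character "1" has a lower code than "9".
numbers_as_text = sorted(["9", "10", "2"])

# Hiragana happen to be laid out in a, ka, sa, ta, na order in the
# Unicode table, so code point order matches that order for these three.
hiragana = sorted(["さ", "か", "あ"])

# Show the code point of each name's first character.
for name in by_code_point:
    print(f"U+{ord(name[0]):04X}  {name}")
```

In this order all uppercase Latin letters come before lowercase ones, accented Latin letters come after the unaccented alphabet, and each script lands in its own block (Latin, then Cyrillic, then the Chinese characters here). So the result is predictable, but it is not what a reader of any one language would call alphabetical.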

I do indeed think this is an issue that people should be grappling with. We are becoming a more global society, and it's about time that technology became multilingual, especially with the growth of the Internet. Nice to know someone else is dealing with this. Good luck!
Chris

