blog
HOME · CREATIVE · WEB · TECH · BLOG

Thursday, October 9th, 2008

Unicode and 4D v11

Keisuke Miyako giving a talk on Unicode in 4D

[These are notes from a session at 4D Summit 2008. After a few sessions this afternoon that really didn't hold my attention, I decided to go to Keisuke Miyako's session on Unicode. His session last year had been completely amazing. There aren't a lot of notes simply because I couldn't keep up. He covered the material pretty quickly and it was a case of either paying attention or taking notes - I couldn't do both.]

Presenter: Keisuke Miyako, 4D Japan

The important thing is that the same letter shape can have different code points. For example the Latin small 'a' and the Cyrillic small 'a' look the same, but have different code points. Likewise, the same word in traditional Chinese and simplified Chinese has different code points.
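To make the look-alike point concrete, here is a small sketch in Python (not 4D code, just an illustration using Python's standard `unicodedata` module). The variable names are mine.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, drawn like the Latin letter in most fonts

for ch in (latin_a, cyrillic_a):
    # Print the code point and the official Unicode character name
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Same shape on screen, but not the same character
print(latin_a == cyrillic_a)  # False
```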

There are different ways to encode Unicode - UTF-8, UTF-16. But even UTF-16 can't fit every Unicode character into a single 16-bit unit, and sometimes you need two units (a surrogate pair) to represent one code point.
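A quick way to see the "two units for one code point" case is to dump the 16-bit units of a string. This is a Python sketch, and `utf16_units` is a helper name I made up for the illustration.

```python
def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (big endian, no BOM)."""
    raw = text.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

bmp_char = "\u3042"         # HIRAGANA LETTER A, fits in one 16-bit unit
astral_char = "\U00020BB7"  # a CJK ideograph beyond U+FFFF

print([hex(u) for u in utf16_units(bmp_char)])     # ['0x3042']
print([hex(u) for u in utf16_units(astral_char)])  # ['0xd842', '0xdfb7']
```

The second character is a single code point, but UTF-16 has to spell it with a pair of surrogate units.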

Also realize that there are some characters that can be represented either by one code point (2 bytes in UTF-16) for a precomposed character or by two code points (4 bytes - the second being a combining character such as a diacritical), and there are other characters that may need three code points - say a character with two combining characters.
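Here is the same idea as a Python sketch, counting code points and UTF-16 bytes for three ways of writing an accented 'e'. The example strings are my own choice, not from the session.

```python
precomposed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"     # e + COMBINING ACUTE ACCENT
stacked = "e\u0302\u0301"  # e + combining circumflex + combining acute

for s in (precomposed, decomposed, stacked):
    # code point count, then size in bytes when encoded as UTF-16
    print(len(s), len(s.encode("utf-16-be")))
# 1 2
# 2 4
# 3 6
```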

4D simplifies things and if you have one character displayed with two code points and compare it to the precomposed (one code point) version, 4D will say they are equal despite the fact that they're different lengths.
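My notes don't say how 4D does this internally. As a rough analogy only, this Python sketch gets the same "equal despite different lengths" result by normalizing both strings to NFC before comparing. `canonically_equal` is a name I made up.

```python
import unicodedata

def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC (assumed approach)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "\u00e9"  # é as one code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(precomposed == decomposed)                   # False: raw comparison
print(canonically_equal(precomposed, decomposed))  # True
print(len(precomposed), len(decomposed))           # 1 2
```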

Keyword indexing works pretty well in most languages, but in some languages (e.g. Thai) words aren't separated by spaces and you have problems.
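To show why spaces matter, here is a deliberately naive keyword splitter in Python. It is not how 4D builds its keyword index (I don't know the details); it only illustrates what goes wrong when a language doesn't put spaces between words.

```python
def naive_keywords(text):
    """Split text into keywords on whitespace only (a deliberately naive approach)."""
    return text.split()

english = "Thai language"
thai = "ภาษาไทย"  # "Thai language": two words, written without a space

print(len(naive_keywords(english)))  # 2
print(len(naive_keywords(thai)))     # 1
```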

RECEIVE PACKET uses UTF-8 encoding by default, so for data in another encoding you'll need to change the encoding to get RECEIVE PACKET to work properly. The best way is to RECEIVE PACKET into a blob, then SET DOCUMENT POSITION and then USE CHARACTER SET.
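The general idea (read raw bytes first, then decode them with the right character set) can be sketched in Python. This is an analogy, not 4D code, and the Shift_JIS sample data is my own stand-in for whatever the external source actually sends.

```python
# Stand-in for bytes received from a source that writes Shift_JIS
raw = "日本語".encode("shift_jis")

# Assuming UTF-8 for these bytes fails
try:
    raw.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Decoding with the encoding the sender really used recovers the text
text = raw.decode("shift_jis")

print(decoded_as_utf8)  # False
print(text)             # 日本語
```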

As for which encoding you should use: UTF-16 is fine for internal communications, but for external communications you should use UTF-8. UTF-16 can be big endian or little endian, and because you don't know which one the external system wants, you should use UTF-8, which doesn't have little/big endian issues.
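The endian problem is easy to see by dumping the bytes. A small Python sketch:

```python
text = "A"  # U+0041

print(text.encode("utf-16-be").hex())  # 0041
print(text.encode("utf-16-le").hex())  # 4100
print(text.encode("utf-8").hex())      # 41

# Guessing the wrong byte order silently produces a different character
misread = text.encode("utf-16-be").decode("utf-16-le")
print(f"U+{ord(misread):04X}")         # U+4100
```

UTF-8 is defined as a sequence of single bytes, so there is no byte order for the two sides to disagree about.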

Likewise, always use UTF-8 with blobs, but use UTF-16 with text. Blobs are byte oriented.
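"Byte oriented" means the size of the data depends on the encoding, not on the number of characters. A Python sketch with a sample string of my own:

```python
text = "日本語"

print(len(text))                      # 3 characters
print(len(text.encode("utf-8")))      # 9 bytes
print(len(text.encode("utf-16-le")))  # 6 bytes
```

So any offset or length you use on the byte side has to be counted in bytes of the chosen encoding, not in characters.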

Be very careful with Replace string and Match regex - they may not work the way you expect.
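As the speaker points out in the comments, the surprise is mostly about length: the text that was matched can be longer than the text you searched for. Here is a Python sketch of the safe habit (take the length from the match, not from the search term). Python's `re` doesn't treat the two spellings of 'é' as equal by itself, so the pattern below lists both explicitly. That is my workaround, not 4D behavior.

```python
import re

text = "cafe\u0301 au lait"  # "café" stored as e + combining acute accent
search_term = "caf\u00e9"    # what the user typed: precomposed é
pattern = re.compile("caf(?:\u00e9|e\u0301)")  # accept either spelling of é

m = pattern.search(text)
start = m.start()
length = m.end() - m.start()  # length of what was actually matched

print(len(search_term), length)       # 4 5
matched = text[start:start + length]  # slice using the captured length
print(matched == "cafe\u0301")        # True
```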

The debugger is not Unicode compliant - it's the same debugger as in v2004.

He demo'd a component that can create translated XLIFF files automatically. But "it's not as useful as it looks" because he passes the text through Google Translate, which is imperfect. Still, it was very cool to watch and wonder what he was doing.

If you type in a character that isn't supported by the font you've chosen for the field, 4D will switch to the default font for the language of the character you typed. In other words, multiple fonts per field/variable.

In 4D there are new Unicode functions and older non-Unicode functions. The Unicode functions are generally much faster 'cause they're new, optimized code.

Categories: 4D


One Comment

  1. Keisuke MIYAKO Says:

    thank you for attending my session !

    actually, Position and Match regex does work as expected, it’s just the length that might not be what you would expect… so always capture the length with Position/Match regex, pass that length to Substring or whatever and you would be fine.

    also may I point that I was using RECEIVE PACKET twice in the demo for display purposes; normally you would either SET CHARACTER SET and receive text or don’t care about encoding and receive BLOB.

    the demo should be here, unless our webmaster has removed it…

    http://www.4d-japan.com/temp/unicode.zip
