Multi-byte Character
Using WordPress to run our OHeHLium site has generally been a nice experience. There is always an easy-to-use way to meet each need, such as the audio plugin for scorp’s request to play audio/mp3 within her posts, and WPvideo 1.02 for displaying Stone’s great finds from YouTube.
Yet there were some defects. The most annoying one is Chinese character support. I should say this problem shouldn’t really be blamed on WordPress, but I would rather be a little harsh here :p . When displaying the_excerpt or something similar, there will be some unknown character like “,” at the end. That’s because the_excerpt is supposed to retrieve only part of the whole content, such as the first 150 characters. And here the problem appears: the script takes the first 150 bytes instead of characters (like substr in PHP), which is not a problem for English characters but a nightmare for Chinese. In the older Chinese encodings such as GB2312, a Chinese character usually takes 2 bytes. So if, fortunately, the text contains only Chinese characters, or Chinese mixed with an even number of English characters, cutting at 150 or any other even number of bytes will not be a problem. But if not, the cut lands in the middle of a character and bad things happen. That seems to be only a 50/50 chance, not that bad. UTF-8, however, uses 3 bytes for a Chinese character, so the chance of bad things is larger. Anyway, Chinese characters are multi-byte. So I need to find a way to resolve this, as most of the posts we write are composed of both Chinese and English characters.
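To make the byte-versus-character difference concrete, here is a small sketch. The sample string is made up, and it assumes the file is saved as UTF-8 and the mbstring extension is loaded.

```php
<?php
// Hypothetical sample text: 2 Chinese characters followed by 3 ASCII letters.
// Assumes this file itself is saved as UTF-8.
$text = "中文abc";

// strlen() counts bytes; mb_strlen() counts characters in the given encoding.
$byteCount = strlen($text);             // 3 + 3 + 3 bytes in UTF-8
$charCount = mb_strlen($text, 'UTF-8'); // 2 Chinese + 3 ASCII characters

// substr() cuts by bytes: 4 bytes is one whole Chinese character
// plus the first byte of the next one.
$cut = substr($text, 0, 4);

// The leftover lead byte makes the result invalid UTF-8, which is what
// shows up as a garbage character at the end of the excerpt.
$cutIsValid = mb_check_encoding($cut, 'UTF-8');

var_dump($byteCount, $charCount, $cutIsValid);
```

With this string, any byte cut that does not fall on a character boundary leaves a broken tail like this.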
After googling, this was the solution for PHP:
mb_internal_encoding("UTF-8");
$the_excerpt = mb_substr($the_content, 0, 16);
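The same idea can be wrapped in a small helper. This is only a sketch: excerpt_chars is a name I made up (it is not a WordPress or PHP function), the "..." suffix is my own choice, and it assumes the content is UTF-8. It passes the encoding to each call instead of relying on mb_internal_encoding.

```php
<?php
// Hypothetical helper (not part of WordPress or PHP): return at most
// $maxChars characters of $content without splitting a multi-byte character.
function excerpt_chars(string $content, int $maxChars, string $suffix = '...'): string
{
    // Count characters, not bytes; the encoding is passed explicitly.
    if (mb_strlen($content, 'UTF-8') <= $maxChars) {
        return $content; // already short enough, nothing to cut
    }
    // mb_substr() takes its start and length in characters.
    return mb_substr($content, 0, $maxChars, 'UTF-8') . $suffix;
}

// Made-up mixed Chinese/English sample text.
$excerpt = excerpt_chars("中文 and English 混合", 5);
echo $excerpt, "\n";
```

Since the cut is counted in characters, it no longer matters how many English characters are mixed in.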
However, I still wasn’t able to handle this directly in the MySQL query, even though SUBSTRING in MySQL is said to be multi-byte safe. My guess is that the MySQL server’s character set is not utf8, but I had no luck just trying “SET NAMES 'utf8'”. There is always something more to learn.
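The sketch below does not talk to MySQL at all. It only imitates in PHP what I guess is going on: a "multi-byte safe" substring is only as good as the character set it believes the text is in. If UTF-8 bytes are treated as a single-byte character set (ISO-8859-1 here, as an assumption), every byte counts as one character and the cut is byte-wise again.

```php
<?php
// Hypothetical stand-in for a stored column value: UTF-8 bytes of mixed text.
// Assumes this file is saved as UTF-8.
$stored = "中文ab";

// Interpreted as UTF-8, "the first 2 characters" are the two Chinese characters.
$asUtf8 = mb_substr($stored, 0, 2, 'UTF-8');

// Interpreted as a single-byte charset, each byte is one "character",
// so the same request returns only the first 2 bytes of the first character.
$asLatin1 = mb_substr($stored, 0, 2, 'ISO-8859-1');

var_dump($asUtf8 === "中文");
var_dump(strlen($asLatin1));
var_dump(mb_check_encoding($asLatin1, 'UTF-8'));
```

If that guess is right, the character set declared for the table or column would matter as much as the connection setting, but I have not verified this against our server.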
Stone wrote:
So it’s this complicated after all
Posted 22 Aug 2006 at 2:21 am
liang wrote:
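Hehe, actually I first tried something more fun: counting how many English characters there were, and if the count was odd, subtracting one more from the 150 (at the time I didn’t know that UTF-8 uses three bytes per Chinese character), so it never worked no matter what. I should have counted and then made the number divisible by three.
But the method I found later by searching was simpler..
Posted 22 Aug 2006 at 8:38 am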