Multi-byte Character

Using WordPress to support our OHeHLium site has generally been a nice experience. You can always find an easy-to-use way to meet various needs, such as an audio plugin for scorp’s demand to play audio/MP3 within her posts, and WPvideo 1.02 for displaying Stone’s great finds from YouTube.

Yet there were some defects. The most annoying is Chinese character support. I should say this problem shouldn’t really be ascribed to WordPress, but I would rather be a little harsh here :p . When displaying the_excerpt or something similar, there will be some unknown, garbled character at the end. That’s because the_excerpt is supposed to retrieve only part of the whole content, such as the first 150 characters. And then the problem appears: the script takes the first 150 bytes instead of characters (like substr in PHP), which is not a problem for English characters but a nightmare for Chinese.

In a double-byte encoding such as GBK, a Chinese character usually takes 2 bytes. So if, fortunately, the text contains only Chinese characters, or Chinese mixed with an even number of English characters, cutting at 150 or any other even number of bytes will not be a problem. Otherwise the cut can land in the middle of a character and bad things will happen. That seems like only a 50/50 chance, not that bad. UTF-8, however, uses 3 bytes for a Chinese character, so the chance of bad things is larger :) . Either way, Chinese characters are multi-byte. So I needed to find a way to resolve this, as most of the posts we write are composed of both Chinese and English characters.
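
To make the byte-versus-character difference concrete, here is a small sketch. It assumes the mbstring extension is loaded and the file is saved as UTF-8; the sample string and variable names are made up for illustration and are not taken from WordPress.

```php
<?php
// Assumption: mbstring is available and this source file is UTF-8 encoded.
$text = "中文abc";                       // 2 Chinese characters + 3 ASCII letters

$byteCount = strlen($text);              // counts bytes: 2*3 + 3 = 9
$charCount = mb_strlen($text, 'UTF-8');  // counts characters: 5

// Byte-based cut: 4 bytes = all of "中" plus only the first byte of "文".
$byteCut = substr($text, 0, 4);
// Character-based cut: 4 characters = "中文ab".
$charCut = mb_substr($text, 0, 4, 'UTF-8');

// mb_check_encoding() reports whether a string is well-formed UTF-8.
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');  // false: ends mid-character
$charCutIsValid = mb_check_encoding($charCut, 'UTF-8');  // true

var_dump($byteCount, $charCount, $byteCutIsValid, $charCutIsValid);
```

The dangling byte left by the byte-based cut is what shows up as the strange character at the end of an excerpt.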

After googling, this was the solution for PHP:

// Tell the mb_* functions to treat strings as UTF-8, then cut by characters, not bytes.
mb_internal_encoding("UTF-8");
$the_excerpt = mb_substr($the_content, 0, 16);
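
The same idea can be wrapped in a small helper. This is only a sketch: make_excerpt is a hypothetical function name, not part of WordPress, and the default length and the "..." suffix are my own assumptions. It passes the encoding explicitly, so it does not depend on mb_internal_encoding having been called first.

```php
<?php
// Hypothetical helper (not a WordPress function): cut $text to at most
// $maxChars characters and append $suffix only when something was cut off.
function make_excerpt(string $text, int $maxChars = 150, string $suffix = '...'): string
{
    // Compare character counts, not byte counts.
    if (mb_strlen($text, 'UTF-8') <= $maxChars) {
        return $text;
    }
    // mb_substr() counts in characters, so it never splits a Chinese character.
    return mb_substr($text, 0, $maxChars, 'UTF-8') . $suffix;
}

echo make_excerpt("多字节字符 and English mixed together", 8), "\n";
// prints: 多字节字符 an...
```

Short posts come back unchanged, so the suffix only appears when the text was really truncated.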

However, I was still not able to handle this directly in the MySQL query, though SUBSTRING in MySQL is said to be multi-byte safe. My guess is that the MySQL server is not set to utf8, but I had no luck just trying “SET NAMES 'utf8'”. There is always something more to learn :)

Comments

  1. Stone wrote:

    So it turns out to be this complicated

  2. liang wrote:

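    Hehe, actually I first tried something more fun: counting how many English characters there were, and if the count was odd, subtracting one from the 150 (at the time I didn't know that UTF-8 uses three bytes per Chinese character), so it never worked no matter what. I should have counted and then made it divisible by three :)
    But the method I found later by searching was simpler..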
