Databases Reference
In-Depth Information
In multibyte character sets, a character is no longer the same as a byte. Consequently,
MySQL has separate LENGTH() and CHAR_LENGTH() functions, which don't return the
same results on multibyte characters. When you're working with multibyte character
sets, be sure to use the CHAR_LENGTH() function when you want to count characters
(e.g., when you're doing SUBSTRING() operations). The same caution holds for multibyte
characters in application languages.
Another possible surprise is index limitations. If you index a UTF-8 column, MySQL
has to assume each character can take up to three bytes, so the usual length restrictions
are suddenly shortened by a factor of three:
mysql> CREATE TABLE big_string(str VARCHAR(500), KEY(str)) DEFAULT CHARSET=utf8;
Query OK, 0 rows affected, 1 warning (0.06 sec)
mysql> SHOW WARNINGS;
+---------+------+---------------------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------------------+
| Warning | 1071 | Specified key was too long; max key length is 999 bytes |
+---------+------+---------------------------------------------------------+
Notice that MySQL shortened the index to a 333-character prefix automatically:
mysql> SHOW CREATE TABLE big_string\G
*************************** 1. row ***************************
Table: big_string
Create Table: CREATE TABLE `big_string` (
`str` varchar(500) default NULL,
KEY `str` (`str`(333))
) ENGINE=MyISAM DEFAULT CHARSET=utf8
If you didn't notice the warning and check the table definition, you might not have
spotted that the index was created on only a prefix of the column. This will have side
effects such as disabling covering indexes.
Some people recommend that you just use UTF-8 globally to “make your life simpler.”
However, this is not necessarily a good idea if you care about performance. Many ap-
plications don't need to use UTF-8 at all, and depending on your data, UTF-8 can use
much more storage space on disk.
When deciding on a character set, it's important to consider the kind of data you will
store. For example, if you store mostly English text UTF-8 will add practically no stor-
age penalty, because most characters in the English language fit in one byte in UTF-8.
On the other hand, you might see a big difference if you store non-Latin languages such
as Russian or Arabic. An application that needs to store only Arabic could use the
cp1256 character set, which can represent all Arabic characters in one byte. But if the
application needs to store many different languages and you choose UTF-8 instead,
the very same Arabic characters will use more space. Likewise, if you convert a column
from a national character set to UTF-8, you can increase the required storage space
dramatically. If you're using InnoDB, you might increase the data size to the point that
the values don't fit on the page and require external storage, which can cause a lot of
wasted storage space and fragmentation.
 
Search WWH ::




Custom Search