Advanced MySQL Features - High Performance MySQL

Databases Reference

In-Depth Information

In multibyte character sets, a character is no longer the same as a byte. Consequently,

MySQL has separate LENGTH() and CHAR_LENGTH() functions, which don't return the

same results on multibyte characters. When you're working with multibyte character

sets, be sure to use the CHAR_LENGTH() function when you want to count characters

(e.g., when you're doing SUBSTRING() operations). The same caution holds for multibyte

characters in application languages.

Another possible surprise is index limitations. If you index a UTF-8 column, MySQL

has to assume each character can take up to three bytes, so the usual length restrictions

are suddenly shortened by a factor of three:

mysql> CREATE TABLE big_string(str VARCHAR(500), KEY(str)) DEFAULT CHARSET=utf8;

Query OK, 0 rows affected, 1 warning (0.06 sec)

mysql> SHOW WARNINGS;

+---------+------+---------------------------------------------------------+

| Level | Code | Message |

+---------+------+---------------------------------------------------------+

| Warning | 1071 | Specified key was too long; max key length is 999 bytes |

+---------+------+---------------------------------------------------------+

Notice that MySQL shortened the index to a 333-character prefix automatically:

mysql> SHOW CREATE TABLE big_string\G

*************************** 1. row ***************************

Table: big_string

Create Table: CREATE TABLE `big_string` (

`str` varchar(500) default NULL,

KEY `str` (`str`(333))

) ENGINE=MyISAM DEFAULT CHARSET=utf8

If you didn't notice the warning and check the table definition, you might not have

spotted that the index was created on only a prefix of the column. This will have side

effects such as disabling covering indexes.

Some people recommend that you just use UTF-8 globally to “make your life simpler.”

However, this is not necessarily a good idea if you care about performance. Many ap-

plications don't need to use UTF-8 at all, and depending on your data, UTF-8 can use

much more storage space on disk.

When deciding on a character set, it's important to consider the kind of data you will

store. For example, if you store mostly English text UTF-8 will add practically no stor-

age penalty, because most characters in the English language fit in one byte in UTF-8.

On the other hand, you might see a big difference if you store non-Latin languages such

as Russian or Arabic. An application that needs to store only Arabic could use the

cp1256 character set, which can represent all Arabic characters in one byte. But if the

application needs to store many different languages and you choose UTF-8 instead,

the very same Arabic characters will use more space. Likewise, if you convert a column

from a national character set to UTF-8, you can increase the required storage space

dramatically. If you're using InnoDB, you might increase the data size to the point that

the values don't fit on the page and require external storage, which can cause a lot of

wasted storage space and fragmentation.

High Performance MySQL

Search WWH ::

Custom Search

Home