MySQL Character Set and Collation

MySQL documentation states the following in respect to "character set" and "collation" of the data storage in the database.

A character set is a set of symbols and encodings. A collation is a set of rules for comparing characters in a character set. Let's make the distinction clear with an example of an imaginary character set.

Suppose that we have an alphabet with four letters: 'A', 'B', 'a', 'b'. We give each letter a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, the number 0 is the encoding for 'A', and the combination of all four letters and their encodings is a character set.

Suppose that we want to compare two string values, 'A' and 'B'. The simplest way to do this is to look at the encodings: 0 for 'A' and 1 for 'B'. Because 0 is less than 1, we say 'A' is less than 'B'. What we've just done is apply a collation to our character set. The collation is a set of rules (only one rule in this case): compare the encodings. We call this simplest of all possible collations a binary collation.

But what if we want to say that the lowercase and uppercase letters are equivalent? Then we would have at least two rules: (1) treat the lowercase letters 'a' and 'b' as equivalent to 'A' and 'B'; (2) then compare the encodings. We call this a case-insensitive collation. It's a little more complex than a binary collation.

-- End of MySQL documentation citation

Starting with MySQL version 4.1, there are five levels to which character sets and collations can be applied - server, database, table, column and string constant. The rules for applying character set and collation occurs according to precedence. The order of precedence is string constant, column, table, database and server. The character set and collation specific to a particular string constant and column apply first. If those don't exist, the table default is used, and if that doesn't exist the database default, and finally the server default. By default, MySQL uses the latin1 character set, and the latin1_swedish_ci collation. You can see which defaults apply to your installation by running the folloing statements:

mysql> SHOW VARIABLES LIKE 'character_set%';
| Variable_name            | Value                      |
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
7 rows in set (0.00 sec) 

The server defaults are set up in the configuration file (/etc/my.cnf). For example:


To create a table or a database that makes use of a different character set and collation, simply use the CHARACTER SET (CHARSET is a synonym) and COLLATION clauses.

mysql> CREATE TABLE tablename(column definitions, ...) CHARACTER SET euckr 
COLLATE euckr_korean_ci;
Query OK, 0 rows affected (0.12 sec)

mysql> CREATE DATABASE dbname CHARACTER SET euckr COLLATE euckr_korean_ci;
Query OK, 1 row affected (0.05 sec)

-- If you already have a database that you need to change character set, 
-- use alter database command.
mysql> ALTER DATABASE dbname CHARACTER SET euckr COLLATE euckr_korean_ci;

Client, Results and Connection CHARACTER SET and COLLATION
Setting the correct character set on the server is only a half story. In order to display characters correctly on the client (be it a GUI or Web), it is important how the client gets the data from the server. To establish a client connection to the server with a desired character set, we need to define client connection settings with variables.

set character_set_client = euckr;
set character_set_results = euckr;
set character_set_connection = euckr;

Since those three variables are likely changed at the same time, MySQL offers a command that will allow you to change all three with one command.

set names = euckr;
set names euckr collate euckr_korean_ci;

If you wish to setup a server level character set, you may also execute the mysqld with a command-line options. For example, in the /etc/rc.d/init.d/mysqld (for redhat):

/usr/bin/mysqld_safe --defaults-file=/etc/my.cnf --pid-file="$mypidfile" --character-set-server=euckr --collation-server=euckr_korean_ci >/dev/null 2>&1 &

To see available character set, use the following MySQL command.

mysql> show character set

-- To change the database to UTF8, use the following:

COLLATE utf8_general_ci; 

If you move your application from one server to another and the character set does not appear correctly, chances are that you have not set client-side character set correctly. Open a library file where database connection is made, and add the following statement just after selecting a database. This is an example using PHP programming language.

$db = mysql_connect("$host", "$user", "$pwd");
mysql_query("set names euckr collate euckr_korean_ci");

Replace the Korean character set with whichever character set that you would like to use.



Add new comment

Filtered HTML

  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <blockquote> <code> <ul> <ol> <li> <dl> <dt> <dd>
  • Lines and paragraphs break automatically.

Plain text

  • No HTML tags allowed.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Lines and paragraphs break automatically.