String comparisons are somewhat more complex than numeric or temporal comparisons. Numbers sort in numeric order and dates and times sort in temporal order, but string comparisons depend not only on the specific content of the strings, but on whether they are non-binary or binary. A letter in uppercase may compare as the same or different than the same letter in lowercase, and a letter with one type of accent may be considered the same or different than that letter with another type of accent. Here is a summary:

  • A non-binary string contains characters from a particular character set, and is associated with one of the collations (sorting orders) available for the character set. Characters may consist of single or multiple bytes. A collation can be case insensitive (lettercase is not significant), case sensitive (lettercase is significant), or binary (comparisons are based on numeric character values).
  • A binary string is treated as raw bytes. It has no character set and no collation. Comparisons between binary strings are based on numeric byte values.

The rules that govern string comparison apply in several ways. They determine the result of comparisons performed explicitly with operators such as = and <, and comparisons performed implicitly by ORDER BY, GROUP BY, and DISTINCT operations.

The default character set and collation for literal strings depend on the values of the character_set_connection and collation_connection system variables. The default character set is latin1. The default collation is latin1_swedish_ci, which is case insensitive as indicated by the "_ci" at the end of the collation name. Assuming these connection settings, literal strings are not case sensitive by default. You can see this by comparing strings that differ only in lettercase:

mysql> SELECT 'Hello' = 'hello';
+-------------------+
| 'Hello' = 'hello' |
+-------------------+
|                 1 |
+-------------------+

A given collation might cause certain accented characters to compare the same as other characters. For example, 'ü' and 'ue' are different in the default latin1_swedish_ci collation, but with the latin1_german2_ci collation ("German phone-book" collation), they have the same sort value and thus compare as equal:

mysql> SELECT 'Müller' = 'Mueller';
+----------------------+
| 'Müller' = 'Mueller' |
+----------------------+
|                    0 |
+----------------------+

mysql> SET collation_connection = latin1_german2_ci;

mysql> SELECT 'Müller' = 'Mueller';
+----------------------+
| 'Müller' = 'Mueller' |
+----------------------+
|                    1 |
+----------------------+

For binary strings, lettercase is significant. However, this is not because binary strings are case sensitive per se, because binary strings have no character set. Rather, it is because uppercase and lowercase versions of a character have different numeric values.  A non-binary string can be treated as a binary string by preceding it with the BINARY keyword. If either string in a comparison is binary, both strings are treated as binary:

mysql> SELECT BINARY 'Hello' = 'hello';
+--------------------------+
| BINARY 'Hello' = 'hello' |
+--------------------------+
|                        0 |
+--------------------------+

mysql> SELECT 'Hello' = BINARY 'hello';
+--------------------------+
| 'Hello' = BINARY 'hello' |
+--------------------------+
|                        0 |
+--------------------------+

The sorting principles just described were demonstrated using literal strings, but the same principles apply to string-valued table columns. Suppose that a table t contains a column c and has the following rows:

mysql> SELECT c FROM t;
+-----------+
| c         |
+-----------+
| Hello     |
| goodbye   |
| Bonjour   |
| au revoir |
+-----------+

If c is a CHAR column that has the latin1_swedish_ci collation, it is a non-binary column with a case-insensitive collation. Uppercase and lowercase letters are treated as identical and a sort operation that uses ORDER BY produces results like this:

mysql> SELECT c FROM t ORDER BY c;
+-----------+
| c         |
+-----------+
| au revoir |
| Bonjour   |
| goodbye   |
| Hello     |
+-----------+

If c is declared as a BINARY column instead, it has no character set or collation. ORDER BY sorts using raw byte codes and produces a different result. Assuming that the values are stored on a machine that uses ASCII codes, the numeric values for uppercase letters precede those for lowercase letters and the result looks like this:

mysql> SELECT c FROM t ORDER BY c;
+-----------+
| c         |
+-----------+
| Bonjour   |
| Hello     |
| au revoir |
| goodbye   |
+-----------+

String comparison rules also apply to GROUP BY and DISTINCT operations. Suppose that t has a column c with the following contents:

mysql> SELECT c FROM t;
+---------+
| c       |
+---------+
| Hello   |
| hello   |
| Goodbye |
| goodbye |
+---------+

If c is a non-binary, case-insensitive column, GROUP BY and DISTINCT do not make lettercase distinctions:

mysql> SELECT c, COUNT(*) FROM t GROUP BY c;
+---------+----------+
| c       | COUNT(*) |
+---------+----------+
| Goodbye |        2 |
| Hello   |        2 |
+---------+----------+

mysql> SELECT DISTINCT c FROM t;
+---------+
| c       |
+---------+
| Hello   |
| Goodbye |
+---------+

On the other hand, if c is a BINARY column, those operations use byte values for sorting:

mysql> SELECT c, COUNT(*) FROM t GROUP BY c;
+---------+----------+
| c       | COUNT(*) |
+---------+----------+
| Goodbye |        1 |
| Hello   |        1 |
| goodbye |        1 |
| hello   |        1 |
+---------+----------+

mysql> SELECT DISTINCT c FROM t;
+---------+
| c       |
+---------+
| Hello   |
| hello   |
| Goodbye |
| goodbye |
+---------+

The preceding discussion shows that to understand sorting and comparison behavior for strings, it's important to know whether they are non-binary or binary. This is important when using string functions as well. String functions may treat their arguments as non-binary or binary strings, or return binary or non-binary results. It depends on the function. Here are some examples:

  • LENGTH() returns the length of a string in bytes, whereas >CHAR_LENGTH() returns the length in characters. For strings that contain only single-byte characters, the two functions return identical results. For strings that contain multi-byte characters, you should choose the function that is appropriate for the type of result you want. For example, the sjis character set includes characters that require two bytes to represent. The value of LENGTH() for any string containing such characters will be greater than the value of CHAR_LENGTH().
  • The UPPER() and LOWER() functions perform case conversion only if the argument is a non-binary string. Suppose that 'AbCd' is non-binary. In that case, the two functions return a value in the requested lettercase:

    mysql> SELECT UPPER('AbCd'), LOWER('AbCd');
    +---------------+---------------+
    | UPPER('AbCd') | LOWER('AbCd') |
    +---------------+---------------+
    | ABCD          | abcd          |
    +---------------+---------------+
    

    However, if 'AbCd' is a binary string, it has no character set. In that case, the concept of lettercase does not apply, and UPPER() and LOWER() do nothing:

    mysql> SELECT UPPER(BINARY 'AbCd'), LOWER(BINARY 'AbCd');
    +----------------------+----------------------+
    | UPPER(BINARY 'AbCd') | LOWER(BINARY 'AbCd') |
    +----------------------+----------------------+
    | AbCd                 | AbCd                 |
    +----------------------+----------------------+
    

    To make the two functions perform case conversion for a binary string, convert it to a non-binary string. For example:

    mysql> SELECT UPPER(CONVERT(BINARY 'AbCd' USING latin1));
    +--------------------------------------------+
    | UPPER(CONVERT(BINARY 'AbCd' USING latin1)) |
    +--------------------------------------------+
    | ABCD                                       |
    +--------------------------------------------+
    
  • MD5() takes a string argument and produces a 32-byte checksum represented as a string of hexadecimal digits. It treats its argument as a binary string:

    mysql> SELECT MD5('a');
    +----------------------------------+
    | MD5('a')                         |
    +----------------------------------+
    | 0cc175b9c0f1b6a831c399e269772661 |
    +----------------------------------+
    
    mysql> SELECT MD5('A');
    +----------------------------------+
    | MD5('A')                         |
    +----------------------------------+
    | 7fc56270e7a70fa81a5935b72eacbe29 |
    +----------------------------------+
    

Posted in: MySQL

Related FAQ's

Marius Ion ANGEL HOT SOFT LLC (800) 316-7677