Character and Binary String Types
The character datatypes in Oracle are CHAR, VARCHAR2, and their “N” variants, NCHAR and NVARCHAR2. The CHAR and NCHAR types can store up to 2,000 bytes of text. The VARCHAR2 and NVARCHAR2 types can store up to 4,000 bytes of information.
■ Note Starting with Oracle 12c, the VARCHAR2, NVARCHAR2, and RAW datatypes can be configured to store up to 32,767 bytes of information. Extended datatypes are not enabled by default; therefore, unless explicitly configured, the maximum size is still 4,000 bytes for the VARCHAR2 and NVARCHAR2 datatypes and 2,000 bytes for RAW. See the “Extended Datatypes” section later in this chapter for more details.
This text is converted between various character sets as needed by the database. A character set is a binary representation of individual characters in bits and bytes. Many different character sets are available, and each is capable of representing different characters, for example:

•	The US7ASCII character set is the ASCII standard representation of 128 characters. It uses the low 7 bits of a byte to represent these 128 characters.

•	The WE8MSWIN1252 character set is a Western European character set capable of representing the 128 ASCII characters as well as 128 extended characters, using all 8 bits of a byte.

Before we get into the details of CHAR, VARCHAR2, and their “N” variants, it would benefit us to get a cursory understanding of what these different character sets mean to us.
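The difference between these two character sets can be sketched in Python, using its cp1252 codec to stand in for WE8MSWIN1252 and ascii for US7ASCII (the Oracle character set names are not Python codec names):

```python
# US7ASCII covers only the 128 ASCII characters (the low 7 bits of a byte).
print("A".encode("ascii"))  # b'A' -- code point 65 fits in 7 bits

# WE8MSWIN1252 adds 128 extended characters, using all 8 bits of a byte.
print("€".encode("cp1252"))  # b'\x80' -- a single byte above 127

# The same character simply has no representation in 7-bit ASCII.
try:
    "€".encode("ascii")
except UnicodeEncodeError as e:
    print("not representable in ASCII:", e.reason)
```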
NLS Overview
As stated earlier, NLS stands for National Language Support. NLS is a very powerful feature of the database, but one that is often not as well understood as it should be. NLS controls many aspects of our data. For example, it controls how data is sorted, and whether we see commas and a single period in a number (e.g., 1,000,000.01) or many periods and a single comma (e.g., 1.000.000,01). But most important, it controls the following:

•	Encoding of the textual data as stored persistently on disk

•	Transparent conversion of data from character set to character set

It is this transparent part that confuses people the most: it is so transparent, you cannot even really see it happening. Let's look at a small example.
Suppose you are storing 8-bit data in a WE8MSWIN1252 character set in your database, but you have some clients that connect using a 7-bit character set such as US7ASCII. These clients are not expecting 8-bit data and need to have the data from the database converted into something they can use. While this sounds wonderful, if you are not aware of it taking place, then you might well find that your data loses characters over time as the characters that are not available in US7ASCII are translated into some character that is. This is due to the character set translation taking place. In short, if you retrieve data from the database in character set 1, convert it to character set 2, and then insert it back (reversing the process), there is a very good chance that you have materially modified the data. Character set conversion is typically a process that will change the data, and you are usually mapping a large set of characters (in this example, the set of 8-bit characters) into a smaller set (that of the 7-bit characters). This is a lossy conversion: the characters get modified because it is quite simply not possible to represent every character. But this conversion must take place. If the database is storing data in a single-byte character set but the client (say, a Java application, since the Java language uses Unicode) expects it in a multibyte representation, then it must be converted simply so the client application can work with it.
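The round trip just described can be sketched in Python, again with cp1252 and ascii standing in for the Oracle character sets; the "replace" error handler mimics a conversion that substitutes a placeholder for unrepresentable characters:

```python
# Start with 8-bit data: 'ü' exists in cp1252 but not in 7-bit ASCII.
original = "Müller"
stored = original.encode("cp1252")  # what the database holds

# A 7-bit client receives the data converted to ASCII; 'ü' has no
# ASCII representation, so it is mapped to a replacement character.
client_view = stored.decode("cp1252").encode("ascii", errors="replace")
print(client_view)  # b'M?ller'

# If the client writes that value back, the original data is gone.
round_tripped = client_view.decode("ascii")
print(round_tripped == original)  # False -- a lossy conversion
```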