Matching “Accented” or Composite Characters
Problem
You want characters to match regardless of the form in which they are entered.
Solution
Compile the Pattern with the flags argument Pattern.CANON_EQ for “canonical equality.”
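As a minimal sketch of what the flag changes, the following compares the same pattern compiled with and without Pattern.CANON_EQ against a decomposed input. The class name CanonEqSketch and the helper methods matchesPlain and matchesCanonically are hypothetical names introduced only for this illustration; Pattern.compile(), Pattern.CANON_EQ, and Matcher.matches() are the standard java.util.regex API.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the original recipe.
public class CanonEqSketch {

    // Match the whole input against the regex with default flags.
    static boolean matchesPlain(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // Match the whole input against the regex with canonical equivalence enabled.
    static boolean matchesCanonically(String regex, String input) {
        return Pattern.compile(regex, Pattern.CANON_EQ).matcher(input).matches();
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9gal";  // e-acute as a single code point
        String decomposed = "e\u0301gal";  // e followed by the combining acute accent

        System.out.println("default flags: " + matchesPlain(precomposed, decomposed));
        System.out.println("CANON_EQ:      " + matchesCanonically(precomposed, decomposed));
    }
}
```

With default flags the two strings are compared code point by code point, so the decomposed input is not expected to match; with CANON_EQ the two canonically equivalent spellings should be treated as the same text.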
Discussion
Composite characters can be entered in various forms. Consider, as a single example, the letter e with an acute accent. This character may be found in various forms in Unicode text, such as the single character é (Unicode character \u00e9) or the two-character sequence é (e followed by the Unicode combining acute accent, \u0301). To allow you to match such characters regardless of which of possibly multiple “fully decomposed” forms is used to enter them, the regex package has an option for “canonical matching,” which treats any of the forms as equivalent. This option is enabled by passing CANON_EQ as (one of) the flags in the second argument to Pattern.compile(). This program shows CANON_EQ being used against several candidate forms:
import java.util.regex.Pattern;

public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // egal
        String[] input = {
            "\u00e9gal",  // egal - this one had better match :-)
            "e\u0301gal", // e + "Combining acute accent"
            "e\u02cagal", // e + "modifier letter acute accent"
            "e'gal",      // e + single quote
            "e\u00b4gal", // e + Latin-1 "acute"
        };
        // Compile once with canonical equivalence enabled
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches()) {
                System.out.println(pattStr + " matches input " + input[i]);
            } else {
                System.out.println(pattStr + " does not match input " + input[i]);