Matching “Accented” or Composite Characters
Problem
You want characters to match regardless of the form in which they are entered.
Solution
Compile the Pattern with the flags argument Pattern.CANON_EQ for “canonical equality.”
Discussion
Composite characters can be entered in various forms. Consider, as a single example, the
letter e with an acute accent. In Unicode text this character may appear either as the
single character é (Unicode character \u00e9) or as a two-character sequence (e followed
by the Unicode combining acute accent, \u0301). To let you match such characters
regardless of which of the possibly multiple “fully decomposed” forms is used to enter
them, the regex package has an option for “canonical matching,” which treats all
canonically equivalent forms as the same. This option is enabled by passing CANON_EQ as
(one of) the flags in the second argument to Pattern.compile(). This program tests a
pattern compiled with CANON_EQ against several forms; only the canonically equivalent
ones are expected to match:
import java.util.regex.Pattern;

public class CanonEqDemo {
    public static void main(String[] args) {
        String pattStr = "\u00e9gal"; // egal
        String[] input = {
            "\u00e9gal",  // egal - this one had better match :-)
            "e\u0301gal", // e + "Combining acute accent"
            "e\u02cagal", // e + "modifier letter acute accent"
            "e'gal",      // e + single quote
            "e\u00b4gal", // e + Latin-1 "acute"
        };
        // CANON_EQ makes canonically equivalent character sequences match
        Pattern pattern = Pattern.compile(pattStr, Pattern.CANON_EQ);
        for (int i = 0; i < input.length; i++) {
            if (pattern.matcher(input[i]).matches()) {
                System.out.println(
                    pattStr + " matches input " + input[i]);
            } else {
                System.out.println(
                    pattStr + " does not match input " + input[i]);
            }
        }
    }
}
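
To isolate what the flag changes, here is a minimal sketch that compiles the same pattern twice, once with no flags and once with CANON_EQ, and tests each against the decomposed input. The class name CanonEqContrast and the helper matchesWith are hypothetical names introduced for illustration; they are not part of the program above.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original program.
public class CanonEqContrast {

    // Returns true if the entire input matches pattStr compiled with the given flags.
    public static boolean matchesWith(String pattStr, int flags, String input) {
        return Pattern.compile(pattStr, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String pattStr = "\u00e9gal";     // precomposed e-acute
        String decomposed = "e\u0301gal"; // e + combining acute accent

        // Flags value 0 means no flags: characters are compared as entered
        System.out.println("no flags: " + matchesWith(pattStr, 0, decomposed));
        // With CANON_EQ the two canonically equivalent forms are treated alike
        System.out.println("CANON_EQ: " + matchesWith(pattStr, Pattern.CANON_EQ, decomposed));
    }
}
```

Run as written, this should print "no flags: false" followed by "CANON_EQ: true". Because the flags argument is a bit mask, CANON_EQ can be combined with other flags using the bitwise OR operator, for example Pattern.CANON_EQ | Pattern.CASE_INSENSITIVE.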