Within a character list, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, using the locale's collating sequence and character set. For example, in the default C locale, [a-dx-z] is equivalent to [abcdxyz]. Many locales sort characters in dictionary order, and in these locales, [a-dx-z] is typically not equivalent to [abcdxyz]; instead it might be equivalent to [aBbCcDdxXyYz], for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value C.
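As a small illustration of the range behavior described above, the following sketch forces the C locale for a single awk invocation. The input lines are made up for the example, and it assumes a POSIX-compatible awk is installed as awk.

```shell
# Force the C locale for this one awk run so that [a-dx-z] means
# exactly the characters a, b, c, d, x, y, and z.
printf 'a\nB\nx\ne\n' | LC_ALL=C awk '/^[a-dx-z]$/'
# prints the lines "a" and "x"; "B" and "e" fall outside both ranges
```

Setting LC_ALL on the command line affects only that command, which avoids changing the locale for the rest of the shell session.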
To include one of the characters \, ], -, or ^ in a character list, put a \ in front of it. For example:

     [d\]]

matches either d or ].
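The escaping rule can be tried directly from the shell. This is a minimal sketch with made-up input; it assumes an awk that follows the behavior described here (gawk does).

```shell
# Inside single quotes the shell passes the backslash through unchanged,
# so awk sees the regexp /^[d\]]$/ : a line consisting of a single d or ].
printf 'd\n]\nx\n' | awk '/^[d\]]$/'
# prints the lines "d" and "]"
```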
This treatment of \ in character lists is compatible with other awk implementations and is also mandated by POSIX. The regular expressions in awk are a superset of the POSIX specification for Extended Regular Expressions (EREs). POSIX EREs are based on the regular expressions accepted by the traditional egrep utility.
Character classes are a new feature introduced in the POSIX standard. A character class is a special notation for describing lists of characters that have a specific attribute, but the actual characters can vary from country to country and/or from character set to character set. For example, the notion of what is an alphabetic character differs between the United States and France.
A character class is only valid in a regexp inside the brackets of a character list. Character classes consist of [:, a keyword denoting the class, and :]. Here are the character classes defined by the POSIX standard.
[:alnum:]    Alphanumeric characters.
[:alpha:]    Alphabetic characters.
[:blank:]    Space and TAB characters.
[:cntrl:]    Control characters.
[:digit:]    Numeric characters.
[:graph:]    Characters that are both printable and visible.
             (A space is printable but not visible, whereas an a is both.)
[:lower:]    Lowercase alphabetic characters.
[:print:]    Printable characters (characters that are not control characters).
[:punct:]    Punctuation characters (characters that are not letters, digits,
             control characters, or space characters).
[:space:]    Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:]    Uppercase alphabetic characters.
[:xdigit:]   Characters that are hexadecimal digits.
For example, before the POSIX standard, you had to write /[A-Za-z0-9]/ to match alphanumeric characters. If your character set had other alphabetic characters in it, this would not match them, and if your character set collated differently from ASCII, this might not even match the ASCII alphanumeric characters. With the POSIX character classes, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
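The [[:alnum:]] form above can be sketched as a one-line filter. The input is invented for the example, and the command assumes an awk with POSIX character class support (gawk has it; some older awk implementations do not).

```shell
# Keep only lines made up entirely of alphanumeric characters.
# Note the doubled brackets: [:alnum:] is valid only inside a character list.
printf 'abc123\n---\nX9\nfoo bar\n' | awk '/^[[:alnum:]]+$/'
# prints the lines "abc123" and "X9"
```

The line "foo bar" is rejected because the space is not an alphanumeric character.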
Two additional special sequences can appear in character lists. These apply to non-ASCII character sets, which can have single symbols (called collating elements) that are represented with more than one character. They can also have several characters that are equivalent for collating, or sorting, purposes. (For example, in French, a plain "e" and a grave-accented "è" are equivalent.) These sequences are:
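Collating symbols
     A collating symbol is a multicharacter collating element enclosed in [. and .]. For example, if ch is a collating element, then [[.ch.]] is a regexp that matches this collating element, whereas [ch] is a regexp that matches either c or h.

Equivalence classes
     An equivalence class is a locale-specific name for a list of characters that are equivalent. The name is enclosed in [= and =]. For example, the name e might be used to represent all of "e," "è," and "é." In this case, [[=e=]] is a regexp that matches any of e, é, or è.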
These features are very valuable in non-English-speaking locales.
Caution: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.