Regexp Operators
You can combine regular expressions with special characters, called regular expression operators or metacharacters, to increase the power and versatility of regular expressions.
The escape sequences described earlier in Escape Sequences are valid inside a regexp. They are introduced by a \ and are recognized and converted into the corresponding real characters as the very first step in processing regexps.
Here is a list of metacharacters. All characters that are not escape sequences and that are not listed in the table stand for themselves:
\
This is used to suppress the special meaning of a character when matching. For example, \$ matches the character $.
^
This matches the beginning of a string. For example, ^@chapter matches @chapter at the beginning of a string and can be used to identify chapter beginnings in Texinfo source files. The ^ is known as an anchor, because it anchors the pattern to match only at the beginning of the string.
Note that ^ does not match the beginning of a line embedded in a string. The condition is not true in the following example:
if ("line1\nLINE 2" ~ /^L/) ...
$
This is similar to ^, but it matches only at the end of a string. For example, p$ matches a record that ends with a p. The $ is an anchor and does not match the end of a line embedded in a string. The condition in the following example is not true:
if ("line1\nLINE 2" ~ /1$/) ...
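The behavior of both anchors with an embedded newline can be checked directly. The following is a minimal sketch, assuming a POSIX shell and any awk on the PATH; the variable names are illustrative only. Each comparison yields 1 for a match and 0 otherwise:

```shell
# Test both anchors against a string that contains an embedded newline.
anchors=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    start_embedded = (s ~ /^L/)   # L begins the second line, not the string
    start_string   = (s ~ /^l/)   # l begins the whole string
    end_embedded   = (s ~ /1$/)   # 1 ends the first line, not the string
    end_string     = (s ~ /2$/)   # 2 ends the whole string
    print start_embedded, start_string, end_embedded, end_string
}')
echo "$anchors"   # prints: 0 1 0 1
```

Only the patterns anchored at the very start or very end of the whole string succeed.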
.
This matches any single character, including the newline character. For example, .P matches any single character followed by a P in a string. Using concatenation, we can make a regular expression such as U.A, which matches any three-character sequence that begins with U and ends with A.
In strict POSIX mode (see Command-Line Options), . does not match the NUL character, which is a character with all bits equal to zero. Otherwise, NUL is just another character. Other versions of awk may not be able to match the NUL character.
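[...]
This is called a character list.(1) It matches any one of the characters that are enclosed in the square brackets. For example, [MVX] matches any one of the characters M, V, or X in a string. A full discussion of what can be inside the square brackets of a character list is given in Using Character Lists.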
[^ ...]
This is a complemented character list. The first character after the [ must be a ^. It matches any characters except those in the square brackets. For example, [^awk] matches any character that is not an a, w, or k.
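To see the period, the character list, and the complemented character list side by side, the patterns U.A, [MVX], and [^awk] from the text can be run against a few words. This is a small sketch under the assumption of a POSIX shell and any awk; the sample words and variable names are made up for illustration:

```shell
# For each word, report (1 = match, 0 = no match) against three patterns.
lists=$(printf '%s\n' USA UA VAX awk | awk '{
    dot  = ($0 ~ /U.A/)      # U, any one character, then A
    list = ($0 ~ /[MVX]/)    # contains an M, a V, or an X
    comp = ($0 ~ /[^awk]/)   # contains a character other than a, w, or k
    print $0, dot, list, comp
}')
echo "$lists"
```

Note that awk is the only word made up entirely of the characters a, w, and k, so it is the only one that fails to match [^awk].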
|
This is the alternation operator, and it is used to specify alternatives. The | has the lowest precedence of all the regular expression operators. For example, ^P|[[:digit:]] matches any string that matches either ^P or [[:digit:]]. This means it matches any string that starts with P or contains a digit.
The alternation applies to the largest possible regexps on either side.
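(...)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, |. For example, @(samp|code)\{[^}]+\} matches both @code{foo} and @samp{bar}. (These are Texinfo formatting control sequences.)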
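The effect of alternation and grouping can be sketched as follows. The example assumes a POSIX shell and any awk, and it writes [0-9] in place of [[:digit:]] only so that it also runs on awk versions without POSIX character classes; the input lines and variable names are illustrative:

```shell
# Compare ungrouped alternation, grouped alternation, and the Texinfo pattern.
alt=$(printf '%s\n' 'Pascal' 'top 10' 'apple Pie' '@code{foo}' '@samp{bar}' '@var{x}' |
awk '{
    loose = ($0 ~ /^P|[0-9]/)                # (^P) or ([0-9]) anywhere
    tight = ($0 ~ /^(P|[0-9])/)              # P or a digit, at the start only
    texi  = ($0 ~ /@(samp|code)\{[^}]+\}/)   # @samp{...} or @code{...}
    print $0 ": " loose, tight, texi
}')
echo "$alt"
```

The line "top 10" matches the ungrouped pattern because the ^ belongs only to the left alternative; once parentheses pull the ^ outside the alternation, it no longer matches.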
*
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.
The * repeats the smallest possible preceding expression. (Use parentheses if you want to repeat a larger expression.) It finds as many repetitions as possible. For example,
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in sample containing a string of the form (car x), (cdr x), (cadr x), and so on. Notice the escaping of the parentheses by preceding them with backslashes.
+
This symbol is similar to *, except that the preceding expression must be matched at least once. This means that wh+y would match why and whhy, but not wy, whereas wh*y would match all three of these strings. The following is a simpler way of writing the last * example:
awk '/\(c[ad]+r x\)/ { print }' sample
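To try the two equivalent forms without a file named sample, the input can be supplied on standard input. This is a sketch that assumes a POSIX shell and any awk; the five input records are invented for the illustration:

```shell
# Feed a few Lisp-like records to both spellings of the pattern.
input=$(printf '%s\n' '(car x)' '(cdr x)' '(cadr x)' '(cr x)' '(car y)')

# One [ad], then zero or more [ad].
with_star=$(printf '%s\n' "$input" | awk '/\(c[ad][ad]*r x\)/ { print }')

# One or more [ad], written with the + operator.
with_plus=$(printf '%s\n' "$input" | awk '/\(c[ad]+r x\)/ { print }')

echo "$with_star"
echo "$with_plus"
```

Both commands print (car x), (cdr x), and (cadr x); the record (cr x) is rejected because at least one a or d is required, and (car y) is rejected by the literal x.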
?
This symbol is similar to *, except that the preceding expression can be matched either once or not at all. For example, fe?d matches fed and fd, but nothing else.
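The three repetition operators can be compared on the example strings from the text. The following sketch assumes a POSIX shell and any awk; feed is an extra test word added here, and the variable names are illustrative:

```shell
# Report (1 = match, 0 = no match) for the *, +, and ? examples.
reps=$(printf '%s\n' wy why whhy fd fed feed | awk '{
    star  = ($0 ~ /wh*y/)   # zero or more h
    plus  = ($0 ~ /wh+y/)   # one or more h
    quest = ($0 ~ /fe?d/)   # zero or one e
    print $0, star, plus, quest
}')
echo "$reps"
```

The word feed does not match fe?d, because at most one e may appear between the f and the d.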
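{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
wh{3}y
Matches whhhy, but not why or whhhhy.
wh{3,5}y
Matches whhhy, whhhhy, or whhhhhy, only.
wh{2,}y
Matches whhy or whhhy, and so on.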
Interval expressions were not traditionally available in awk. They were added as part of the POSIX standard to make awk and egrep consistent with each other.
However, because old programs may use { and } in regexp constants, by default gawk does not match interval expressions in regexps. If either --posix or --re-interval is specified (see Command-Line Options), then interval expressions are allowed in regexps.
For new programs that use { and } in regexp constants, it is good practice to always escape them with a backslash. Then the regexp constants are valid and work the way you want them to, using any version of awk.(2)
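Where interval expressions are not available, the three interval examples can be spelled out with the operators described earlier. This is a hedged sketch of that workaround rather than a use of intervals themselves: whhhy stands in for wh{3}y, whhhh?h?y for wh{3,5}y, and whhh*y for wh{2,}y. It assumes a POSIX shell and any awk, and the variable names are illustrative:

```shell
# Emulate wh{3}y, wh{3,5}y, and wh{2,}y without interval expressions.
ivals=$(printf '%s\n' why whhy whhhy whhhhy whhhhhy whhhhhhy | awk '{
    exactly3 = ($0 ~ /whhhy/)       # same matches as wh{3}y
    from3to5 = ($0 ~ /whhhh?h?y/)   # same matches as wh{3,5}y
    atleast2 = ($0 ~ /whhh*y/)      # same matches as wh{2,}y
    print $0, exactly3, from3to5, atleast2
}')
echo "$ivals"
```

The spelled-out forms grow quickly as the counts increase, which is the main practical argument for interval expressions when they are enabled.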
In regular expressions, the *, +, and ? operators, as well as the braces { and }, have the highest precedence, followed by concatenation, and finally by |. As in arithmetic, parentheses can change how operators are grouped.
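The precedence rules can be made visible by anchoring a few patterns and comparing the grouped and ungrouped forms. This is a minimal sketch, assuming a POSIX shell and any awk; the test strings and variable names are invented for the illustration:

```shell
# Repetition binds tighter than concatenation, which binds tighter than |.
prec=$(printf '%s\n' abbb abab cd xxcd | awk '{
    rep_char  = ($0 ~ /^ab*$/)       # * applies to b only
    rep_group = ($0 ~ /^(ab)*$/)     # * applies to the group ab
    alt_loose = ($0 ~ /^ab|cd$/)     # (^ab) or (cd$)
    alt_group = ($0 ~ /^(ab|cd)$/)   # exactly ab or exactly cd
    print $0, rep_char, rep_group, alt_loose, alt_group
}')
echo "$prec"
```

Every test string matches the ungrouped alternation, since each one either starts with ab or ends with cd, but only cd matches once the parentheses restrict both anchors to the whole alternation.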
In POSIX awk and gawk, the *, +, and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
If gawk is in compatibility mode (see Command-Line Options), POSIX character classes and interval expressions are not available in regular expressions.
Footnotes
(1) In other literature, you may see a character list referred to as either a character set, a character class, or a bracket expression.
(2) Use two backslashes if you're using a string constant with a regexp operator or function.
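The two-backslash rule for string constants can be illustrated with a literal dollar sign. In this sketch, which assumes a POSIX shell and any awk, the regexp constant /\$/ and the string constant "\\$" both reach the matcher as \$; the input lines and variable names are illustrative:

```shell
# Match a literal dollar sign with a regexp constant and a string constant.
esc=$(printf '%s\n' 'cost: $5' 'cost: 5' | awk '{
    const = ($0 ~ /\$/)    # regexp constant: one backslash
    str   = ($0 ~ "\\$")   # string constant: two backslashes
    print const, str
}')
echo "$esc"
```

Both forms agree on each line: the first line contains a dollar sign and matches, and the second does not.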