Case is normally significant in regular expressions, both when matching
ordinary characters (i.e., not metacharacters) and inside character
sets. Thus, a w in a regular expression matches only a lowercase w
and not an uppercase W.
The simplest way to do a case-independent match is to use a character
list--for example, [Ww]. However, this can be cumbersome if
you need to use it often, and it can make the regular expressions harder
to read. There are two alternatives that you might prefer.
One way to perform a case-insensitive match at a particular point in the
program is to convert the data to a single case, using the tolower or
toupper built-in string functions (which we haven't discussed yet;
see String Manipulation Functions). For example:
     tolower($1) ~ /foo/ { ... }
converts the first field to lowercase before matching against it.
This works in any POSIX-compliant awk.
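As a small illustration of the two portable approaches described so far, the following sketch runs the same test both ways. The sample input and the shell variable names (input, by_list, by_tolower) are made up for this example; the awk programs themselves use only the character-list and tolower techniques discussed above.

```sh
# Hypothetical sample data: three records with mixed-case first fields.
input='Foo bar
baz FOO
fOO qux'

# Character-list approach: every letter needs its own bracket expression.
by_list=$(printf '%s\n' "$input" | awk '$1 ~ /[Ff][Oo][Oo]/ { print }')

# tolower approach: fold the first field to lowercase, then match plainly.
by_tolower=$(printf '%s\n' "$input" | awk 'tolower($1) ~ /foo/ { print }')

printf '%s\n' "$by_tolower"
```

Both commands select the first and third records. The second record is skipped because only its second field contains FOO, and the test looks at $1 alone.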
Another method, specific to gawk, is to set the variable IGNORECASE
to a nonzero value (see Built-in Variables). When IGNORECASE is not zero,
all regexp and string operations ignore case. Changing the value of
IGNORECASE dynamically controls the case-sensitivity of the program as
it runs. Case is significant by default because IGNORECASE (like most
variables) is initialized to zero:
     x = "aB"
     if (x ~ /ab/) ...   # this test will fail

     IGNORECASE = 1
     if (x ~ /ab/) ...   # now it will succeed
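The fragment above can be turned into a complete program by putting it in a BEGIN rule. This is only a sketch: it assumes that gawk is installed under that name, and the messages and the third test (which resets IGNORECASE to zero) are additions made for illustration.

```sh
out=$(gawk 'BEGIN {
    x = "aB"
    if (x ~ /ab/) print "first test matched"    # not printed: case is significant
    IGNORECASE = 1
    if (x ~ /ab/) print "second test matched"   # printed: case is now ignored
    IGNORECASE = 0
    if (x ~ /ab/) print "third test matched"    # not printed: case matters again
}')
printf '%s\n' "$out"
```

Only the second message appears, which shows that the same regexp constant behaves differently depending on the value of IGNORECASE at the moment the match is made.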
In general, you cannot use IGNORECASE to make certain rules
case-insensitive and other rules case-sensitive, because there is no
straightforward way to set IGNORECASE just for the pattern of a
particular rule.(1) To do this, use either character lists or tolower.
However, one thing that only IGNORECASE lets you do is turn
case-sensitivity on or off dynamically for all the rules at once.
IGNORECASE can be set on the command line or in a BEGIN rule
(see Other Command-Line Arguments; also see Startup and Cleanup Actions).
Setting IGNORECASE from the command line is a way to make
a program case-insensitive without having to edit it.
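For instance, the following sketch runs one unmodified program twice, once as is and once with IGNORECASE assigned as a command-line argument. It assumes gawk is installed under that name; the sample input and the shell variable names are invented for the example. The trailing - tells gawk to read standard input after the assignment has been processed. (The -v option, as in -v IGNORECASE=1, should work as well; it makes the assignment before any BEGIN rule runs.)

```sh
# Hypothetical sample data.
input='Foo
bar
FOO'

# The program /foo/ is identical in both runs; only the arguments differ.
sensitive=$(printf '%s\n' "$input" | gawk '/foo/')
insensitive=$(printf '%s\n' "$input" | gawk '/foo/' IGNORECASE=1 -)

printf 'case-sensitive run printed:   [%s]\n' "$sensitive"
printf 'case-insensitive run printed: [%s]\n' "$insensitive"
```

The first run prints nothing, because no record contains a lowercase foo. The second run prints the records Foo and FOO.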
Prior to gawk 3.0, the value of IGNORECASE affected regexp operations
only. It did not affect string comparison with ==, !=, and so on.
Beginning with version 3.0, both regexp and string comparison
operations are affected by IGNORECASE.
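The effect on string comparison can be seen with a short test such as the one below. It is a sketch that assumes gawk version 3.0 or later, installed under the name gawk; the variable names a and b are arbitrary.

```sh
out=$(gawk 'BEGIN {
    a = "ABC"; b = "abc"
    print (a == b)      # compared as strings, case significant
    IGNORECASE = 1
    print (a == b)      # compared as strings, case ignored
}')
printf '%s\n' "$out"
```

The first comparison yields 0 and the second yields 1, so the program prints those two values on separate lines.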
Beginning with gawk 3.0, the equivalences between upper-
and lowercase characters are based on the ISO-8859-1 (ISO Latin-1)
character set. This character set is a superset of the traditional 128
ASCII characters and also provides a number of characters suitable
for use with European languages.
The value of IGNORECASE has no effect if gawk is in
compatibility mode (see Command-Line Options).
Case is always significant in compatibility mode.
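A quick way to check this is to run the same command with and without compatibility mode. The sketch below assumes that gawk is installed under that name and that your version spells the compatibility option --traditional (see Command-Line Options for the spelling your version accepts); the shell variable names are invented for the example.

```sh
# Normal mode: IGNORECASE is honored, so Foo matches /foo/.
normal=$(printf 'Foo\n' | gawk -v IGNORECASE=1 '/foo/')

# Compatibility mode: IGNORECASE is ignored, so nothing matches.
compat=$(printf 'Foo\n' | gawk --traditional -v IGNORECASE=1 '/foo/')

printf 'normal mode printed:        [%s]\n' "$normal"
printf 'compatibility mode printed: [%s]\n' "$compat"
```

In the second run the assignment is accepted but has no effect on matching.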
(1) Experienced C and C++ programmers will note that it is possible,
using something like
     IGNORECASE = 1 && /foObAr/ { ... }
and
     IGNORECASE = 0 || /foobar/ { ... }
However, this is somewhat obscure and we don't recommend it.