The functions in this section look at or change the text of one or more
strings.
Optional parameters are enclosed in square brackets ([ ]).
Those functions that are specific to gawk are marked with a
pound sign (#):
asort(source [, dest]) #
asort is a gawk-specific extension, returning the number of
elements in the array source. The contents of source are
sorted using gawk's normal rules for comparing values, and the indices
of the sorted values of source are replaced with sequential
integers starting with one. If the optional array dest is specified,
then source is duplicated into dest. dest is then
sorted, leaving the indices of source unchanged.
For example, if the contents of a are as follows:

     a["last"] = "de"
     a["first"] = "sac"
     a["middle"] = "cul"

A call to asort:

     asort(a)

results in the following contents of a:

     a[1] = "cul"
     a[2] = "de"
     a[3] = "sac"

The asort function is described in more detail in
Sorting Array Values and Indices with gawk.
asort is a gawk extension; it is not available
in compatibility mode (see Command-Line Options).
index(in, find)
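This searches the string in for the first occurrence of the string
find, and returns the position in characters where that occurrence
begins in the string in. For example:

     $ awk 'BEGIN { print index("peanut", "an") }'
     -| 3

If find is not found, index returns zero.
(Remember that string indices in awk start at one.)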
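
Because index returns zero when find is absent, its result can be used
directly as a condition. The following is a minimal sketch, assuming a
POSIX-compatible awk is on the PATH; the shell variable out exists only
to capture the output for illustration.

```shell
# index() yields 0 (false) when the substring is absent,
# so the return value can be tested directly.
out=$(awk 'BEGIN {
    if (index("peanut", "nut"))
        print "found at", index("peanut", "nut")
    if (!index("peanut", "xyz"))
        print "xyz not found"
}')
echo "$out"
```

This prints "found at 4" followed by "xyz not found".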
length([string])
This returns the number of characters in string. If string is a
number, the length of the digit string representing that number is
returned. For example, length("abcde") is 5. By
contrast, length(15 * 35) works out to 3. In this example, 15 * 35 =
525, and 525 is then converted to the string "525", which has
three characters.

If no argument is supplied, length returns the length of $0.

Note: In older versions of awk, the length function could be called
without any parentheses. Doing so is marked as "deprecated" in the
POSIX standard. This means that while a program can do this,
it is a feature that can eventually be removed from a future
version of the standard. Therefore, for programs to be maximally portable,
always supply the parentheses.
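
The difference between the no-argument form and an explicit argument can
be seen side by side. This is only a sketch, assuming a POSIX-compatible
awk; the input lines and the shell variable out are made up for the
illustration.

```shell
# length() with empty parentheses measures the whole record ($0);
# length($1) measures only the first field.
out=$(printf 'abc\nhello world\n' | awk '{ print length(), length($1) }')
echo "$out"
```

The first line has length 3 for both the record and its only field; the
second record is 11 characters long, while its first field is 5.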
match(string, regexp [, array])
The match function searches string for the
longest, leftmost substring matched by the regular expression,
regexp. It returns the character position, or index,
at which that substring begins (one, if it starts at the beginning of
string). If no match is found, it returns zero.

The order of the first two arguments is backwards from most other string
functions that work with regular expressions, such as
sub and gsub. It might help to remember that
for match, the order is the same as for the ~ operator:
string ~ regexp.

The match function sets the built-in variable RSTART to
the index. It also sets the built-in variable RLENGTH to the
length in characters of the matched substring. If no match is found,
RSTART is set to zero, and RLENGTH to -1.
For example:

     {
            if ($1 == "FIND")
              regex = $2
            else {
              where = match($0, regex)
              if (where != 0)
                print "Match of", regex, "found at",
                          where, "in", $0
            }
     }

This program looks for lines that match the regular expression stored in
the variable regex. This regular expression can be changed. If the
first word on a line is FIND, regex is changed to be the
second word on that line. Therefore, if given:

     FIND ru+n
     My program runs
     but not very quickly
     FIND Melvin
     JF+KM
     This line is property of Reality Engineering Co.
     Melvin was here.

awk prints:

     Match of ru+n found at 12 in My program runs
     Match of Melvin found at 1 in Melvin was here.
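
RSTART and RLENGTH are exactly the arguments that substr needs to pull
the matched text out of the string. The sketch below assumes a
POSIX-compatible awk; the sample input and the shell variable out are
invented for the illustration.

```shell
# Extract the first run of digits from each line, and show the
# values match() leaves in RSTART and RLENGTH.
out=$(printf 'order 12345 shipped\nno digits here\n' | awk '{
    if (match($0, /[0-9]+/))
        print substr($0, RSTART, RLENGTH), RSTART, RLENGTH
    else
        print "none", RSTART, RLENGTH   # no match: 0 and -1
}')
echo "$out"
```

The first line yields "12345 7 5"; the second, which has no digits,
yields "none 0 -1".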
If array is present, it is cleared, and then the 0th element
of array is set to the entire portion of string
matched by regexp. If regexp contains parentheses,
the integer-indexed elements of array are set to contain the
portion of string matching the corresponding parenthesized
subexpression.
For example:

     $ echo foooobazbarrrrr |
     > gawk '{ match($0, /(fo+).+(bar*)/, arr)
     >           print arr[1], arr[2] }'
     -| foooo barrrrr

The array argument to match is a gawk extension. In compatibility mode
(see Command-Line Options),
using a third argument is a fatal error.
split(string, array [, fieldsep])
This function divides string into pieces separated by fieldsep
and stores the pieces in array. The first piece is stored in
array[1], the second piece in array[2], and so
forth. The string value of the third argument, fieldsep, is
a regexp describing where to split string (much as FS can
be a regexp describing where to split input records). If
fieldsep is omitted, the value of FS is used.
split returns the number of elements created.

The split function splits strings into pieces in a
manner similar to the way input lines are split into fields. For example:

     split("cul-de-sac", a, "-")

splits the string cul-de-sac into three fields using - as the
separator. It sets the contents of the array a as follows:

     a[1] = "cul"
     a[2] = "de"
     a[3] = "sac"

The value returned by this call to split is three.
As with input field-splitting, when the value of fieldsep is
" ", leading and trailing whitespace is ignored, and the elements
are separated by runs of whitespace.

Also as with input field-splitting, if fieldsep is the null string, each
individual character in the string is split into its own array element.
(This is a gawk-specific extension.)

Modern implementations of awk, including gawk, allow
the third argument to be a regexp constant (/abc/) as well as a
string (d.c.).
The POSIX standard allows this as well.

Before splitting the string, split deletes any previously existing
elements in the array array.
If string does not match fieldsep at all, array has
one element only. The value of that element is the original string.
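
The return value and the single-element case can be checked together.
This is a minimal sketch, assuming a POSIX-compatible awk; the array name
parts and the shell variable out are chosen only for the illustration.

```shell
out=$(awk 'BEGIN {
    # Three pieces: the return value doubles as the index of the last one.
    n = split("cul-de-sac", parts, "-")
    print n, parts[1], parts[n]

    # No separator in the string: one element holding the whole string.
    # The earlier contents of parts are discarded first.
    m = split("culdesac", parts, "-")
    print m, parts[1]
}')
echo "$out"
```

This prints "3 cul sac" and then "1 culdesac".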
sprintf(format, expression1, ...)
This returns (without printing) the string that printf would
have printed out with the same arguments
(see Using printf Statements for Fancier Printing).
For example:

     pival = sprintf("pi = %.2f (approx.)", 22/7)

assigns the string "pi = 3.14 (approx.)" to the variable pival.
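
Since sprintf returns its result instead of printing it, the string can
be stored and reused. The sketch below assumes a POSIX-compatible awk and
uses only integer and string conversions; the values and the shell
variable out are made up for the illustration.

```shell
out=$(awk 'BEGIN {
    # %05d zero-pads to width 5, %-6s left-justifies in width 6,
    # %x formats the number in hexadecimal.
    s = sprintf("%05d|%-6s|%x", 42, "ab", 255)
    print s
}')
echo "$out"
```

The resulting string is "00042|ab    |ff".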
strtonum(str) #
Examines str and returns its numeric value. If str begins with
a leading 0, strtonum assumes that str
is an octal number. If str begins with a leading 0x or
0X, strtonum assumes that str is a hexadecimal number.
For example:

     $ echo 0x11 |
     > gawk '{ printf "%d\n", strtonum($1) }'
     -| 17

Using the strtonum function is not the same as adding zero
to a string value; the automatic coercion of strings to numbers
works only for decimal data, not for octal or hexadecimal.(1)

strtonum is a gawk extension; it is not available
in compatibility mode (see Command-Line Options).
sub(regexp, replacement [, target])
The sub function alters the value of target.
It searches this value, which is treated as a string, for the
leftmost, longest substring matched by the regular expression regexp.
Then the entire string is
changed by replacing the matched text with replacement.
The modified string becomes the new value of target.

This function is peculiar because target is not simply
used to compute a value, and not just any expression will do--it
must be a variable, field, or array element so that sub can
store a modified value there. If this argument is omitted, then the
default is to use and alter $0.
For example:

     str = "water, water, everywhere"
     sub(/at/, "ith", str)

sets str to "wither, water, everywhere", by replacing the
leftmost longest occurrence of at with ith.

The sub function returns the number of substitutions made (either
one or zero).
If the special character & appears in replacement, it
stands for the precise substring that was matched by regexp. (If
the regexp can match more than one string, then this precise substring
may vary.) For example:

     { sub(/candidate/, "& and his wife"); print }

changes the first occurrence of candidate to candidate and his wife
on each input line.
Here is another example:

     $ awk 'BEGIN {
     >         str = "daabaaa"
     >         sub(/a+/, "C&C", str)
     >         print str
     > }'
     -| dCaaCbaaa

This shows how & can represent a nonconstant string and also
illustrates the "leftmost, longest" rule in regexp matching
(see How Much Text Matches?).

The effect of this special character (&) can be turned off by putting a
backslash before it in the string. As usual, to insert one backslash in
the string, you must write two backslashes. Therefore, write \\&
in a string constant to include a literal & in the replacement.
For example, the following shows how to replace the first | on each
line with an &:

     { sub(/\|/, "\\&"); print }
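
The two forms can be compared on the same input. This is only a sketch,
assuming a POSIX-compatible awk; the variable names bare and esc and the
shell variable out are invented for the illustration.

```shell
out=$(echo 'a|b|c' | awk '{
    bare = $0
    sub(/\|/, "[&]", bare)     # unescaped & expands to the matched text
    esc = $0
    sub(/\|/, "[\\&]", esc)    # \\& in a string constant gives a literal &
    print bare
    print esc
}')
echo "$out"
```

The first line printed is "a[|]b|c" and the second is "a[&]b|c".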
As mentioned, the third argument to sub must
be a variable, field or array reference.
Some versions of awk allow the third argument to
be an expression that is not an lvalue. In such a case, sub
still searches for the pattern and returns zero or one, but the result of
the substitution (if any) is thrown away because there is no place
to put it. Such versions of awk accept expressions
such as the following:

     sub(/USA/, "United States", "the USA and Canada")

For historical compatibility, gawk accepts erroneous code,
such as in the previous example. However, using any other nonchangeable
object as the third parameter causes a fatal error and your program
will not run.
Finally, if the regexp is not a regexp constant, it is converted into a
string, and then the value of that string is treated as the regexp to match.
gsub(regexp, replacement [, target])
This is similar to the sub function, except gsub replaces
all of the longest, leftmost, nonoverlapping matching
substrings it can find. The g in gsub stands for
"global," which means replace everywhere. For example:

     { gsub(/Britain/, "United Kingdom"); print }

replaces all occurrences of the string Britain with United Kingdom
for all input records.

The gsub function returns the number of substitutions made. If
the variable to search and alter (target) is
omitted, then the entire input record ($0) is used.
As in sub, the characters & and \ are special,
and the third argument must be assignable.
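
The return value of gsub is handy for counting occurrences. The following
is a minimal sketch, assuming a POSIX-compatible awk; the sample input and
the shell variable out are made up for the illustration.

```shell
# With no target argument, gsub() edits $0 in place and
# returns how many substitutions it made.
out=$(echo 'a b a b a' | awk '{
    n = gsub(/a/, "X")
    print n, $0
}')
echo "$out"
```

This prints "3 X b X b X".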
gensub(regexp, replacement, how [, target]) #
gensub is a general substitution function. Like sub and
gsub, it searches the target string target for matches of
the regular expression regexp. Unlike sub and gsub,
the modified string is returned as the result of the function and the
original target string is not changed. If how is a string
beginning with g or G, then it replaces all matches of
regexp with replacement. Otherwise, how is treated
as a number that indicates which match of regexp to replace. If
no target is supplied, $0 is used.

gensub provides an additional feature that is not available
in sub or gsub: the ability to specify components of a
regexp in the replacement text. This is done by using parentheses in
the regexp to mark the components and then specifying \N
in the replacement text, where N is a digit from 1 to 9.
For example:

     $ gawk '
     > BEGIN {
     >       a = "abc def"
     >       b = gensub(/(.+) (.+)/, "\\2 \\1", "g", a)
     >       print b
     > }'
     -| def abc
As with sub, you must type two backslashes in order
to get one into the string.
In the replacement text, the sequence \0 represents the entire
matched text, as does the character &.

The following example shows how you can use the third argument to control
which match of the regexp should be changed:

     $ echo a b c a b c |
     > gawk '{ print gensub(/a/, "AA", 2) }'
     -| a b c AA b c

In this case, $0 is used as the default target string.
gensub returns the new string as its result, which is
passed directly to print for printing.

If the how argument is a string that does not begin with g or
G, or if it is a number that is less than or equal to zero, only one
substitution is performed. If how is zero, gawk issues
a warning message.

If regexp does not match target, gensub's return value
is the original unchanged value of target.

gensub is a gawk extension; it is not available
in compatibility mode (see Command-Line Options).
substr(string, start [, length])
This returns a length-character-long substring of string, starting at
character number start. The first character of a string is character
number one.(2) For example, substr("washington", 5, 3) returns "ing".

If length is not present, this function returns the whole suffix of
string that begins at character number start. For example,
substr("washington", 5) returns "ington". The whole
suffix is also returned
if length is greater than the number of characters remaining
in the string, counting from character start.

If start is greater than the number of characters
in the string, substr returns the null string.
Similarly, if length is present but less than or equal to zero,
the null string is returned.
The result when start is less than one differs among awk
implementations and versions, so portable programs should not rely on it.
The string returned by substr cannot be
assigned. Thus, it is a mistake to attempt to change a portion of
a string, as shown in the following example:

     string = "abcdef"
     # try to get "abCDEf", won't work
     substr(string, 3, 3) = "CDE"

It is also a mistake to use substr as the third argument
of sub or gsub:

     gsub(/xyz/, "pdq", substr($0, 5, 20))  # WRONG

(Some commercial versions of awk do in fact let you use
substr this way, but doing so is not portable.)

If you need to replace bits and pieces of a string, combine substr
with string concatenation, in the following manner:

     string = "abcdef"
     ...
     string = substr(string, 1, 2) "CDE" substr(string, 6)
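
The concatenation idiom can be wrapped in a small user-defined function.
The sketch below assumes a POSIX-compatible awk; replace_at is a
hypothetical helper written for this illustration, not a built-in
function, and the shell variable out only captures the output.

```shell
out=$(awk '
# replace_at (hypothetical helper): return s with the len characters
# starting at position start (counting from one) replaced by repl.
function replace_at(s, start, len, repl) {
    return substr(s, 1, start - 1) repl substr(s, start + len)
}
BEGIN { print replace_at("abcdef", 3, 3, "CDE") }')
echo "$out"
```

This prints "abCDEf", the result that the direct assignment to substr
could not produce. The helper assumes that start is at least one.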
tolower(string)
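This returns a copy of string, with each uppercase character
in the string replaced with its corresponding lowercase character.
Nonalphabetic characters are left unchanged. For example,
tolower("MiXeD cAsE 123") returns "mixed case 123".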
toupper(string)
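This returns a copy of string, with each lowercase character
in the string replaced with its corresponding uppercase character.
Nonalphabetic characters are left unchanged. For example,
toupper("MiXeD cAsE 123") returns "MIXED CASE 123".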

(1) Unless you use the --non-decimal-data option, which isn't
recommended. See Allowing Nondecimal Input Data, for more information.

(2) awk numbers the characters of a string starting at one. This is
different from C and C++, in which the first character is number zero.