Go to the first, previous, next, last section, table of contents.


Arrays in awk

An array is a table of values, called elements. The elements of an array are distinguished by their indices. Indices may be either numbers or strings. awk maintains a single set of names that may be used for naming variables, arrays and functions (see section User-defined Functions). Thus, you cannot have a variable and an array with the same name in the same awk program.

Introduction to Arrays

The awk language provides one-dimensional arrays for storing groups of related strings or numbers.

Every awk array must have a name. Array names have the same syntax as variable names; any valid variable name would also be a valid array name. But you cannot use one name in both ways (as an array and as a variable) in one awk program.

Arrays in awk superficially resemble arrays in other programming languages; but there are fundamental differences. In awk, you don't need to specify the size of an array before you start to use it. Additionally, any number or string in awk may be used as an array index, not just consecutive integers.

In most other languages, you have to declare an array and specify how many elements or components it contains. In such languages, the declaration causes a contiguous block of memory to be allocated for that many elements. An index in the array usually must be a positive integer; for example, the index zero specifies the first element in the array, which is actually stored at the beginning of the block of memory. Index one specifies the second element, which is stored in memory right after the first element, and so on. It is impossible to add more elements to the array, because it has room for only as many elements as you declared. (Some languages allow arbitrary starting and ending indices, e.g., `15 .. 27', but the size of the array is still fixed when the array is declared.)

A contiguous array of four elements might look like this, conceptually, if the element values are eight, "foo", "" and 30:

Only the values are stored; the indices are implicit from the order of the values. Eight is the value at index zero, because eight appears in the position with zero elements before it.

Arrays in awk are different: they are associative. This means that each array is a collection of pairs: an index, and its corresponding array element value:

Element 4     Value 30
Element 2     Value "foo"
Element 1     Value 8
Element 3     Value ""

We have shown the pairs in jumbled order because their order is irrelevant.

One advantage of associative arrays is that new pairs can be added at any time. For example, suppose we add to the above array a tenth element whose value is "number ten". The result is this:

Element 10    Value "number ten"
Element 4     Value 30
Element 2     Value "foo"
Element 1     Value 8
Element 3     Value ""

Now the array is sparse, which just means some indices are missing: it has elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.

Another consequence of associative arrays is that the indices don't have to be positive integers. Any number, or even a string, can be an index. For example, here is an array which translates words from English into French:

Element "dog" Value "chien"
Element "cat" Value "chat"
Element "one" Value "un"
Element 1     Value "un"

Here we decided to translate the number one in both spelled-out and numeric form--thus illustrating that a single array can have both numbers and strings as indices. (In fact, array subscripts are always strings; this is discussed in more detail in section Using Numbers to Subscript Arrays.)

The value of IGNORECASE has no effect upon array subscripting. You must use the exact same string value to retrieve an array element as you used to store it.

When awk creates an array for you, e.g., with the split built-in function, that array's indices are consecutive integers starting at one. (See section Built-in Functions for String Manipulation.)

Referring to an Array Element

The principal way of using an array is to refer to one of its elements. An array reference is an expression which looks like this:

array[index]

Here, array is the name of an array. The expression index is the index of the element of the array that you want.

The value of the array reference is the current value of that array element. For example, foo[4.3] is an expression for the element of array foo at index `4.3'.

If you refer to an array element that has no recorded value, the value of the reference is "", the null string. This includes elements to which you have not assigned any value, and elements that have been deleted (see section The delete Statement). Such a reference automatically creates that array element, with the null string as its value. (In some cases, this is unfortunate, because it might waste memory inside awk.)

You can find out if an element exists in an array at a certain index with the expression:

index in array

This expression tests whether or not the particular index exists, without the side effect of creating that element if it is not present. The expression has the value one (true) if array[index] exists, and zero (false) if it does not exist.

For example, to test whether the array frequencies contains the index `2', you could write this statement:

if (2 in frequencies)
    print "Subscript 2 is present."

Note that this is not a test of whether or not the array frequencies contains an element whose value is two. (There is no way to do that except to scan all the elements.) Also, this does not create frequencies[2], while the following (incorrect) alternative would do so:

if (frequencies[2] != "")
    print "Subscript 2 is present."

Assigning Array Elements

Array elements are lvalues: they can be assigned values just like awk variables:

array[subscript] = value

Here array is the name of your array. The expression subscript is the index of the element of the array that you want to assign a value. The expression value is the value you are assigning to that element of the array.

Basic Array Example

The following program takes a list of lines, each beginning with a line number, and prints them out in order of line number. The line numbers are not in order, however, when they are first read: they are scrambled. This program sorts the lines by making an array using the line numbers as subscripts. It then prints out the lines in sorted order of their numbers. It is a very simple program, and gets confused if it encounters repeated numbers, gaps, or lines that don't begin with a number.

{
  if ($1 > max)
    max = $1
  arr[$1] = $0
}

END {
  for (x = 1; x <= max; x++)
    print arr[x]
}

The first rule keeps track of the largest line number seen so far; it also stores each line into the array arr, at an index that is the line's number.

The second rule runs after all the input has been read, to print out all the lines.

When this program is run with the following input:

5  I am the Five man
2  Who are you?  The new number two!
4  . . . And four on the floor
1  Who is number one?
3  I three you.

its output is this:

1  Who is number one?
2  Who are you?  The new number two!
3  I three you.
4  . . . And four on the floor
5  I am the Five man

If a line number is repeated, the last line with a given number overrides the others.

Gaps in the line numbers can be handled with an easy improvement to the program's END rule:

END {
  for (x = 1; x <= max; x++)
    if (x in arr)
      print arr[x]
}

Scanning All Elements of an Array

In programs that use arrays, you often need a loop that executes once for each element of an array. In other languages, where arrays are contiguous and indices are limited to positive integers, this is easy: you can find all the valid indices by counting from the lowest index up to the highest. This technique won't do the job in awk, since any number or string can be an array index. So awk has a special kind of for statement for scanning an array:

for (var in array)
  body

This loop executes body once for each index in array that your program has previously used, with the variable var set to that index.

Here is a program that uses this form of the for statement. The first rule scans the input records and notes which words appear (at least once) in the input, by storing a one into the array used with the word as index. The second rule scans the elements of used to find all the distinct words that appear in the input. It prints each word that is more than 10 characters long, and also prints the number of such words. See section Built-in Functions for String Manipulation, for more information on the built-in function length.

# Record a 1 for each word that is used at least once.
{
    for (i = 1; i <= NF; i++)
        used[$i] = 1
}

# Find number of distinct words more than 10 characters long.
END {
    for (x in used)
        if (length(x) > 10) {
            ++num_long_words
            print x
        }
    print num_long_words, "words longer than 10 characters"
}

See section Generating Word Usage Counts, for a more detailed example of this type.

The order in which elements of the array are accessed by this statement is determined by the internal arrangement of the array elements within awk and cannot be controlled or changed. This can lead to problems if new elements are added to array by statements in the loop body; you cannot predict whether or not the for loop will reach them. Similarly, changing var inside the loop may produce strange results. It is best to avoid such things.

The delete Statement

You can remove an individual element of an array using the delete statement:

delete array[index]

Once you have deleted an array element, you can no longer obtain any value the element once had. It is as if you had never referred to it and had never given it any value.

Here is an example of deleting elements in an array:

for (i in frequencies)
  delete frequencies[i]

This example removes all the elements from the array frequencies.

If you delete an element, a subsequent for statement to scan the array will not report that element, and the in operator to check for the presence of that element will return zero (i.e. false):

delete foo[4]
if (4 in foo)
    print "This will never be printed"

It is important to note that deleting an element is not the same as assigning it a null value (the empty string, "").

foo[4] = ""
if (4 in foo)
  print "This is printed, even though foo[4] is empty"

It is not an error to delete an element that does not exist.

You can delete all the elements of an array with a single statement, by leaving off the subscript in the delete statement.

delete array

This ability is a gawk extension; it is not available in compatibility mode (see section Command Line Options).

Using this version of the delete statement is about three times more efficient than the equivalent loop that deletes each element one at a time.

The following statement provides a portable, but non-obvious way to clear out an array.

# thanks to Michael Brennan for pointing this out
split("", array)

The split function (see section Built-in Functions for String Manipulation) clears out the target array first. This call asks it to split apart the null string. Since there is no data to split out, the function simply clears the array and then returns.

Using Numbers to Subscript Arrays

An important aspect of arrays to remember is that array subscripts are always strings. If you use a numeric value as a subscript, it will be converted to a string value before it is used for subscripting (see section Conversion of Strings and Numbers).

This means that the value of the built-in variable CONVFMT can potentially affect how your program accesses elements of an array. For example:

xyz = 12.153
data[xyz] = 1
CONVFMT = "%2.2f"
if (xyz in data)
    printf "%s is in data\n", xyz
else
    printf "%s is not in data\n", xyz

This prints `12.15 is not in data'. The first statement gives xyz a numeric value. Assigning to data[xyz] subscripts data with the string value "12.153" (using the default conversion value of CONVFMT, "%.6g"), and assigns one to data["12.153"]. The program then changes the value of CONVFMT. The test `(xyz in data)' generates a new string value from xyz, this time "12.15", since the value of CONVFMT only allows two significant digits. This test fails, since "12.15" is a different string from "12.153".

According to the rules for conversions (see section Conversion of Strings and Numbers), integer values are always converted to strings as integers, no matter what the value of CONVFMT may happen to be. So the usual case of:

for (i = 1; i <= maxsub; i++)
    do something with array[i]

will work, no matter what the value of CONVFMT.

Like many things in awk, the majority of the time things work as you would expect them to work. But it is useful to have a precise knowledge of the actual rules, since sometimes they can have a subtle effect on your programs.

Using Uninitialized Variables as Subscripts

Suppose you want to print your input data in reverse order. A reasonable attempt at a program to do so (with some test data) might look like this:

$ echo 'line 1
> line 2
> line 3' | awk '{ l[lines] = $0; ++lines }
> END {
>     for (i = lines-1; i >= 0; --i)
>        print l[i]
> }'
-| line 3
-| line 2

Unfortunately, the very first line of input data did not come out in the output!

At first glance, this program should have worked. The variable lines is uninitialized, and uninitialized variables have the numeric value zero. So, the value of l[0] should have been printed.

The issue here is that subscripts for awk arrays are always strings. And uninitialized variables, when used as strings, have the value "", not zero. Thus, `line 1' ended up stored in l[""].

The following version of the program works correctly:

{ l[lines++] = $0 }
END {
    for (i = lines - 1; i >= 0; --i)
       print l[i]
}

Here, the `++' forces lines to be numeric, thus making the "old value" numeric zero, which is then converted to "0" as the array subscript.

As we have just seen, even though it is somewhat unusual, the null string ("") is a valid array subscript (d.c.). If `--lint' is provided on the command line (see section Command Line Options), gawk will warn about the use of the null string as a subscript.

Multi-dimensional Arrays

A multi-dimensional array is an array in which an element is identified by a sequence of indices, instead of a single index. For example, a two-dimensional array requires two indices. The usual way (in most languages, including awk) to refer to an element of a two-dimensional array named grid is with grid[x,y].

Multi-dimensional arrays are supported in awk through concatenation of indices into one string. What happens is that awk converts the indices into strings (see section Conversion of Strings and Numbers) and concatenates them together, with a separator between them. This creates a single string that describes the values of the separate indices. The combined string is used as a single index into an ordinary, one-dimensional array. The separator used is the value of the built-in variable SUBSEP.

For example, suppose we evaluate the expression `foo[5,12] = "value"' when the value of SUBSEP is "@". The numbers five and 12 are converted to strings and concatenated with an `@' between them, yielding "5@12"; thus, the array element foo["5@12"] is set to "value".

Once the element's value is stored, awk has no record of whether it was stored with a single index or a sequence of indices. The two expressions `foo[5,12]' and `foo[5 SUBSEP 12]' are always equivalent.

The default value of SUBSEP is the string "\034", which contains a non-printing character that is unlikely to appear in an awk program or in most input data.

The usefulness of choosing an unlikely character comes from the fact that index values that contain a string matching SUBSEP lead to combined strings that are ambiguous. Suppose that SUBSEP were "@"; then `foo["a@b", "c"]' and `foo["a", "b@c"]' would be indistinguishable because both would actually be stored as `foo["a@b@c"]'.

You can test whether a particular index-sequence exists in a "multi-dimensional" array with the same operator `in' used for single dimensional arrays. Instead of a single index as the left-hand operand, write the whole sequence of indices, separated by commas, in parentheses:

(subscript1, subscript2, ...) in array

The following example treats its input as a two-dimensional array of fields; it rotates this array 90 degrees clockwise and prints the result. It assumes that all lines have the same number of elements.

awk '{
     if (max_nf < NF)
          max_nf = NF
     max_nr = NR
     for (x = 1; x <= NF; x++)
          vector[x, NR] = $x
}

END {
     for (x = 1; x <= max_nf; x++) {
          for (y = max_nr; y >= 1; --y)
               printf("%s ", vector[x, y])
          printf("\n")
     }
}'

When given the input:

1 2 3 4 5 6
2 3 4 5 6 1
3 4 5 6 1 2
4 5 6 1 2 3

it produces:

4 3 2 1
5 4 3 2
6 5 4 3
1 6 5 4
2 1 6 5
3 2 1 6

Scanning Multi-dimensional Arrays

There is no special for statement for scanning a "multi-dimensional" array; there cannot be one, because in truth there are no multi-dimensional arrays or elements; there is only a multi-dimensional way of accessing an array.

However, if your program has an array that is always accessed as multi-dimensional, you can get the effect of scanning it by combining the scanning for statement (see section Scanning All Elements of an Array) with the split built-in function (see section Built-in Functions for String Manipulation). It works like this:

for (combined in array) {
  split(combined, separate, SUBSEP)
  ...
}

This sets combined to each concatenated, combined index in the array, and splits it into the individual indices by breaking it apart where the value of SUBSEP appears. The split-out indices become the elements of the array separate.

Thus, suppose you have previously stored a value in array[1, "foo"]; then an element with index "1\034foo" exists in array. (Recall that the default value of SUBSEP is the character with code 034.) Sooner or later the for statement will find that index and do an iteration with combined set to "1\034foo". Then the split function is called as follows:

split("1\034foo", separate, "\034")

The result of this is to set separate[1] to "1" and separate[2] to "foo". Presto, the original sequence of separate indices has been recovered.


Go to the first, previous, next, last section, table of contents.