The system tr utility transliterates characters. For example, it is
often used to map uppercase letters into lowercase for further processing:
    generate data | tr 'A-Z' 'a-z' | process data ...
tr requires two lists of characters.(1) When processing the input, the
first character in the first list is replaced with the first character
in the second list, the second character in the first list is replaced
with the second character in the second list, and so on. If there are
more characters in the "from" list than in the "to" list, the last
character of the "to" list is used for the remaining characters in the
"from" list.
Some time ago, a user proposed that a transliteration function should
be added to gawk. The following program was written to prove that
character transliteration could be done with a user-level function.
This program is not as complete as the system tr utility, but it does
most of the job.
The translate program demonstrates one of the few weaknesses of
standard awk: dealing with individual characters is very painful,
requiring repeated use of the substr, index, and gsub built-in
functions (see String Manipulation Functions).(2)
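To make the point about character handling concrete, here is a minimal
sketch of walking a string one character at a time with substr, using
only standard awk. The function name count_char is hypothetical and is
not part of the translate program.

```awk
# Sketch: visit each character of a string with substr().
# count_char() is a hypothetical helper used only for illustration.
function count_char(str, ch,    i, n)
{
    n = 0
    for (i = 1; i <= length(str); i++)
        if (substr(str, i, 1) == ch)    # one-character slice at position i
            n++
    return n
}

BEGIN {
    print count_char("transliterate", "t")    # prints 3
}
```

Every access to a single character goes through a substr call with an
explicit position and length, which is the overhead the text refers to.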
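There are two functions. The first, stranslate, takes three
arguments:
    from
        A list of characters to translate from.
    to
        A list of characters to translate to.
    target
        The string to do the translation on.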
Associative arrays make the translation part fairly easy. t_ar holds
the "to" characters, indexed by the "from" characters. Then a simple
loop goes through the from string, one character at a time. For each
character in from, if the character appears in target, gsub is used to
change it to its corresponding character in to.
The translate function simply calls stranslate using $0 as the target.
The main program sets two global variables, FROM and TO, from the
command line, and then changes ARGV so that awk reads from the
standard input. Finally, the processing rule simply calls translate
for each record:
    # translate.awk --- do tr-like stuff
    # Bugs: does not handle things like: tr A-Z a-z, it has
    # to be spelled out. However, if `to' is shorter than `from',
    # the last character in `to' is used for the rest of `from'.

    function stranslate(from, to, target,     lf, lt, t_ar, i, c)
    {
        lf = length(from)
        lt = length(to)
        for (i = 1; i <= lt; i++)
            t_ar[substr(from, i, 1)] = substr(to, i, 1)
        if (lt < lf)
            for (; i <= lf; i++)
                t_ar[substr(from, i, 1)] = substr(to, lt, 1)
        for (i = 1; i <= lf; i++) {
            c = substr(from, i, 1)
            if (index(target, c) > 0)
                gsub(c, t_ar[c], target)
        }
        return target
    }

    function translate(from, to)
    {
        return $0 = stranslate(from, to, $0)
    }

    # main program
    BEGIN {
        if (ARGC < 3) {
            print "usage: translate from to" > "/dev/stderr"
            exit
        }
        FROM = ARGV[1]
        TO = ARGV[2]
        ARGC = 2
        ARGV[1] = "-"
    }

    {
        translate(FROM, TO)
        print
    }
While it is possible to do character transliteration in a user-level
function, it is not necessarily efficient, and we (the gawk authors)
started to consider adding a built-in function. However, shortly after
writing this program, we learned that the System V Release 4 awk had
added the toupper and tolower functions (see String Manipulation
Functions). These functions handle the vast majority of the cases
where character transliteration is necessary, and so we chose to
simply add those functions to gawk as well and then leave well enough
alone.
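As a brief illustration of the common case these functions cover, the
following sketch converts a sample string in both directions. The
variable s and its contents are made up for the example.

```awk
# Sketch: case conversion with the built-in functions.
BEGIN {
    s = "Hello, World"
    print tolower(s)    # prints: hello, world
    print toupper(s)    # prints: HELLO, WORLD
}
```

Used as a filter, a rule such as { print tolower($0) } covers the
uppercase-to-lowercase mapping shown with tr at the start of this
section.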
An obvious improvement to this program would be to set up the t_ar
array only once, in a BEGIN rule. However, this assumes that the
"from" and "to" lists will never change throughout the lifetime of the
program.
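One way this improvement might look is sketched below. It is not the
program's actual code: the function names build_table and apply_table
and the global array T_AR are hypothetical, and the lists are
hard-coded rather than taken from the command line. The sketch also
swaps the repeated gsub calls for a single pass over the target
string, which is a different technique from the one used in
stranslate.

```awk
# Sketch: build the translation table once, then translate in one pass.
# build_table(), apply_table(), and T_AR are hypothetical names.
function build_table(from, to, table,    i, lf, lt, pos)
{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lf; i++) {
        # Past the end of "to", keep reusing its last character.
        pos = (i <= lt) ? i : lt
        table[substr(from, i, 1)] = substr(to, pos, 1)
    }
}

function apply_table(str, table,    i, c, out)
{
    out = ""
    for (i = 1; i <= length(str); i++) {
        c = substr(str, i, 1)
        if (c in table)
            out = out table[c]    # translated character
        else
            out = out c           # character not in the "from" list
    }
    return out
}

BEGIN {
    build_table("abc", "xyz", T_AR)       # done once, up front
    print apply_table("cabbage", T_AR)    # prints: zxyyxge
    # In a full program, a rule such as
    #     { print apply_table($0, T_AR) }
    # would reuse T_AR for every record.
}
```

Because each input character is looked up exactly once, a swap such as
from "ab" to "ba" turns "abba" into "baab", whereas a chain of gsub
calls would rewrite characters that had already been translated.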
(1) On some older System V systems, tr may require that the lists be
written as range expressions enclosed in square brackets ([a-z]) and
quoted, to prevent the shell from attempting a file name expansion.
This is not a feature.
(2) This program was written before gawk acquired the ability to split
each character in a string into separate array elements.
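For reference, the ability mentioned here is typically used by passing
an empty string as the separator to split. The sketch below assumes
gawk; what other awk implementations do with an empty separator is not
guaranteed. The sample string and the array name chars are made up for
the example.

```awk
# gawk-specific sketch: an empty separator gives one element per character.
BEGIN {
    n = split("awk", chars, "")
    for (i = 1; i <= n; i++)
        print i, chars[i]
}
```

In gawk this prints three lines, pairing the positions 1 through 3
with the characters a, w, and k.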