9. Operating on characters

These commands operate on individual characters.

9.1 tr: Translate, squeeze, and/or delete characters

9.2 expand: Convert tabs to spaces

9.3 unexpand: Convert spaces to tabs

9.1 `tr`: Translate, squeeze, and/or delete characters

Synopsis:

tr [option]... set1 [set2]

tr copies standard input to standard output, performing one of the following operations:

translate, and optionally squeeze repeated characters in the result,
squeeze repeated characters,
delete characters,
delete characters, then squeeze repeated characters from the result.

The set1 and (if given) set2 arguments define ordered sets of characters, referred to below as set1 and set2. These sets are the characters of the input that tr operates on. The `--complement' (`-c') option replaces set1 with its complement (all of the characters that are not in set1).
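As a minimal sketch of how the two sets line up, the commands below translate by position and then use `-c' to operate on everything outside set1. The sample strings and variable names are illustrative only.

```shell
# Translate by position: 'e' maps to 'i' and 'l' maps to 'p'.
translated=$(printf 'hello\n' | tr el ip)
printf '%s\n' "$translated"    # hippo

# With -c, set1 becomes "every character that is not a digit",
# so -d removes everything except the digits (including the newline).
digits=$(printf 'tel: 555-0100\n' | tr -cd 0-9)
printf '%s\n' "$digits"        # 5550100
```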

9.1.1 Specifying sets of characters

9.1.2 Translating    Changing one character to another.

9.1.3 Squeezing repeats and deleting

9.1.4 Warning messages

9.1.1 Specifying sets of characters

The format of the set1 and set2 arguments resembles the format of regular expressions; however, they are not regular expressions, only lists of characters. Most characters simply represent themselves in these strings, but the strings can contain the shorthands listed below, for convenience. Some of them can be used only in set1 or set2, as noted below.

Backslash escapes

A backslash followed by a character not listed below causes an error message.

`\a': Control-G.
`\b': Control-H.
`\f': Control-L.
`\n': Control-J.
`\r': Control-M.
`\t': Control-I.
`\v': Control-K.
`\ooo': The character with the value given by ooo, which is 1 to 3 octal digits.
`\\': A backslash.
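A short sketch of the escapes in use; the input strings are made up for illustration. The single quotes keep the shell from consuming the backslashes before tr sees them.

```shell
# '\t' and '\011' both denote Control-I (horizontal tab).
with_name=$(printf 'a\tb\n' | tr '\t' ' ')
with_octal=$(printf 'a\tb\n' | tr '\011' ' ')

# '\\' denotes one backslash; here each backslash becomes a slash.
slashes=$(printf 'C:\\dir\\file\n' | tr '\\' '/')

printf '%s\n' "$with_name" "$with_octal" "$slashes"
```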

Ranges

The notation `m-n' expands to all of the characters from m through n, in ascending order. m should collate before n; if it doesn't, an error results. As an example, `0-9' is the same as `0123456789'.

GNU tr does not support the System V syntax that uses square brackets to enclose ranges. Translations specified in that format sometimes work as expected, since the brackets are often transliterated to themselves. However, they should be avoided because they sometimes behave unexpectedly. For example, `tr -d '[0-9]'' deletes brackets as well as digits.
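The difference between a plain range and the bracketed System V form can be seen with a small made-up input:

```shell
# A range and its enumerated form delete the same characters.
ranged=$(printf 'id[42]\n' | tr -d 0-9)           # id[]
listed=$(printf 'id[42]\n' | tr -d 0123456789)    # id[]

# With brackets, '[' and ']' are ordinary members of the set,
# so they are deleted along with the digits.
bracketed=$(printf 'id[42]\n' | tr -d '[0-9]')    # id

printf '%s\n' "$ranged" "$listed" "$bracketed"
```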

Many historically common and even accepted uses of ranges are not portable. For example, on EBCDIC hosts using the `A-Z' range will not do what most would expect because `A' through `Z' are not contiguous as they are in ASCII. If you can rely on a POSIX compliant version of tr, then the best way to work around this is to use character classes (see below). Otherwise, it is most portable (and most ugly) to enumerate the members of the ranges.

Repeated characters

The notation `[c*n]' in set2 expands to n copies of character c. Thus, `[y*6]' is the same as `yyyyyy'. The notation `[c*]' in set2 expands to as many copies of c as are needed to make set2 as long as set1. If n begins with `0', it is interpreted in octal, otherwise in decimal.
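The following sketch exercises the three forms of the repeat notation. The inputs are illustrative, and each set2 is chosen to be exactly as long as set1.

```shell
# '[#*3]' expands to '###', so a, b, and c each map to '#'.
fixed=$(printf 'abcd\n' | tr abc '[#*3]')             # ###d

# '[#*]' expands to as many '#' as set1 has members (ten digits).
filled=$(printf 'a1b2c3\n' | tr 0-9 '[#*]')           # a#b#c#

# A count with a leading 0 is octal: '[x*010]' is eight copies of 'x',
# so the ninth member of set1 ('i') maps to 'y'.
octal=$(printf 'abcdefghi\n' | tr a-i '[x*010]y')     # xxxxxxxxy

printf '%s\n' "$fixed" "$filled" "$octal"
```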

Character classes

The notation `[:class:]' expands to all of the characters in the (predefined) class class. The characters expand in no particular order, except for the upper and lower classes, which expand in ascending order. When the `--delete' (`-d') and `--squeeze-repeats' (`-s') options are both given, any character class can be used in set2. Otherwise, only the character classes lower and upper are accepted in set2, and then only if the corresponding character class (upper and lower, respectively) is specified in the same relative position in set1. Doing this specifies case conversion. The class names are given below; an error results when an invalid class name is given.

alnum: Letters and digits.
alpha: Letters.
blank: Horizontal whitespace.
cntrl: Control characters.
digit: Digits.
graph: Printable characters, not including space.
lower: Lowercase letters.
print: Printable characters, including space.
punct: Punctuation characters.
space: Horizontal or vertical whitespace.
upper: Uppercase letters.
xdigit: Hexadecimal digits.
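A few illustrative uses of character classes follow, assuming ASCII input. Only the upper/lower pairing appears in set2, since that is the one combination allowed when translating.

```shell
# Delete every digit.
no_digits=$(printf 'a1b2c3\n' | tr -d '[:digit:]')          # abc

# Case conversion: lower in set1 paired with upper in set2.
upper=$(printf 'hello\n' | tr '[:lower:]' '[:upper:]')      # HELLO

# Keep only letters and digits by deleting the complement of alnum.
kept=$(printf 'a-b c!\n' | tr -cd '[:alnum:]')              # abc

printf '%s\n' "$no_digits" "$upper" "$kept"
```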

Equivalence classes

The syntax `[=c=]' expands to all of the characters that are equivalent to c, in no particular order. Equivalence classes are a relatively recent invention intended to support non-English alphabets. But there seems to be no standard way to define them or determine their contents. Therefore, they are not fully implemented in GNU tr; each character's equivalence class consists only of that character, which is of no particular use.

9.1.2 Translating

tr performs translation when set1 and set2 are both given and the `--delete' (`-d') option is not given. tr translates each character of its input that is in set1 to the corresponding character in set2. Characters not in set1 are passed through unchanged. When a character appears more than once in set1 and the corresponding characters in set2 are not all the same, only the final one is used. For example, these two commands are equivalent:

tr aaa xyz
tr a z
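Running both forms on the same made-up input shows that only the last mapping for `a' takes effect (this assumes GNU tr, as described here):

```shell
# 'a' is listed three times in set1; only the final pairing (a -> z) is used.
first=$(printf 'banana\n' | tr aaa xyz)
second=$(printf 'banana\n' | tr a z)

printf '%s\n' "$first" "$second"    # both lines read: bznznz
```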

A common use of tr is to convert lowercase characters to uppercase. This can be done in many ways. Here are three of them:

tr abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
tr a-z A-Z
tr '[:lower:]' '[:upper:]'

But note that using ranges like a-z above is not portable.

When tr is performing translation, set1 and set2 typically have the same length. If set1 is shorter than set2, the extra characters at the end of set2 are ignored.

On the other hand, making set1 longer than set2 is not portable; POSIX says that the result is undefined. In this situation, BSD tr pads set2 to the length of set1 by repeating the last character of set2 as many times as necessary. System V tr truncates set1 to the length of set2.

By default, GNU tr handles this case like BSD tr. When the `--truncate-set1' (`-t') option is given, GNU tr handles this case like the System V tr instead. This option is ignored for operations other than translation.
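A small sketch of the two behaviors with GNU tr, using an illustrative input in which set1 has four characters and set2 only two:

```shell
# Default (BSD-style): set2 is padded with its last character,
# so c and d also map to 'y'.
padded=$(printf 'abcd\n' | tr abcd xy)          # xyyy

# With -t (System V-style): set1 is truncated to 'ab',
# so c and d pass through unchanged.
truncated=$(printf 'abcd\n' | tr -t abcd xy)    # xycd

printf '%s\n' "$padded" "$truncated"
```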

Acting like System V tr in this case breaks the relatively common BSD idiom:

tr -cs A-Za-z0-9 '\012'

because it converts only zero bytes (the first element in the complement of set1), rather than all non-alphanumerics, to newlines.

By the way, the above idiom is not portable because it uses ranges. Assuming a POSIX compliant tr, here is a better way to write it:

tr -cs '[:alnum:]' '[\n*]'
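Applied to a sample sentence (chosen only for illustration), the portable form puts each run of alphanumerics on its own line:

```shell
# Every non-alphanumeric character becomes a newline, and -s then
# squeezes each run of newlines down to one.
words=$(printf 'Hello, world! 42\n' | tr -cs '[:alnum:]' '[\n*]')

printf '%s\n' "$words"
# Hello
# world
# 42
```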

9.1.3 Squeezing repeats and deleting

When given just the `--delete' (`-d') option, tr removes any input characters that are in set1.

When given just the `--squeeze-repeats' (`-s') option, tr replaces each input sequence of a repeated character that is in set1 with a single occurrence of that character.

When given both `--delete' and `--squeeze-repeats', tr first performs any deletions using set1, then squeezes repeats from any remaining characters using set2.

The `--squeeze-repeats' option may also be used when translating, in which case tr first performs translation, then squeezes repeats from any remaining characters using set2.
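The four combinations described above can be compared on one illustrative input:

```shell
in='aaabbbccc'

# -d alone: remove characters in set1.
deleted=$(printf '%s\n' "$in" | tr -d a)        # bbbccc

# -s alone: squeeze runs of characters in set1.
squeezed=$(printf '%s\n' "$in" | tr -s a)       # abbbccc

# -d and -s: delete using set1 ('a'), then squeeze using set2 ('b').
both=$(printf '%s\n' "$in" | tr -ds a b)        # bccc

# -s while translating: map a and b to 'x', then squeeze runs of 'x'.
mapped=$(printf '%s\n' "$in" | tr -s ab xx)     # xccc

printf '%s\n' "$deleted" "$squeezed" "$both" "$mapped"
```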

Here are some examples to illustrate various combinations of options:

Remove all zero bytes:
tr -d '\000'
Put all words on lines by themselves. This converts all non-alphanumeric characters to newlines, then squeezes each string of repeated newlines into a single newline:
tr -cs '[:alnum:]' '[\n*]'
Convert each sequence of repeated newlines to a single newline:
tr -s '\n'
Find doubled occurrences of words in a document. For example, people often write "the the" with the duplicated words separated by a newline. The Bourne shell script below works first by converting each sequence of punctuation and blank characters to a single newline. That puts each "word" on a line by itself. Next it maps all uppercase characters to lowercase, and finally it runs uniq with the `-d' option to print out only the words that were adjacent duplicates.
#!/bin/sh
cat "$@" \
  | tr -s '[:punct:][:blank:]' '\n' \
  | tr '[:upper:]' '[:lower:]' \
  | uniq -d
Deleting a small set of characters is usually straightforward. For example, to remove all `a's, `x's, and `M's you would do this:
tr -d axM
However, when `-' is one of those characters, it can be tricky because `-' has special meanings. Performing the same task as above but also removing all `-' characters, we might try tr -d -axM, but that would fail because tr would try to interpret `-a' as a command-line option. Alternatively, we could try putting the hyphen inside the string, tr -d a-xM, but that wouldn't work either because it would make tr interpret a-x as the range of characters `a'...`x' rather than the three characters `a', `-', and `x'. One way to solve the problem is to put the hyphen at the end of the list of characters:
tr -d axM-
More generally, use the equivalence class notation [=c=] with `-' (or any other character) in place of the `c':
tr -d '[=-=]axM'
Note how single quotes are used in the above example to protect the square brackets from interpretation by a shell.
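A quick check of the hyphen handling, using a made-up input and assuming GNU tr:

```shell
in='a-x-M-b'

# Hyphen inside the string: 'a-x' is a range, so 'b' is deleted too
# and the hyphens survive.
ranged=$(printf '%s\n' "$in" | tr -d a-xM)           # ---

# Hyphen at the end of the list is taken literally.
trailing=$(printf '%s\n' "$in" | tr -d axM-)         # b

# The equivalence class form names the hyphen explicitly.
equiv=$(printf '%s\n' "$in" | tr -d '[=-=]axM')      # b

printf '%s\n' "$ranged" "$trailing" "$equiv"
```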

9.1.4 Warning messages

Setting the environment variable POSIXLY_CORRECT turns off the following warning and error messages, for strict compliance with POSIX. Otherwise, the following diagnostics are issued:

When the `--delete' option is given but `--squeeze-repeats' is not, and set2 is given, GNU tr by default prints a usage message and exits, because set2 would not be used. The POSIX specification says that set2 must be ignored in this case. Silently ignoring arguments is a bad idea.
When an ambiguous octal escape is given. For example, `\400' is actually `\40' followed by the digit `0', because the value 400 octal does not fit into a single byte.
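The ambiguous octal case can be sketched as follows, assuming GNU tr and an illustrative input. The warning goes to standard error, which is discarded here so that only the translated text is captured.

```shell
# '\400' does not fit in one byte, so it is read as '\040' (a space)
# followed by the digit '0'. set1 is therefore: space, '0'.
result=$(printf 'a 0\n' | tr '\400' xy 2>/dev/null)

printf '%s\n' "$result"    # axy
```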

GNU tr does not provide complete BSD or System V compatibility. For example, it is impossible to disable interpretation of the POSIX constructs `[:alpha:]', `[=c=]', and `[c*10]'. Also, GNU tr does not delete zero bytes automatically, unlike traditional Unix versions, which provide no way to preserve zero bytes.

9.2 `expand`: Convert tabs to spaces

expand writes the contents of each given file, or standard input if none are given or for a file of `-', to standard output, with tab characters converted to the appropriate number of spaces. Synopsis:

expand [option]... [file]...

By default, expand converts all tabs to spaces. It preserves backspace characters in the output; they decrement the column count for tab calculations. The default action is equivalent to `-t 8' (set tabs every 8 columns).

The program accepts the following options. Also see 2. Common options.

`-t tab1[,tab2]...'

`--tabs=tab1[,tab2]...'

If only one tab stop is given, set the tabs tab1 spaces apart (default is 8). Otherwise, set the tabs at columns tab1, tab2, ... (numbered from 0), and replace any tabs beyond the last tabstop given with single spaces. Tabstops can be separated by blanks as well as by commas.

On older systems, expand supports an obsolete option `-tab1[,tab2]...', where tabstops must be separated by commas. POSIX 1003.1-2001 (see section 2.5 Standards conformance) does not allow this; use `-t tab1[,tab2]...' instead.
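As a rough illustration of the `-t' forms (inputs are made up; columns are numbered from 0):

```shell
# Default: tab stops every 8 columns, so the tab becomes 7 spaces.
default=$(printf 'a\tb\n' | expand)

# One value: tab stops every 4 columns, so the tab becomes 3 spaces.
every4=$(printf 'a\tb\n' | expand -t 4)

# A list: stops at columns 2 and 6; the third tab lies beyond the
# last stop and is replaced by a single space.
listed=$(printf 'a\tb\tc\td\n' | expand -t 2,6)

printf '[%s]\n' "$default" "$every4" "$listed"
```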

`-i'

`--initial'

Only convert initial tabs (those that precede every character on the line that is not a space or tab) to spaces.
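The effect of `-i' is easiest to see on a line with both a leading and an interior tab. This sketch uses an illustrative input and 4-column tab stops:

```shell
# Without -i, both tabs are converted.
all=$(printf '\ta\tb\n' | expand -t 4)

# With -i, only the leading tab is converted; the tab after 'a' is kept.
initial=$(printf '\ta\tb\n' | expand -i -t 4)

printf '[%s]\n' "$all" "$initial"
```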

9.3 `unexpand`: Convert spaces to tabs

unexpand writes the contents of each given file, or standard input if none are given or for a file of `-', to standard output, with strings of two or more space or tab characters converted to as many tabs as possible followed by as many spaces as are needed. Synopsis:

unexpand [option]... [file]...

By default, unexpand converts only initial spaces and tabs (those that precede all characters other than spaces and tabs) on each line. It preserves backspace characters in the output; they decrement the column count for tab calculations. By default, tabs are set at every 8th column.

The program accepts the following options. Also see 2. Common options.

`-t tab1[,tab2]...'

`--tabs=tab1[,tab2]...'

If only one tab stop is given, set the tabs tab1 spaces apart instead of the default 8. Otherwise, set the tabs at columns tab1, tab2, ... (numbered from 0), and leave spaces and tabs beyond the tabstops given unchanged. Tabstops can be separated by blanks as well as by commas. This option implies the `-a' option.

On older systems, unexpand supports an obsolete option `-tab1[,tab2]...', where tabstops must be separated by commas. (Unlike `-t', this obsolete option does not imply `-a'.) POSIX 1003.1-2001 (see section 2.5 Standards conformance) does not allow this; use `--first-only -t tab1[,tab2]...' instead.
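Because `-t' implies `-a', interior runs of spaces are converted as well unless `--first-only' is given. A minimal sketch with GNU unexpand and an illustrative input (four spaces, `a', three spaces, `b'):

```shell
# -t 4 alone: both runs of spaces end at a tab stop and become tabs.
all=$(printf '    a   b\n' | unexpand -t 4)

# --first-only restricts conversion to the leading run.
first=$(printf '    a   b\n' | unexpand --first-only -t 4)

# Show the results with tabs made visible as '>'.
printf '%s\n' "$all" "$first" | tr '\t' '>'
```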

`-a'

`--all'

Convert all strings of two or more spaces or tabs, not just initial ones, to tabs.
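The default and `-a' behaviors can be compared on one line, assuming GNU unexpand and the default 8-column tab stops. The illustrative input is eight spaces, `a', seven spaces, `b', so that both runs of spaces end exactly at a tab stop.

```shell
line=$(printf '%8sa%7sb' '' '')

# Default: only the leading run of spaces is converted.
initial=$(printf '%s\n' "$line" | unexpand)

# -a: the interior run is converted as well.
all=$(printf '%s\n' "$line" | unexpand -a)

# Show the results with tabs made visible as '>'.
printf '%s\n' "$initial" "$all" | tr '\t' '>'
```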

This document was generated by Jeff Bailey on December, 28 2002 using texi2html
