
33.1 Text Representations

Emacs has two text representations, that is, two ways to represent text in a string or buffer. These are called unibyte and multibyte. Each string, and each buffer, uses one of these two representations. For most purposes, you can ignore the issue of representations, because Emacs converts text between them as appropriate. Occasionally in Lisp programming you will need to pay attention to the difference.

In unibyte representation, each character occupies one byte and therefore the possible character codes range from 0 to 255. Codes 0 through 127 are ASCII characters; the codes from 128 through 255 are used for one non-ASCII character set (you can choose which character set by setting the variable nonascii-insert-offset).

In multibyte representation, a character may occupy more than one byte, and as a result, the full range of Emacs character codes can be stored. The first byte of a multibyte character is always in the range 128 through 159 (octal 0200 through 0237). These values are called leading codes. The second and subsequent bytes of a multibyte character are always in the range 160 through 255 (octal 0240 through 0377); these values are trailing codes.
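The byte ranges above can be expressed as a small classifier. This is only a sketch based on the ranges stated in this section; my-classify-byte is a hypothetical helper, not an Emacs function.

```elisp
;; Hypothetical helper (not part of Emacs): classify one byte of
;; multibyte text using the ranges given in this section.
(defun my-classify-byte (byte)
  "Return `ascii', `leading-code', or `trailing-code' for BYTE (0-255)."
  (cond ((< byte 128) 'ascii)          ; 0 through 127
        ((< byte 160) 'leading-code)   ; 128 through 159
        (t 'trailing-code)))           ; 160 through 255

(my-classify-byte ?a)   ; => ascii
(my-classify-byte 128)  ; => leading-code
(my-classify-byte 255)  ; => trailing-code
```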

Some sequences of bytes are not valid in multibyte text: for example, a single isolated byte in the range 128 through 159 is not allowed. But character codes 128 through 159 can appear in multibyte text, represented as two-byte sequences. All the character codes 128 through 255 are possible (though slightly abnormal) in multibyte text; they appear in multibyte buffers and strings when you do explicit encoding and decoding (see section 33.10.7 Explicit Encoding and Decoding).

In a buffer, the buffer-local value of the variable enable-multibyte-characters specifies the representation used. The representation for a string is determined and recorded in the string when the string is constructed.

Variable: enable-multibyte-characters
This variable specifies the current buffer's text representation. If it is non-nil, the buffer contains multibyte text; otherwise, it contains unibyte text.

You cannot set this variable directly; instead, use the function set-buffer-multibyte to change a buffer's representation.
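As a minimal sketch, the following reads enable-multibyte-characters and changes a scratch buffer's representation with set-buffer-multibyte. The name my-buffer-representation is a hypothetical helper introduced only for this illustration.

```elisp
;; Hypothetical helper: report the current buffer's representation.
(defun my-buffer-representation ()
  "Return `multibyte' or `unibyte' for the current buffer."
  (if enable-multibyte-characters 'multibyte 'unibyte))

(with-temp-buffer
  ;; Do not `setq' the variable; ask Emacs to convert the buffer.
  (set-buffer-multibyte nil)
  (my-buffer-representation))   ; => unibyte

(with-temp-buffer
  (set-buffer-multibyte t)
  (my-buffer-representation))   ; => multibyte
```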

Variable: default-enable-multibyte-characters
This variable's value is entirely equivalent to (default-value 'enable-multibyte-characters), and setting this variable changes that default value. You may not set the local binding of enable-multibyte-characters in a specific buffer, but you may change the default value. Doing so is reasonable because it affects only buffers created afterward, not existing ones.

The `--unibyte' command line option does its job by setting the default value to nil early in startup.
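The default can be inspected without touching any existing buffer. The sketch below uses only default-value and a temporary buffer; my-new-buffer-multibyte-p is a hypothetical helper, and the result depends on how Emacs was started, so no particular value is shown.

```elisp
;; Hypothetical helper: does a freshly created buffer use multibyte text?
(defun my-new-buffer-multibyte-p ()
  "Return non-nil if a newly created buffer is multibyte."
  (with-temp-buffer
    enable-multibyte-characters))

;; The default value that new buffers start from.
(default-value 'enable-multibyte-characters)

;; A new buffer is expected to agree with that default.
(my-new-buffer-multibyte-p)
```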

Function: position-bytes position
Return the byte-position corresponding to buffer position position in the current buffer.

Function: byte-to-position byte-position
Return the buffer position corresponding to the byte position byte-position in the current buffer.
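The two functions are inverses of each other at character boundaries. The sketch below pairs each buffer position with its byte position and then converts back; my-byte-offsets and my-round-trip-p are hypothetical helpers written for this example.

```elisp
;; Hypothetical helper: list (POSITION . BYTE-POSITION) pairs for TEXT.
(defun my-byte-offsets (text)
  "Insert TEXT in a temporary buffer and return its position/byte pairs."
  (with-temp-buffer
    (insert text)
    (let ((pos (point-min))
          (result '()))
      (while (<= pos (point-max))
        (push (cons pos (position-bytes pos)) result)
        (setq pos (1+ pos)))
      (nreverse result))))

;; Hypothetical helper: check that converting to bytes and back is lossless.
(defun my-round-trip-p (text)
  "Return t if every position in TEXT survives a round trip through bytes."
  (with-temp-buffer
    (insert text)
    (let ((pos (point-min))
          (ok t))
      (while (<= pos (point-max))
        (unless (= pos (byte-to-position (position-bytes pos)))
          (setq ok nil))
        (setq pos (1+ pos)))
      ok)))

;; ASCII characters occupy one byte each, so the two numbers coincide.
(my-byte-offsets "abc")  ; => ((1 . 1) (2 . 2) (3 . 3) (4 . 4))
(my-round-trip-p "abc")  ; => t
```

In a multibyte buffer that holds non-ASCII characters, the byte positions would be expected to run ahead of the buffer positions, since such characters occupy more than one byte.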

Function: multibyte-string-p string
Return t if string is a multibyte string, and nil otherwise.
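Because a string records its representation when it is constructed, a string extracted from a buffer reflects that buffer's representation. The sketch below illustrates this with a hypothetical helper, my-buffer-string-multibyte-p, which is not part of Emacs.

```elisp
;; Hypothetical helper: extract a string from a temporary buffer whose
;; representation is chosen by MULTIBYTE, then test the string.
(defun my-buffer-string-multibyte-p (multibyte)
  "Return t if a string taken from a temp buffer is multibyte.
MULTIBYTE non-nil makes the buffer multibyte, nil makes it unibyte."
  (with-temp-buffer
    (set-buffer-multibyte multibyte)
    (insert "abc")
    (multibyte-string-p (buffer-string))))

(my-buffer-string-multibyte-p nil)  ; => nil
(my-buffer-string-multibyte-p t)    ; => t
```

Note that the text is the same three ASCII characters in both cases; only the recorded representation differs.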



This document was generated on May 2, 2002 using texi2html