12.1 The Regular Expression for sentence-end

The symbol sentence-end is bound to the pattern that marks the end of a sentence. What should this regular expression be?

Clearly, a sentence may be ended by a period, a question mark, or an exclamation mark. Indeed, only clauses that end with one of those three characters should be considered the end of a sentence. This means that the pattern should include the character set:

[.?!]

However, we do not want forward-sentence merely to jump to a period, a question mark, or an exclamation mark, because such a character might be used in the middle of a sentence. A period, for example, is used after abbreviations. So other information is needed.

According to convention, you type two spaces after every sentence, but only one space after a period, a question mark, or an exclamation mark in the body of a sentence. So a period, a question mark, or an exclamation mark followed by two spaces is a good indicator of an end of sentence. However, in a file, the two spaces may instead be a tab or the end of a line. This means that the regular expression should include these three items as alternatives.

This group of alternatives will look like this:

\\($\\| \\|  \\)
       ^   ^^
      TAB  SPC

Here, $ indicates the end of the line, and I have pointed out where the tab and two spaces are inserted in the expression. Both are inserted by putting the actual characters into the expression.

Two backslashes, \\, are required before the parentheses and vertical bars: the first backslash quotes the following backslash when Emacs reads the string, so that a single backslash reaches the regular expression; and that remaining backslash indicates that the following character, the parenthesis or the vertical bar, is special.
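The following is a minimal sketch of this group of alternatives, not the definition Emacs itself uses. The name my-sentence-gap is hypothetical, and the tab is written with the string escape \t rather than typed literally, only so that it stays visible on the page:

```elisp
;; Hypothetical name for illustration; it is not part of Emacs.
;; "\t" stands for a literal tab character inside a Lisp string.
(defvar my-sentence-gap "\\($\\|\t\\|  \\)"
  "Regexp matching the end of a line, a tab, or two spaces.")

;; The string "\\(" holds only two characters, a backslash and
;; a parenthesis, because the first backslash quotes the second.
(length "\\(")                                ; => 2

(string-match my-sentence-gap "one.  two")    ; => 4, the two spaces
(string-match my-sentence-gap "one.\ttwo")    ; => 4, the tab
(string-match my-sentence-gap "one. two")     ; => 8, only `$' at the end
```

In the last call, a single space matches none of the alternatives in the middle of the string, so the only match is the end of the string itself.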

Also, a sentence may be followed by one or more carriage returns, like this:

[
]*

Like tabs and spaces, a carriage return is inserted into a regular expression by inserting it literally. The asterisk indicates that the <RET> is repeated zero or more times.
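As a small illustration, here is the same character set written with the string escape \n in place of the literal line break. The name my-trailing-newlines is hypothetical; this is a sketch, not code taken from Emacs:

```elisp
;; Hypothetical name for illustration; it is not part of Emacs.
;; "\n" stands for the character that the literal line break inserts.
(defvar my-trailing-newlines "[\n]*"
  "Regexp matching zero or more newline characters.")

(let ((text "\n\nNext"))
  (string-match my-trailing-newlines text)
  (match-end 0))                      ; => 2, both newlines were consumed

;; Zero repetitions also count as a match, so this succeeds at once.
(string-match my-trailing-newlines "Next")    ; => 0
```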

But a sentence end does not consist only of a period, a question mark or an exclamation mark followed by appropriate space: a closing quotation mark or a closing brace of some kind may precede the space. Indeed, more than one such mark or brace may precede the space. These require an expression that looks like this:

[]\"')}]*

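In this expression, the opening [ begins a character set, and the first ] is the first character in that set; placing it first is what makes it an ordinary member of the set rather than the end of the set. The second character is ", which is preceded by a \ to tell Emacs the " is not special, that is, it does not end the string. The last three characters are ', ), and }. The final ] closes the set, and the asterisk allows zero or more of these characters.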
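To see the set at work, the sketch below matches it against a string that begins with three closing marks. The name my-closing-marks is hypothetical and serves only this illustration:

```elisp
;; Hypothetical name for illustration; it is not part of Emacs.
(defvar my-closing-marks "[]\"')}]*"
  "Regexp matching zero or more closing quotes, brackets, or braces.")

(let ((text ")\"] rest"))
  (string-match my-closing-marks text)
  (match-string 0 text))              ; => ")\"]"
```

The match stops at the space, since a space is not a member of the set.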

All this suggests what the regular expression pattern for matching the end of a sentence should be; and, indeed, if we evaluate sentence-end we find that it returns the following value:

sentence-end
     => "[.?!][]\"')}]*\\($\\|     \\|  \\)[
]*"
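The assembled pattern can be tried out in a temporary buffer. The sketch below copies the value shown above into a hypothetical variable, my-sentence-end, with the tab and the newline written as \t and \n so they are visible; the helper my-sentence-end-positions is likewise an assumption made for this example and is not an Emacs function:

```elisp
;; Hypothetical names for illustration; neither is part of Emacs.
(defvar my-sentence-end "[.?!][]\"')}]*\\($\\|\t\\|  \\)[\n]*"
  "Copy of the sentence-end value shown in the text, using \\t and \\n.")

(defun my-sentence-end-positions (text)
  "Return the positions just after each match of `my-sentence-end' in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let (positions)
      ;; Each match consumes at least one character, so the loop ends.
      (while (re-search-forward my-sentence-end nil t)
        (push (point) positions))
      (nreverse positions))))

(my-sentence-end-positions "He said \"Go.\"  Then e.g. this.")
;; => (16 31)
```

The first position follows the period, the closing quotation mark, and the two spaces; the periods in "e.g." are skipped because neither is followed by two spaces, a tab, or the end of a line; the second position is the end of the buffer.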