GREP is a filter that searches input files, or the standard input, for lines that contain matches for one or more patterns called regular expressions and displays those matching lines. GREP can also search binary files and display records or buffers that contain matches.
This is a reference manual, including all the command-line options and a detailed description of regular expressions. For an overview of GREP, please read the user guide first. (A full revision history is also available.)
Four sections below describe the options in detail, by functional groups: input file options, pattern-matching options, output options, and general options.
| |
Include hidden and system files when expanding wild cards (* and ?) in file specifications. Without this option, GREP will ignore hidden and system files while searching for files that match a wild card. However, if you explicitly specify a file on the command line, GREP will always read it even if it's a hidden or system file. The |
| |||||||||||||||||||||||||
Process named input files as text or binary. (Please see Binary Files and Text Files for detailed information about the differences.) You can choose from
Setting the Up through release 6.0, GREP had a single mixed binary mode
controlled with the single-letter Only named input files are read in binary mode. Regardless of the
|
| |
Please see the section on subdirectory searches in the user guide. |
| |
Expect text lines up to txwid characters long, or process binary files in records or buffers of bnwid bytes. (If you specify only one number, it's used for both txwid and bnwid.)
txwid and bnwid default to 4096 in GREP32, and you can specify
anything from 2 to 2147483645; the default for GREP16 is 256
and you can specify 2 to 32765.
(The widths are also limited to available memory, which will
depend on your system configuration, what other programs you have
running at the time, and what you specify with the
(For full details of binary and text file modes, please see that section in the user guide.) Text mode ( The CR/LF (ASCII 13 or 10 or both) at the end of line don't count against the specified txwid. If GREP reads a long line from the input, it will break it after txwid+1 characters and treat the remainder as a separate line. The whole line gets scanned, but any match that starts before the break and ends after the break will be missed. Therefore, if possible you should set txwid large enough to hold the longest line in the file. If GREP does find any lines longer than the specified or
default txwid, it will display a warning message at the end of
execution, telling you the length of the longest line.
(This warning is
suppressed by the Record-oriented binary mode ( Files are read in binary mode, in records of bnwid bytes. Free-form binary mode ( Files are read in binary mode, in buffers of bnwid bytes.
bnwid must be an even number.
The recommended value of bnwid is at least twice the longest
string you expect to find. For instance, if you're searching for a
regex that might match up to 40 characters, you want to specify
An internal procedure ensures that if a match exists in the file it will be found, provided the match is not longer than half the buffer. (As always, if one buffer contains multiple matches only the first match in that buffer will be counted.) When GREP chooses file mode ( txwid is used as a line width for any file that is treated
as a text file, and bnwid is used as buffer width for
any file that is treated as free-form binary.
bnwid must be an even number.
If you specify only one number with |
| |||||||||||||||||
This option tells GREP how to interpret the regex(es) you enter on the command line, from keyboard, or in a file.
Basic and extended regexes are fully explained under Regular Expressions, later. An extended regex supports all the features of a basic regex plus the quantifiers ? and {...}, alternatives |, subexpressions (...), some special constructs with the backslash \, and more. The If you never specify the |
| |
GREP reads one or more regexes from file instead of taking a single regex from the command line, and reports lines from the input file(s) that match any of the regexes read from file. You must enter the regexes one per line in the file; don't put quotes around them. An empty file contains no regexes, and therefore matches nothing. file must follow the When you supply two or more regexes, GREP normally reports each line
from the input file that matches any
(at least one) of the regexes.
If you set the
(The |
| |
Ignore case, treating caps and lower case as matching each other. Caution: By default, the In GREP16, the |
| |
Set the character mapping or locale. This option is available only in GREP32, because Microsoft 16-bit C does not support setting the locale. There are three issues with locale: binary output, case-blind matching, and character classes. Details about all three are given below, after the list of mappings. While many mappings (locales) are supported in GREP32, for our purposes most are duplicates. The six unique locales are
The recommended strategy is to put an First, when displaying file contents in binary mode, GREP displays each non-printing character as the four-byte sequence <nn>, where nn is the hexadecimal value of the character. GREP32 uses the current mapping to decide what is and is not a printing character. Second, in case-blind matching ( Finally, character types and
character class names
in extended regular expressions ( You can use the supplied file |
| |
Show or count the lines that don't match instead of those that do. For the effect of the The |
| |
When multiple regexes are being sought
( For example, if you use the The quick brown fox I see a brown smudge Crazy like a fox The fox's tail is brown But if you also use the As you see from the example, with the While not actually forbidden, the For the effect of the |
Before going through the output options, let's take a moment to look at some of the possible output formats. By default, GREP's output is similar to that of DOS FIND:
    ---------- GREP.C
    op_showhead = ShowNoHeads;
    else if (op_showhead == ShowNoHeads)
    op_showhead = ShowNoHeads;
    ---------- GREP_MAT.C
    op_showhead == ShowNoHeads)
However, the /U option produces UNIX grep-style output like this:
    GREP.C: op_showhead = ShowNoHeads;
    GREP.C: else if (op_showhead == ShowNoHeads)
    GREP.C: op_showhead = ShowNoHeads;
    GREP_MAT.C: op_showhead == ShowNoHeads)
As you can see, the main difference is that DOS-style output has the filename as a header above the group of matching lines from that file, and UNIX-style output has the name of the file on every matching line.
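If it helps to see the two layouts side by side, here is a minimal Python sketch of the grouping logic. It is only an illustration, not GREP's own code: the function names dos_style and unix_style and the sample data are made up for this example, and the exact spacing of GREP's real output is not reproduced.

```python
# Hypothetical sketch: the same matches laid out in the two styles.
# `matches` maps a filename to the matching lines found in that file.

def dos_style(matches):
    """Filename once, as a header above each group of matching lines."""
    out = []
    for name, lines in matches.items():
        out.append("---------- " + name)
        out.extend(lines)
    return out

def unix_style(matches):
    """Filename repeated in front of every matching line."""
    return [name + ": " + line
            for name, lines in matches.items()
            for line in lines]

sample = {
    "GREP.C": ["op_showhead = ShowNoHeads;",
               "else if (op_showhead == ShowNoHeads)"],
    "GREP_MAT.C": ["op_showhead == ShowNoHeads)"],
}

for line in dos_style(sample):
    print(line)
for line in unix_style(sample):
    print(line)
```

The UNIX-style form is the one to pick when another program has to process the output line by line, since every line carries its own filename.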
The output options give you a lot of control over what GREP produces, but they can be confusing. Here's the executive summary:
You can display matching lines together with context lines (/P option), just the matching lines (default), just the matching portions of lines (/J option), just a count of matching lines by file (/C option), or just the names of files that contain matches (/L option).
You can display a filename header for every file examined (/B option), only the names of files that contain matches (default), or no filename headers at all (/H option).
You can precede each line with the filename (/U option) and/or the line number (/N option).
Now, in alphabetical order, here are the options that control what GREP outputs and how it is formatted.
| |
Display a header for every file examined, even if the file contains
no matches. (This option is meaningful only with DOS-style output, when the
|
| |
Display only a count of the matching lines in each file, instead of the matching lines themselves. Lines are counted, not matches. If a match occurs several
times on a line, or several regexes match the same line,
the line is counted only once. You cannot use the (For binary files, read "record or buffer" for "line". For free-form binary, the buffer size may affect how many matching buffers are found, since multiple occurrences in one buffer are counted only once.) |
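The "lines, not matches" rule can be sketched in Python as follows. Python's re module stands in for GREP's regex engine here, and the function name count_matching_lines is an invention for this example, so treat it as an illustration of the counting rule only.

```python
import re

def count_matching_lines(patterns, lines):
    """Count lines that match at least one pattern; each line counts once."""
    compiled = [re.compile(p) for p in patterns]
    return sum(1 for line in lines
               if any(rx.search(line) for rx in compiled))

lines = ["fox fox fox", "brown fox", "no match here"]

# The first line holds three occurrences of "fox" but is counted once.
one_regex = count_matching_lines(["fox"], lines)

# The second line matches both regexes but is still counted once.
two_regexes = count_matching_lines(["fox", "brown"], lines)

print(one_regex, two_regexes)
```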
| |
Don't display any filenames as headers. The grep /H "Directory" <inputfile | other program If you want to keep the file name with each extracted line, use the
|
| |
Display just the portion of each line that matches the input regex, not the whole line containing a match. If a given line contains multiple occurrences of the regex, only the first occurrence will be displayed. The If you specify multiple regexes
(
The |
| |
Display only a bare list of the names of files that contain matches, not the actual lines that match. The |
| |
Show the line number before each matching line. DOS-style output with
the ---------- GREP.C [ 144] op_showhead = ShowNoHeads; [ 178] else if (op_showhead == ShowNoHeads) [ 366] op_showhead = ShowNoHeads; ---------- GREP_MAT.C [ 98] op_showhead == ShowNoHeads) With GREP.C:144: op_showhead = ShowNoHeads; GREP.C:178: else if (op_showhead == ShowNoHeads) GREP.C:366: op_showhead = ShowNoHeads; GREP_MAT.C:98: op_showhead == ShowNoHeads) UNIX-style output is suitable for use with the excellent freeware editor Vim. When displaying a buffer from a
free-format binary file -- either
under the |
| |
Show context lines before and after each match. If you omit
after, GREP will show the same number of lines after each match
as before. Plain Either number can be 0. For instance, use If you use the ---------- GREP.C 143 if (opcount >= argc) [ 144] op_showhead = ShowNoHeads; 145 177 PRTDBG "with each matching line"); [ 178] else if (op_showhead == ShowNoHeads) 179 PRTDBG "NO"); 365 if (myToggle('L') || myToggle('U') || myToggle('H')) [ 366] op_showhead = ShowNoHeads; 367 else if (myToggle('B')) ---------- GREP_MAT.C 97 op_showwhat == ShowMatchCount || [ 98] op_showhead == ShowNoHeads) 99 headered = TRUE; As you can see, the actual matches have square brackets around the
line numbers, and the context lines do not. (In UNIX format, with the
Interactions between the
GREP16 has to allocate space for the preview lines within the
same 64 K data segment as all other data. Consequently, if you
specify a moderately large value, particularly with a large line
width ( |
| |
Show the filename with each matching line, instead of just once in a separate header. This UNIX-style output is useful with editors like Vim that can automatically jump to the file that contains a match. Some examples of UNIX-style output were given at the beginning of this section. There's one small difference from UNIX grep output: UNIX grep
suppresses the filename when there is only one input file, but GREP
assumes that if you didn't want the filename you wouldn't have
specified the |
In addition to these options, under the /R2 or /R3 option GREP reads files in binary mode, and that has a side effect on the output format.
Some combinations of output options are logically incompatible. For instance, /H/L makes no sense (don't list filenames, and list the names of files that contain matches). In such cases, GREP will turn off one of the incompatible options and tell you what it did (unless you suppress such messages with the /Q2 or /Q3 option).
The incompatibilities are just common sense, but are listed here for completeness:
/B | overrides /H but is ignored with /L or /U
/C | overrides /H, /J, /L, /N, /P
/H | ignored with /B, /C, /L, /U
/J | ignored with /C, /L
/L | overrides /B, /H, /J, /N, /P, /U but is ignored with /C
/N | ignored with /C or /L
/P | ignored with /C or /L
/U | overrides /B and /H but is ignored with /L
| |||||||||||||
Debugging information includes whether you're running GREP16 or GREP32, whether the program is registered, the contents of the environment variable, the values of all options specified or implied, the files specified, the raw and interpreted values of the regex(es), details of every file scanned, execution timings, and more. This information is normally suppressed, but you may find it helpful if GREP seems to behave in a way you don't expect or if you have a bug report. Since the debugging information can be voluminous, if you want to see it at all you will usually want to specify an output file:
You can weed through the debugging output to some extent. GREP writes the following unique strings on most lines of output, so you can send debug output to a file and then grep the file for
|
| |||||||||||||||||
Set the quietness level, to suppress messages you may not want to see.
Fatal error messages (those that force GREP to stop execution)
will always be displayed. Debug output will also be displayed, if you set the
For compatibility with earlier releases of GREP, you can still specify
a plain (The |
| |
Reset all options to their default values. If you use the The |
| |
These options control the values that GREP returns in the
DOS error level. |
| |
Display a help message and summary of options and regex forms, then exit with no further processing. The help message is longer than 25 lines, so you probably want to pipe it through more:
    grep /? | more
You can also redirect this information. For instance,
    grep /? >prn
will send the help text to the printer.
If you use certain options frequently, with the registered version of GREP you can put them in the ORS_GREP environment variable. You have the same freedom as on the command line: leading slashes or hyphens, space separation or options run together, caps or lower case.
Only options can be put in the environment variable. If you want to store a regex, put it in a file and put /F file in the environment variable.
If you have some options in the ORS_GREP environment variable but you don't want one of them for a particular run of GREP, you don't have to edit the environment variable. You can make most changes on the command line, like this:
The /Z option on the command line makes GREP disregard the environment variable (as well as any preceding options on the command line).
The numeric options /0 and /1, which set return values from GREP, override each other. The latest one specified in the environment variable or on the command line will be effective.
/D, /E, /F, /M, /P, /Q, /R, and /W in the environment variable can be overridden by different settings on the command line. Use /P0 to request no context lines. (If /D and /F are set in the environment variable, you can specify different files for them [including -] on the command line, but to clear them completely you must use the /Z option.)
The other single-letter options -- namely, /A, /B, /C, /H, /I, /J, /L, /N, /S, /U, /V, and /Y -- function as toggles, but a "+" suffix will turn them definitely on.
Extended example: Suppose you have set the environment variable as
    set ORS_GREP=/UNI
because you usually run GREP with UNIX-style output (/U option) with line numbers (/N option), ignoring case of letters (/I option). If you want to run case sensitive for one particular run of GREP, simply put the /I option on the command line to reverse the setting from the environment variable.
If you don't know what's in the environment variable, perhaps because you're on an unfamiliar machine, either put the /Z option on the command line followed by the options you want, or set them positively by specifying for instance /L+.
Finally, if you want to turn an option definitely off, without regard to the environment variable, turn it on and then toggle it. To turn off line numbers, /N+N will always work, whether N was set in the environment variable or not. (/N- might be more logical, but for historical reasons options with leading minus signs are allowed to run together, and such a usage would conflict.)
If you're ever in doubt about the interaction of options between the command line and the environment variable, simply add "/d- | more" to the end of your command line and GREP will tell you all the option settings in effect and how it interprets your regex.
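The toggle rules above can be modeled in a few lines of Python. This is a deliberately simplified, hypothetical model and not GREP's parser: the function name resolve is made up, only the toggle letters and /Z are handled, and options that take values or filenames (such as /F file or /P0) are not modeled at all.

```python
# Simplified model of the toggle rules: environment variable first,
# then the command line, processed left to right.
TOGGLES = set("ABCHIJLNSUVY")   # the toggle letters listed above

def resolve(env, cmdline):
    """Return the set of toggle options that end up switched on."""
    on = set()
    text = (env + " " + cmdline).upper()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "Z":
            on.clear()                  # /Z discards everything seen so far
        elif ch in TOGGLES:
            if i + 1 < len(text) and text[i + 1] == "+":
                on.add(ch)              # "+" suffix: definitely on
                i += 1                  # skip the "+"
            elif ch in on:
                on.discard(ch)          # plain letter: toggle off
            else:
                on.add(ch)              # plain letter: toggle on
        i += 1                          # slashes, spaces, etc. are skipped
    return on

print(sorted(resolve("/UNI", "/I")))       # /I on the command line reverses /I
print(sorted(resolve("/N", "/N+N")))       # /N+N turns line numbers off
print(sorted(resolve("/UNI", "/Z /L+")))   # /Z wipes the environment settings
```

In this model /N+N leaves N off whether or not N was set beforehand, which is the point of the "turn it on and then toggle it" trick.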
A regular expression or regex is a pattern of characters that will be compared to lines from one or more input files. A line from an input file is a match if the line, or part of it, agrees with the pattern in the regex.
A regex can be a simple text string, like mother, or it can include a bunch of special characters to express possibilities like "repeated" and "any of these characters or substrings". (If you want to search only for simple strings, use the /E0 option and ignore all this regex stuff.)
The rest of this reference manual tells you how to construct regular expressions. So much detail can be overwhelming on a first or even a second reading; therefore, you may want to begin by ignoring everything about extended regexes. You may also want to refer back to the above examples periodically. On the other hand, if you're already comfortable with regexes, you'll find additional material and tips in Mastering Regular Expressions by Jeffrey Friedl (O'Reilly & Associates).
A regex is a mix of normal characters and special characters. Here's an overview of the special characters, with hyperlinks to the places in this reference manual where they are discussed in detail.
The following characters are special if they occur outside of square brackets:
\      backslash (treat special character as normal)
\      backslash (character types, simple assertions, back references, character encoding, extended regex only)
.      period (matches any character)
*      asterisk (0 or more occurrences)
+      plus sign (1 or more occurrences)
?      question mark (0 or 1 occurrence, extended regex only)
{      left brace (repetition count, extended regex only)
[      left square bracket (start character class)
^      caret (match start of line)
$      dollar sign (match end of line)
|      vertical bar (alternatives, extended regex only)
(...)  parentheses or round brackets (subexpressions, extended regex only)
The following characters are special if they occur within square brackets:
^      caret (negate the character class)
\      backslash (treat special character as normal)
\      backslash (character encoding, extended regex only)
-      minus sign or hyphen (character range)
[:     left square bracket followed by colon (introduce a named character class, extended regex only)
]      right square bracket (end character class)
Otherwise, every character is a normal character. Any of the above characters also becomes a normal character if preceded by a backslash, as will be shown below.
gr[ea]y matches both the English and American spellings of the word for the color between white and black.
moth[a-z]* matches any word containing the letters "moth" followed by any number of letters a through z. Yes, that includes "moth" itself: see * or + for Repetition below.
"[a-z]+" matches a word in double quotes. Read it as "a double quote mark, followed by one or more letters, followed by another double quote mark."
[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9] matches a U.S. local telephone number, which is three digits, followed by a hyphen, followed by four digits. (You could express it more simply with an extended regex as [0-9]{3}-[0-9]{4} or even as \d{3}-\d{4}.)
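If you'd like to experiment with these sample regexes away from GREP, the following Python sketch tries each one against a test string. Python's re module is only a stand-in for GREP's regex engine (it happens to agree on these constructs), and the test strings are made up for the illustration.

```python
import re

# Each entry: (regex, text to search, whether a match is expected).
examples = [
    (r"gr[ea]y", "a grey sky", True),
    (r"gr[ea]y", "a gray sky", True),
    (r"moth[a-z]*", "my mother", True),
    (r"moth[a-z]*", "a moth", True),          # zero letters after "moth"
    (r'"[a-z]+"', 'he said "hello" twice', True),
    (r"[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]", "call 555-1212", True),
    (r"\d{3}-\d{4}", "call 55-1212", False),  # only two digits before the hyphen
]

results = [bool(re.search(rx, text)) for rx, text, _ in examples]

for (rx, text, expected), got in zip(examples, results):
    print(rx, repr(text), got)
```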
GREP offers two levels of regular expressions. Extended regexes have the greater power, but at a price: in some circumstances they can be slow in searches. Basic regexes offer a "core subset" of the regex capabilities. The discussion below will mark certain features as "extended regex"; all others are common to basic and extended regexes.
You'll see a full rundown on extended regexes below, but some of the features they offer are | alternatives, { } quantifiers, and ( ) subexpressions. If you want to use extended regexes, specify the /E2 option, available only in GREP32.
Normally, GREP treats your regexes as basic, since that's the only kind there was before release 6.0. Special characters listed below as "extended regex" are treated as normal characters in basic regexes.
Support for extended regular expressions, added to GREP release 6.0, is provided by the PCRE library package, release 3.5, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England. The primary site for PCRE is <ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/>. (GREP's price did not change upon the addition of this new capability.)
Different utilities define regexes differently; the following sections tell you how this GREP defines them. You can find fascinating tables of different interpretations in Jeffrey Friedl's book Mastering Regular Expressions (pages 63 and 182-183 of the 1997 edition).
A note to UNIX or Vim veterans: This GREP follows the Perl or egrep scheme, which uses | not \| for alternatives, ( ) not \( \) for subexpressions, and \b not \< \> for word boundaries. Be alert to differences from the scheme you may know.
Any normal character matches itself. Example: the regex abc matches input lines that contain the three consecutive characters a, b, and c.
You can use any character from space through character 255. When using 8-bit characters or certain special characters on the command line, see Special Rules for the Command Line below.
If you specify the /I option, any letter in your regex will match both the upper and lower case of that letter. (By default, only unaccented English letters A-Z and a-z are affected by the /I option. In GREP32, you can use the /M option to select a mapping that includes all letters.)
If you want to match a special character, you must precede it with a backslash \ in your regex. Example: to search for the string "^abc\def", you must put backslashes before the two special characters to make GREP treat them as normal characters and not give them special meanings: use \^abc\\def as your regex.
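Here is the same escaping example as a Python sketch. Python's re module is used purely as a stand-in for GREP's engine, and re.escape is a Python convenience with no counterpart claimed for GREP; it is shown only because it produces the same backslashed pattern.

```python
import re

target = r"^abc\def"       # the literal text to look for: ^abc\def
pattern = r"\^abc\\def"    # the regex from the paragraph above

found = bool(re.search(pattern, "see ^abc\\def here"))  # caret and backslash taken literally
missing = bool(re.search(pattern, "abc\\def"))          # no caret in the text, so no match
same = re.escape(target) == pattern                     # re.escape adds the same two backslashes

print(found, missing, same)
```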
The period (full stop or dot) in a regex normally matches any character. Example: o.e matches lines that contain "ode", "one", "ope", "ore", and "owe". Of course it also matches lines that contain "oae", "o e", "o$e", "o´e", and so on.
If you want to match a literal period, for instance to search for "3.50", you need a backslash before the period in your regex to turn it into a normal character (3\.50).
In binary mode, the period matches any character including ASCII 0, Ctrl-Z, carriage return, and line feed. In text mode, ASCII 0 and the rest of the line are ignored; Ctrl-Z is end of file, and carriage return or line feed marks a line break.
A period between square brackets is just a normal character. For example, [.?!] matches any of the characters that end an unquoted sentence.
A plus sign (+) after a character, character class, subexpression, or back reference matches one or more occurrences; an asterisk (*) matches zero or more occurrences. In other words, the plus sign means "one or more" and the asterisk means "any number, including none at all". (The note on greediness below applies to * and + in extended regexes.)
Example: Big.*night matches lines that contain "Big" followed by any number of any character followed by "night". Since "any number" includes "zero", that regex also matches lines that contain "Bignight".
Examples: snor+ing matches lines that contain "snoring", "snorring", "snorrring", and so on, but not "snoing". snor*ing matches "snoing", "snoring", and so on.
Used with a character class or character type, the plus sign and asterisk match any sequence of characters from the class, not only multiple occurrences of the same character. For instance, sno[rw]+ing matches lines that contain "snowing", "snorwing", "snowrring", and so on.
Obligatory example: [A-Za-z_][A-Za-z0-9_]* matches a C or C++ identifier, which is an English letter or underscore, possibly followed by any number of letters, digits, and underscores. (The square brackets enclose character classes.)
Anything followed by * will always match. For example, the regex .* would match any number of characters including none, meaning that empty and non-empty lines would match. .* is more useful as part of a regex.
Between square brackets, + and * are normal characters. For instance, the regex [*+] will match either of the two common footnote characters.
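The difference between + and * is easy to check with a short Python sketch. Python's re module stands in for GREP's engine, and the helper name hits and the word list are made up for the illustration.

```python
import re

def hits(pattern, words):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]

words = ["snoing", "snoring", "snorring", "snowing", "snorwing"]

plus = hits(r"snor+ing", words)       # needs at least one r
star = hits(r"snor*ing", words)       # r may be absent or repeated
mixed = hits(r"sno[rw]+ing", words)   # any mix of r and w, at least one

print(plus)
print(star)
print(mixed)

# fullmatch tests the whole string, which is what "is an identifier" means;
# a plain search would find "lives" inside "9lives".
ident = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
print(bool(ident.fullmatch("_tmp1")), bool(ident.fullmatch("9lives")))
```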
In an extended regex only, the question mark after a character, character class, subexpression, or back reference indicates that the construct is optional. For example, the extended regex move?able matches lines containing "moveable" and "movable", but not "moveeable". (The note on greediness below applies to ? in extended regexes.)
Anything followed by ? will always match. For example, the extended regex .? would match one character or none. Since every line contains a string of no characters (whether or not there are some additional characters on the line), every line would be a match. ? is more useful as part of an extended regex.
? is a normal character when it occurs within square brackets in an extended regex; it's always a normal character in a basic regex.
In an extended regex only, you can use braces (also called curly braces) after a character, character class, subexpression, or back reference to specify repetition. The general form is {minimum,maximum} where both numbers are in the range 0 to 65535 and minimum <= maximum. Here are the three variations:
Specify a minimum and maximum number of repetitions: Aa{1,5} matches "Aa", "Aaa", "Aaaa", "Aaaaa", or "Aaaaaa".
Specify an exact number of repetitions: [0-9]{4} matches four consecutive digits (not necessarily the same digit four times).
Specify a minimum number of repetitions: ^.{5,}$ matches lines that contain at least five characters.
Three special cases of quantifiers have already been discussed. The asterisk * is equivalent to {0,}; the plus sign + is equivalent to {1,}; and the question mark ? is equivalent to {0,1}.
The braces are normal characters in other contexts. For instance, {,3} is just four normal characters because it doesn't match any of the three variations listed above. The braces are always normal characters inside square brackets, and the right brace on its own is always a normal character.
Both braces are normal characters anywhere in a basic regex.
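The three brace forms, and the equivalences with *, + and ?, can be tried out in Python. Python's re module is only a stand-in for GREP's engine here. One caution: Python reads {,3} as "zero to three repetitions", unlike the rule stated above, so that form is left out of the sketch.

```python
import re

# {min,max}: "A" followed by one to five "a"s.
bounded = [s for s in ["A", "Aa", "Aaaaaa"] if re.fullmatch(r"Aa{1,5}", s)]

# {n}: exactly four digits, not necessarily the same digit.
exact = re.search(r"[0-9]{4}", "built in 1987").group()

# {min,}: lines of at least five characters.
at_least = [s for s in ["abcd", "abcde", "abcdef"] if re.search(r"^.{5,}$", s)]

# *, + and ? behave like {0,}, {1,} and {0,1}.
pairs = [("ab*c", "ab{0,}c"), ("ab+c", "ab{1,}c"), ("ab?c", "ab{0,1}c")]
texts = ["ac", "abc", "abbc"]
equivalent = all(
    bool(re.fullmatch(short, t)) == bool(re.fullmatch(braces, t))
    for short, braces in pairs
    for t in texts
)

print(bounded, exact, at_least, equivalent)
```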
(This is an advanced topic, probably best skipped on the first few readings of this reference manual.)
In an extended regex, you can control the "greediness" of the quantifiers { }, ?, *, and +. Normally they are all greedy, but you can make them ungreedy by putting a question mark after them: { }?, ??, *?, and +?.
What does it mean to say that quantifiers are greedy? It means that they will match as much as possible of an input line without causing the regex to fail. For example, consider the extended regex <.*> when it tries to match against some HTML like this:
    <hr><p align=center>some text <b>in bold</b> and some not
Since the * is greedy, it consumes every character from the "h" in column 2 to the "b" just before the last ">" on the line. If the * were ungreedy, it would consume the minimum characters that would let the regex match, namely just the initial "hr". To make a quantifier ungreedy, put a question mark ? after it.
Example: You can illustrate the difference by copying the above line into file TEST and then matching it against the extended regexes <.*> and <.*?> which differ by just the question mark. Since < and > are special characters to DOS, you would either use the /F option to enter those extended regexes, or enter them on the command line with encoding sequences like this:
    grep /e3 \x3c.*\x3e <test
    grep /e3 \x3c.*?\x3e <test
(\x3c and \x3e are the encoding sequences for the < and > characters.)
The first extended regex is greedy (by default), and the matching string is
    <hr><p align=center>some text <b>in bold</b>
In the second one, the * quantifier is made ungreedy by the following ?, and the matching string is
    <hr>
Unless you're using the /J option to display just the matching part of the line, why do you care about greediness? To be honest, you probably don't. In ordinary extended regexes, it doesn't make much difference whether a quantifier is greedy or not, since a greedy quantifier consumes as many characters as possible without causing the regex to fail and an ungreedy one consumes as few as possible but again without causing the regex to fail. Either way, if a match can be squeezed out it will be.
Where greediness can make a difference is in more complex extended regexes with capturing subexpressions and back references.
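The greedy and ungreedy results for the HTML sample line can be reproduced with a short Python sketch. Python's re module is used as a stand-in for GREP's extended regexes; the *? notation works the same way in both.

```python
import re

line = "<hr><p align=center>some text <b>in bold</b> and some not"

greedy = re.search(r"<.*>", line).group()     # * takes as much as it can
ungreedy = re.search(r"<.*?>", line).group()  # *? takes as little as it can

print(greedy)    # <hr><p align=center>some text <b>in bold</b>
print(ungreedy)  # <hr>
```

Notice that both regexes match the line, so which lines get reported is the same either way; only the matched portion differs.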
To match any one of a group of characters, enclose them in square brackets [ ]. Examples: [aA] matches an upper- or lower-case letter A; sno[wr]ing matches lines that contain "snowing" or "snoring".
Immediately after the opening [ or [^, a right square bracket is just a normal character: []abc] matches the character ], a, b, or c. A right square bracket after a left square bracket and at least one other character ends the character class, though as always you can use a backslash to make it normal: [abc\]] is the same character class as the preceding. Finally, a right square bracket with no preceding left square bracket is a normal character.
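The rules for the right square bracket can be checked with a small Python sketch. Python's re module is a stand-in for GREP's engine and follows the same convention for a ] placed first in a class; the test string is made up.

```python
import re

leading = re.compile(r"[]abc]")    # ] right after the opening [ is a normal character
escaped = re.compile(r"[abc\]]")   # the same class, written with a backslash

chars = "]abcx["
in_leading = [c for c in chars if leading.fullmatch(c)]
in_escaped = [c for c in chars if escaped.fullmatch(c)]

# A ] with no preceding [ is just a normal character.
lone = bool(re.search(r"]", "x]y"))

print(in_leading, in_escaped, lone)
```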
In an extended regex, certain abbreviations and class names are available for commonly used classes.
You can indicate a character range with the minus sign or hyphen (-, ASCII 45). Examples: [0-9] will match any single digit, and [a-zA-Z] will match any English letter.
To match any Western European letter (under most recent versions of Windows, in North America and Western Europe), use the basic regex
    [a-zA-ZÀ-ÖØ-öø-ÿ]
(Note 1. That regex will work fine on the command line with GREP16 or in a file [/F option] with either GREP. But to enter it on the command line with GREP32, you must use numeric sequences for the 8-bit characters; see Special Rules for the Command Line below.)
(Note 2. In GREP32, you're better off setting an appropriate character mapping with the /M option and using an extended regex. The named character class [[:alpha:]] can then replace the above mess.)
A character class can contain both ranges and single characters, mixed any way as long as each range within the class is written low-high: T-f is fine since they are ASCII 84 and 102, but f-T is invalid.
There is no difference to GREP between writing out all the characters in a range and using the minus sign to abbreviate: [pqrsty] and [ytsrpq] and [yp-t] and [yq-stp] are just some of the ways to write the same class.
The minus sign is a normal character outside square brackets. It is also a normal character if it occurs at the beginning or end of a class (immediately after the opening [ or [^ or immediately before the closing ] character).
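A Python sketch can confirm that the four spellings describe the same class and that a reversed range is rejected. Python's re module stands in for GREP's engine; the helper name members is made up, and the way Python reports the bad range (an exception) is Python's behavior, not a claim about GREP's error message.

```python
import re
import string

def members(char_class):
    """Return the set of printable ASCII characters the class matches."""
    return {c for c in string.printable if re.fullmatch(char_class, c)}

# Four ways of writing the same class.
same = (members("[pqrsty]") == members("[ytsrpq]")
        == members("[yp-t]") == members("[yq-stp]"))

# T-f is low-high (84 to 102) and compiles; f-T is high-low and is rejected.
forward_ok = re.compile("[T-f]") is not None
try:
    re.compile("[f-T]")
    reversed_ok = True
except re.error:
    reversed_ok = False

# A minus sign at the start or end of a class is a normal character.
hyphen = members("[-a]") == members("[a-]") == {"-", "a"}

print(same, forward_ok, reversed_ok, hyphen)
```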
To match any character that is not in a class, use square brackets with a caret or circumflex (^, ASCII 94). Examples: [^0-9 ] matches any character except a digit or a space, and the[^a-z] matches "the" followed by anything except a lower-case letter.
Note: The negative character class matches any character not within the square brackets, but it does match a character. (It might help to read it as "a character that isn't ..." rather than just "not ...".) For instance, the[^a-z] matches "the" followed by something other than a lower-case letter, but it does not match "the" at the end of a line because then "the" is not followed by any characters. For further explanation, please see the lengthy example under the rules for ^ and $, below.
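The point that a negative class must still match a character shows up clearly in a Python sketch. Python's re module stands in for GREP's engine; the test lines are made up and deliberately carry no trailing newline, since Python would count a newline as a character that [^a-z] can match.

```python
import re

rx = re.compile(r"the[^a-z]")

followed_by_space = bool(rx.search("the end"))   # a space follows "the"
followed_by_letter = bool(rx.search("then"))     # "n" is a lower-case letter
at_end_of_line = bool(rx.search("read the"))     # nothing at all follows "the"

print(followed_by_space, followed_by_letter, at_end_of_line)
```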
The caret has a different meaning when it occurs outside square brackets. And when it occurs within square brackets but not immediately after the opening left square bracket, the caret is a normal character.
If you use the /I option to specify case-blind matching, then the character class [abc] matches an upper-case or lower-case a, b, or c. With the /I option in effect, [^abc] matches any character except A, a, B, b, C, or c.
Extended regexes support POSIX character class names, such as [:lower:] for any lower-case letter and [:^lower:] for any character except a lower-case letter. Notice that you can negate a character class name by putting a caret after the first colon.
These are not character classes, but special names that you can insert within square brackets as (part of) a character class. For instance, the extended regex [AB[:^alpha:]] matches any non-alphabetic character or a capital A or B.
Here is the complete list of POSIX character class names. Remember that they occur inside the normal square brackets for a character class. Also remember that they must be surrounded by [: :], or [:^ :] for negation.
word | any "word" character (letters, digits and underscore, same as \w) |
alnum | any letter or digit |
alpha | any letter |
lower | any lower case letter |
upper | any upper case letter |
digit | any decimal digit (same as \d) |
xdigit | any hexadecimal digit, decimal digits plus A-F and a-f |
space | any white space character (same as \s) |
graph | any printing character, excluding space |
print | any printing character, including space |
punct | any printing character, excluding letters and digits and the space character |
ascii | any ASCII character (see note below) |
cntrl | any control character |
The exact definitions of the above classes will depend on the
character mapping in effect. In the default C locale, the above
classes match only 7-bit characters (character positions 0-127); in
other mappings, 8-bit characters also match.
You can set the character mapping with the /M option.
Use the supplied file TEST255
to test the meaning of any
character class in your selected locale; see
examples in the supplied DEMO.BAT file.
A caret or circumflex (^
, ASCII 94)
at the start of a regex
means that the pattern starts at the beginning of a line in
the file(s) being searched. A dollar sign ($
,
ASCII 36) at the end of a regex means that the pattern
ends at the end of a line in the file(s) being searched.
The caret and dollar are sometimes called anchors because they anchor a regex to the start or end of a line (or both). They're also the two best-known examples of assertions, constructs that match a condition rather than a character.
Examples:
^[wW]hereas
matches the word "Whereas" or
"whereas" at the start of a line, but not in the middle of a line.
Blanks are not ignored, so if you want to find that word whenever it's
the first word of the line, you need to use a pattern like
^ *[wW]hereas
to allow for indentation.
^$
will match only lines that contain no characters at
all.
^ *$
will match lines that contain no characters
or contain only spaces.
^ +$
will match lines that contain only spaces,
but not empty lines.
^[A-Za-z]+$
will find every line that contains
nothing but one or more English letters.
^ *[a-z]+ *$
will find every line that contains
exactly one lower-case English word, possibly preceded or followed by
blanks.
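The anchor examples above can be tried out with a short sketch. Python's re module is used only as a convenient stand-in for GREP (an assumption; the two engines differ in other respects), and the sample lines and the helper matching are invented for the demonstration.

```python
import re

# Sample "lines" without line terminators, as GREP would see them in text mode.
lines = ["", "   ", "Whereas", "  whereas it is", "hello world", "word  "]

def matching(regex, candidates):
    """Return the candidate lines in which the regex finds a match."""
    return [c for c in candidates if re.search(regex, c)]

print(matching(r"^$", lines))             # lines with no characters at all
print(matching(r"^ *$", lines))           # empty lines or lines of spaces only
print(matching(r"^ +$", lines))           # lines of spaces only, not empty lines
print(matching(r"^ *[wW]hereas", lines))  # "Whereas" as first word, indented or not
print(matching(r"^ *[a-z]+ *$", lines))   # exactly one lower-case word
```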
You should probably use ^
and $
only in
text mode or
record-oriented binary mode.
Also, they make sense only at the beginning and end of your regex. For
those who prefer to live life on the edge, here are the full
rules:
 | Basic regex | Extended regex |
---|---|---|
With line-oriented text or record-oriented binary (/R0 or /R2) | ^ at the start of a basic regex matches the start of a line or record. $ at the end of a basic regex matches the end of a line or record. Both ^ and $ are normal characters everywhere else except within square brackets. | ^ and $ outside square brackets always mean start and end of a line or record. If you misplace them, your extended regex won't match anything. |
With free-form binary (/R3) | In a basic regex, ^ and $ outside square brackets don't match anything useful. | In an extended regex, ^ and $ outside square brackets match a newline (ASCII 10). |
When GREP senses file format (/R-1 or /R-2) | Don't use ^ and $ in a regex with the /R-1 or /R-2 option. If you do use them, they work correctly in text files, but in binary files they match the start and end of every buffer, arbitrary file positions that are not likely to be useful. |
It's a historical artifact that the rules for basic and extended regexes are not quite the same.
Suppose you want to find the
word "the" in a file, whether in caps or lower case. You can use the
/I
option
to make the search case blind, and concentrate
on constructing the regexes.
At first glance,
[^a-z]the[^a-z]
seems adequate: anything other than a
letter, followed by "the", followed by anything but a letter. That
lets in "the" and rules out "then" and "mother". But it also rules
out "the" at the beginning or end of a line. (Remember that a negative
character class does insist on matching some character. Read it as
"any character other than ..." rather than as simply "not...".) The
solution with basic regexes requires four of them, for "the" at the
beginning, middle, or end of a line, or on a line by itself:
^the[^a-z]
[^a-z]the[^a-z]
[^a-z]the$
^the$
To search for just the occurrences of the word "the", put those four lines in a file and then use the
/F
option on GREP.
But this becomes much easier if you use the power of extended
regular expressions (/E2
option).
You can search for the word "the", not embedded in larger words, with
one extended regex:
\bthe\b
Read this as "a word boundary, followed by t-h-e, followed by a word boundary." As you would expect, start and end of line count as word boundaries.
Technically, there could be a problem with the above regular expression: it would not match "the6" or "the_" since the underscore and the digits are considered "word characters". It's not likely you'd get such sequences in a text file, but if you want to be absolutely precise you should use alternatives with subexpressions in the extended regex:
(^|[^a-z])the($|[^a-z])
Again assuming the /I
option, you would read this as "either start of line or a
non-letter, followed by t-h-e, followed by either end of line or a
non-letter". It's a little bit nasty, but still it's probably easier
than typing in four regexes.
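To compare the three approaches from this example side by side, here is a hedged sketch. It assumes Python's re module as a stand-in for GREP, with re.IGNORECASE playing the role of the /I option; the helper names found and found_any and the sample lines are made up for the illustration.

```python
import re

# The four basic regexes, the word-boundary regex, and the alternatives regex.
basic_four = [r"^the[^a-z]", r"[^a-z]the[^a-z]", r"[^a-z]the$", r"^the$"]
word_boundary = r"\bthe\b"
alternatives = r"(^|[^a-z])the($|[^a-z])"

def found(regex, line):
    """Case-blind search, standing in for GREP's /I option."""
    return re.search(regex, line, re.IGNORECASE) is not None

def found_any(regexes, line):
    """True if any regex in the list matches, like a file of regexes."""
    return any(found(r, line) for r in regexes)

for sample in ["the", "The end", "at the", "mother", "then", "the6"]:
    print(sample, found_any(basic_four, sample),
          found(word_boundary, sample), found(alternatives, sample))
```

All three agree on every sample except "the6": the word-boundary regex rejects it because the digit counts as a word character, while the other two accept it.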
In an extended regex only, the vertical bar (|
,
ASCII 124) separates two or more alternatives. The extended regex will match
lines that contain any of the alternatives. It is legal for an
alternative to be empty, and this can be useful in
subexpressions.
Example: the extended regex cat|dog
will match any
input line that contains the string "cat" or "dog".
If you want alternatives for part of an extended regex, use parentheses or round brackets to form a subexpression. See the Lengthy Example immediately above, and more examples in the section on subexpressions.
If you are matching alternatives that must occur at the start or
end of a line, the anchor needs to be in each alternative. Example: to
match lines that start with "cat" or "dog", use ^cat|^dog
as your extended regex. Another way to express that is with a
subexpression, ^(cat|dog)
.
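Here is a small sketch of why the anchor placement matters. Python's re module is assumed as a stand-in for GREP, and the sample lines and the helper matching are invented for the demonstration.

```python
import re

lines = ["cat nap", "dog days", "hot dog", "concatenate"]

def matching(regex):
    """Return the sample lines in which the regex finds a match."""
    return [line for line in lines if re.search(regex, line)]

print(matching(r"^cat|^dog"))   # anchor repeated in each alternative
print(matching(r"^(cat|dog)"))  # one anchor applied to a subexpression
print(matching(r"^cat|dog"))    # anchor applies to "cat" only, so "hot dog" slips in
```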
Efficiency note: Alternatives can be slower than character classes.
The extended regex bar|bat
or ba(r|t)
is logically equivalent to the basic or extended regex
ba[rt]
, but the latter will generally execute faster.
You may or may not notice any time difference, depending on the speed
of your computer and the size of the files that you're searching.
Caution: The vertical bar |
has special meaning on the
DOS command line. If your operating system doesn't let you override
that meaning, use the
/F-
option to enter it from the
keyboard, or see Backslash for
Character Encoding below.
In an extended regex only, the parentheses or round brackets have several uses, but only two will be discussed in this reference manual.
The first use is straightforward: to set up alternatives as part of an extended regex. For example, the extended regex
the quick (brown fox|white rabbit)
matches lines containing either "the quick brown fox" or "the quick white rabbit". Here's another example, adapted from the PCRE manual page:
cat(aract|erpillar|)s
matches lines containing
"cataracts", "caterpillars", or "cats".
The second use of parentheses is to set up a "capturing subpattern", which can be referred to with a "back reference" using a backslash; see below.
For advanced topics not covered in this reference manual, please see Final Thoughts, below.
Parentheses are not special inside square brackets, or anywhere in a basic regex.
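The first use of parentheses, including an empty alternative, can be tried with the sketch below. It assumes Python's re module as a stand-in for GREP's extended regexes; the list of test words is made up.

```python
import re

# The empty third alternative lets "cats" match alongside the longer words.
regex = re.compile(r"cat(aract|erpillar|)s")

results = {}
for word in ["cataracts", "caterpillars", "cats", "cat", "catalogs"]:
    results[word] = regex.search(word) is not None
    print(word, results[word])
```

"cat" fails because the final s is required, and "catalogs" fails because none of the alternatives (not even the empty one) is followed directly by an s.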
The backslash (\) has quite a number of uses.
First and simplest, when the backslash precedes any
non-alphanumeric character it makes that character normal. For
example, the regex 2+2
normally matches a string of two or more
2s. (The 2+
construct means "one or more occurrences of
the character 2".) If you want to match that middle character as an
actual plus sign, you must "escape" it with a backslash:
2\+2
.
If you want to match a backslash itself, you escape it in the same
way. For example, the regex ^c:\\
matches every line that begins
with "c:\".
The backslash functions as an escape both inside and outside of
square brackets. If you are not sure when
a non-alphanumeric character like ]
or $
is
special and when it is not, just precede it with a backslash and it
will be a normal character, even if it would have been normal
anyway.
Example: To match any of the four signs of arithmetic, you might write
the regex [+-*/]
. But that minus sign has a
special meaning inside square brackets.
To treat it as a normal character you must escape it with the
backslash, like this: [+\-*/]
.
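The escaping rules above can be checked with a short sketch. Python's re module is assumed as a stand-in; in particular, the way it rejects the unescaped class is Python's behavior, and GREP may report or handle that case differently. The helper compiles is hypothetical.

```python
import re

print(re.search(r"2+2", "x = 222") is not None)    # a run of 2s matches
print(re.search(r"2+2", "2+2 = 4") is not None)    # no two adjacent 2s here
print(re.search(r"2\+2", "2+2 = 4") is not None)   # escaped plus is a literal plus
print(re.search(r"^c:\\", r"c:\dos\grep.exe") is not None)
print(re.findall(r"[+\-*/]", "a-b*c+d/e"))         # all four arithmetic signs

def compiles(regex):
    """Return True if Python accepts the regex (Python-specific check)."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False

# Unescaped, the minus sign asks for the range from + to *, which is backwards.
print(compiles(r"[+-*/]"))
```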
This is the only use of the backslash in basic regexes; the rest all relate to extended regexes.
Many regexes involve a type of character: digit (or not), blank (or not). While you can always use ordinary character classes, in an extended regex you can also use these shortcuts on their own or as part of a character class:
\w | any "word" character, meaning any letter or decimal digit or an underscore |
\W | any character except a "word" character |
\d | any of the decimal digits |
\D | any character except a decimal digit |
\s | any whitespace character: tab, space, and so on |
\S | any character except a whitespace character |
The exact definitions of the above types will depend on the
character mapping in effect. In the default C locale, no 8-bit
characters (characters 128-255) are considered as possible
"word" characters, digits, or whitespace; in
other mappings, some 8-bit characters also match.
You can set the character mapping with the /M option.
Use the supplied file TEST255
to test the meaning of any
character type in your selected locale; see
examples in the supplied DEMO.BAT file.
Example: To scan a file for four-digit numbers, your regex
could repeat the \d
four times or use
curly braces: \d\d\d\d
or
\d{4}
.
Did you spot the problem with the preceding example? Yes, either of those extended regexes matches lines containing four-digit numbers. But it also matches lines containing five-digit numbers, since a five-digit number contains four consecutive digits. One way to match numbers of exactly four digits is to mark them as being preceded by start of line or a non-digit, and followed by end of line or a non-digit:
(^|\D)\d{4}($|\D)
Of course, if you know something about the files you're scanning you may not need to get so elaborate.
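The difference between the loose and the strict four-digit regex shows up in a quick sketch. Python's re module is assumed as a stand-in for GREP's extended regexes, and the sample lines and the helper hits are invented.

```python
import re

loose = r"\d{4}"                # four digits anywhere, even inside a longer number
strict = r"(^|\D)\d{4}($|\D)"   # four digits bounded by non-digits or line ends

def hits(regex, line):
    """Return True if the regex matches somewhere in the line."""
    return re.search(regex, line) is not None

for line in ["year 2024", "zip 90210", "1234", "id 123"]:
    print(line, hits(loose, line), hits(strict, line))
```

Only "zip 90210" separates the two: the loose regex finds 9021 inside the five-digit number, and the strict one does not.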
Example: To scan for four hexadecimal digits, use the extended regex
[\da-fA-F]{4}
(This one has the same problem as the previous example: it also matches five or more hex digits. Fixing it is left as an exercise for the reader!)
The assertions in this section look like the above
character types, but there's an
important difference. The difference is that while a character type
matches a character of specified type, an assertion matches a position
in the line and doesn't "consume" a character. (You already know two
examples of assertions, namely the anchors
^
and $
.)
\b | word boundary, namely the transition between a word and a non-word character or vice versa, or the beginning or end of line if the adjacent character is a word character |
\B | not a word boundary |
\A | same as ^ : start of input line (text mode) or record (binary mode) |
\z, \Z | same as $ : end of input line (text mode) or record (binary mode) |
These assertions are not valid inside square
brackets, and in fact \b
has a different meaning
inside a character class; see Backslash
for Character Encoding, below.
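A short sketch may make "matches a position and doesn't consume a character" more concrete. It assumes Python's re module as a stand-in for GREP and shows only \b and \B, since Python treats the end-of-input assertions somewhat differently.

```python
import re

# The match covers only the three letters; the boundaries take up no characters.
match = re.search(r"\bthe\b", "in the end")
print(match.span())

# \B is the opposite assertion: "the" must be embedded in a larger word.
embedded = re.search(r"\Bthe\B", "mother") is not None
standalone = re.search(r"\Bthe\B", "the") is not None
print(embedded, standalone)
```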
Outside square brackets, a backslash
followed by a number > 0 is interpreted as a back reference to
a capturing subpattern in the regex. For
example, \6
refers to the sixth capturing subpattern in
the extended regex.
Example (from the PCRE man page): the extended regex
(sens|respons)e and \1ibility
matches "sense and sensibility" or "response and responsibility" but not "sense and responsibility". A back reference always refers to the actual matching subpattern in this particular instance, not to just any alternative.
Example: U.S. toll-free area codes are 800, 888, 877, 866 (and soon
855). The regex 8[08765]{2} would be wrong because it would match
strings like "867" and "808". You need a back reference to ensure that
the third digit is the same as the second: 8([08765])\1
is your regex. That says you must have an 8, followed by 0, 8, 7, 6,
or 5, followed by a second occurrence of the same digit.
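Both back-reference examples can be tried with the sketch below. It assumes Python's re module as a stand-in for GREP's extended regexes, and it writes the toll-free regex with unescaped parentheses, as an extended regex requires for a capturing subpattern. The helper ok is made up.

```python
import re

sense = re.compile(r"(sens|respons)e and \1ibility")
toll_free = re.compile(r"8([08765])\1")

def ok(regex, text):
    """Return True if the compiled regex matches somewhere in the text."""
    return regex.search(text) is not None

print(ok(sense, "sense and sensibility"))
print(ok(sense, "response and responsibility"))
print(ok(sense, "sense and responsibility"))   # \1 must repeat "sens", not "respons"

for code in ["800", "888", "877", "866", "855", "867", "808"]:
    print(code, ok(toll_free, code))
```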
A "back reference" can actually be a forward reference: any of
\1
through \9
refers to the first through
ninth capturing subpattern in the extended regex, even if that
subpattern comes after the "back reference" in the regex. But
\10
and greater can refer only to subpatterns that
precede the back reference. If something looks like a back reference
but the number is greater than 9 and greater than the number of
capturing subexpressions to the left of it, it is read as
an encoded character in octal.
The last use of backslash documented in this reference manual is to
encode certain characters, either non-printing characters or those
that DOS doesn't allow on the command line. These rules are nasty, and
I recommend you use the
/F
option, if possible, to enter
a messy regex from the keyboard or in a file.
(Please note that these rules for extended regexes are quite different from the Special Rules for the Command Line that apply to basic regexes. It's an unfortunate incompatibility, but neither can be changed because PCRE is a supplied library for extended regexes and users rely on existing behavior of basic regexes.)
Except as noted, each of these sequences has the indicated meaning anywhere in an extended regex:
\a | "alarm", the BEL character, ASCII 7 |
\b | the backspace character, ASCII 8, but only inside square brackets. Outside square brackets it is an assertion. |
\cx | a control character. If x is a letter, it's straightforward: \cb and \cB are both Control-B, ASCII 2. If x is not a letter, it is XORed with 64 (hex 40). |
\e | escape, ASCII 27 |
\f | form feed, ASCII 12 |
\n | "newline", line feed, ASCII 10. This character will never be seen in a text file, since it marks a line break. It can occur in a binary file. |
\r | carriage return, ASCII 13. This character will never be seen in a text file, since it marks a line break. It can occur in a binary file. |
\t | tab, ASCII 9 |
\xhh | character with the given hex code hh (zero, one, or two digits). Examples: \x7c or \x7C is hex 7C (ASCII 124), the | character. \x or \x0 or \x00 is the NUL character, ASCII 0. |
\0dd | octal number of one to three digits. \032 is Control-Z, ASCII 26. |
\ddd | This sequence, one to three digits where the first one is not zero, is complicated. Outside square brackets, it's read as a decimal number and is interpreted as a back reference (above) if possible. Otherwise, or always inside square brackets, it's read as an octal number and the least significant 8 bits are taken as its value. Examples: \7 is a back reference. \11 is a back reference if there have already been eleven capturing subpatterns; otherwise it's octal 11, ASCII 9, the tab character. |
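A few of the encodings in the table can be checked with the sketch below. It assumes Python's re module as a stand-in; Python happens to share these particular sequences with PCRE, but it does not support \cx or \e and it treats \ddd differently, so those rows are not shown. The helper matches is hypothetical.

```python
import re

def matches(regex, text):
    """Return True if the regex matches somewhere in the text."""
    return re.search(regex, text) is not None

print(matches(r"\x7c", "a|b"))      # hex 7C is the vertical bar
print(matches(r"\032", "x\x1ay"))   # octal 032 is Control-Z, ASCII 26
print(matches(r"\t", "a\tb"))       # tab, ASCII 9
print(matches(r"\a", "ring\x07"))   # BEL, ASCII 7
print(matches(r"[\b]", "a\x08b"))   # inside brackets: backspace, ASCII 8
print(matches(r"[\b]", "a b"))      # no backspace character in this text
print(matches(r"a\b", "a b"))       # outside brackets: word-boundary assertion
```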
An extended regex can have further special constructs beyond those documented here. If you need more, please refer to the full documentation at <http://www.pcre.org/man.html>. That document contains information about writing programs using PCRE, as well as complete specification of the PCRE (extended) regexes. For completeness, a copy of it with just the information relevant to GREP users is provided as document PCRE.HTM.
Advanced users may want to consult that document for the following topics:
The special rules below get around three problems:
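Because - and / introduce options, GREP will try to interpret your regex or search string as GREP options if it begins with either of those characters.
DOS gives characters like <, =, and | a special meaning on the command line, so GREP never sees them.
In GREP32, 8-bit characters typed on the command line are translated from a DOS character set to a Windows character set before GREP sees them.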
Therefore GREP defines some special sequences starting with a
backslash \
to let you get problem characters into your
regex.
These rules date back to a much earlier release of GREP, and while there are better ways available now, the rules are maintained for users who have come to rely on them. Beginning with release 6.0, you can turn the special rules on or off.
You need them only when
You are entering a regex or search string on the command line (no
/F
option), and
Your regex or search string contains special characters like
<
, |
, >
, space, and
semicolon. (DOS variants like 4DOS have different sets of special
characters.)
When you select extended regexes (/E2
option), you probably don't want the special rules given
below. Extended regexes come with
their own ways of using a backslash
for character encoding.
When you have a messy regex, I recommend you use the
/F-
or /F
file
option to enter it from the keyboard or store it in a file. This
gets around the whole problem.
When you specify the /E
n
option, you also turn the special rules on or off by following the
number n with a backslash (\
), or not.
Example: /E0
specifies simple string search and turns the
special rules off; /E0\
specifies simple string search and
turns the special rules on.
If you never specify an /E
option, GREP takes your
regex as a basic regex and turns on the special rules. This is the
same as the /E1\
option, and it was the only choice
until GREP release 6.0. If you want GREP to work with basic regexes
but without the special rules, specify the /E1
option.
If your regex begins with a minus (-
) or slash
(/
), GREP will try to interpret it as an option. Example:
if you're searching for the string "-in-law", GREP will think you're
trying to turn on the options /I
, /N
, and so
on. To avoid this problem, use a leading backslash
(\-in-law
).
If your regex contains certain special characters like
<
, =
, and |
, DOS will give
those characters their special DOS meaning and GREP will never see
them.
So you must use special "escape sequences" to represent those
characters in a regex on the command line, as follows:
instead of | you can use any of |
---|---|
< (less) | \l \60 \0x3C \074 |
> (greater) | \g \62 \0x3E \076 |
| (vertical bar) | \v \124 \0x7C \0174 |
" (double quote) | \" \34 \0x22 \042 |
, (comma) | \c \44 \0x2C \054 |
; (semicolon) | \i \59 \0x3B \073 |
= (equal) | \q \61 \0x3D \075 |
(space) | \s \32 \0x20 \040 |
(tab) | \t \9 \0x09 \011 |
(escape) | \e \27 \0x1B \033 |
You can enter any character as a numeric sequence, not just the
special characters in the above list. Use
decimal, hex (leading 0x
), or
octal (leading zero). Example: capital A would be
\65
, \0x41
, or \0101
.
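The decimal, hex, and octal forms are easy to mix up, so here is a minimal sketch of how one numeric sequence maps to a character. It is written in Python purely for illustration: the function decode_numeric_escape is hypothetical and is not GREP's own code, and it handles only the numeric forms (with a lower-case 0x prefix), not the letter shortcuts such as \l or \v.

```python
def decode_numeric_escape(seq):
    """Decode one numeric sequence such as \\65, \\0x41, or \\0101.

    Hypothetical helper for illustration; not GREP's implementation.
    """
    if not seq.startswith("\\"):
        raise ValueError("sequence must start with a backslash")
    digits = seq[1:]
    if digits.startswith("0x"):
        code = int(digits[2:], 16)   # leading 0x: hexadecimal
    elif digits.startswith("0") and len(digits) > 1:
        code = int(digits, 8)        # leading zero: octal
    else:
        code = int(digits, 10)       # otherwise: decimal
    return chr(code)

# Three spellings of capital A, then three spellings of the less-than sign.
print([decode_numeric_escape(s) for s in [r"\65", r"\0x41", r"\0101"]])
print([decode_numeric_escape(s) for s in [r"\60", r"\0x3C", r"\074"]])
```

The leading zero is what separates octal from decimal, which is why \074 and \60 both stand for the less-than sign.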
ASCII 0 is not legal in a basic regex: it causes the rest of the
regex to be silently ignored. Either code something like
[^\1-\255]
("any character except ASCII 1 to 255") in
your basic regex, or use an
extended regex.
Finally, if your regex contains 8-bit characters, Microsoft's 32-bit
startup code (not DOS) will translate these characters from a DOS
character set to a Windows character set, which is probably not what
you want. To avoid this problem, you can
use the numeric sequences to enter characters. Example: In a regex
on the command line, instead of actually typing the character
é
, enter it as \233
or
\0xE9
or \0351
. This problem affects GREP32,
not GREP16.
Remember, the rules in this section are required only to get around
problems with special characters in basic regexes on the command line.
These workarounds are not needed, and don't work, when you use
the /F
option to enter regexes in a
file or from the keyboard.
Extended regexes have their own rules; see
Backslash for Character Encoding,
above. Yes, you can have the special rules with an extended regex
(/E2\
option), if
you really like using lots of backslashes.
When the special rules are in effect, you can find out how GREP
applied them by using the /D
option
and looking for the "massaged" string or regex.