GREP -- Find Regular Expressions in Files
Reference Manual

program release 6.9 of 22 December 2001
Copyright © 1986-2001 by Stan Brown, Oak Road Systems

GREP is a filter that searches input files, or the standard input, for lines that contain matches for one or more patterns called regular expressions and displays those matching lines. GREP can also search binary files and display records or buffers that contain matches.

This is a reference manual, including all the command-line options and a detailed description of regular expressions. For an overview of GREP, please read the user guide first. (A full revision history is also available.)


Contents


Options
Input File Options
/A -- Include Hidden and System Files
/Rn -- Read and Display Input Files as Binary or Text
/S -- Scan Subdirectories
/Wwidth or /Wtxwid,bnwid -- Specify Line Width or Binary Block Length
Pattern-Matching Options
/Eregex_level -- Select Extended Regexes or Strings
/Ffile or /F- -- Read Regexes from File
/I -- Ignore Case in Matching
/Mmapping -- Specify Character Mapping or Locale
/V -- Display Lines That Don't Contain a Match
/Y -- Multiple Regexes AND instead of OR
Output Options
/B -- Display a Header for Every File Scanned
/C -- Display Only the Count of Matches
/H -- Don't Display Filenames in Output
/J -- Display Just the Part of Each Line That Matches
/L -- Report Only Names of Files That Contain Matches
/N -- Prefix Line Numbers to Matching Lines
/Pbefore,after -- Show Context Lines around Matching Lines
/U -- UNIX-style Output
General Options
/Dfile or /D or /D- -- Display Debugging Output
/Qlevel -- Suppress the Logo and Unwanted Warnings
/Z -- Reset All Options
/0 or /1 -- Set ERRORLEVEL to Show Whether Matches Were Found
/? -- Display Help
Environment Variable
Regular Expressions (Regexes)
Overview
Examples
Basic and Extended Regexes
Compatibility Note
Normal Character (any regex)
. for Any Character (any regex)
* or + for Repetition (any regex)
? for Optional Match (extended regex)
{...} for Repetition (extended regex)
Greedy Quantifiers (extended regex)
[...] for Character Class (any regex)
- for Character Range (any regex)
Negative Character Class (any regex)
Character Class and Case-Blind Matching (any regex)
Character Class Names (extended regex)
^ and $ for Start and End of Line (any regex)
Lengthy Example
| for Alternatives (extended regex)
(...) for Subexpressions (extended regex)
The Backslash \
Backslash as Escape (any regex)
Backslash for Character Types (extended regex)
Backslash for Assertions (extended regex)
Backslash for Back References (extended regex)
Backslash for Character Encoding (extended regex)
Some Final Thoughts (extended regex)
Special Rules for the Command Line
When Do You Need the Special Rules?
How Do You Turn the Special Rules on or Off?
What Exactly Are the Special Rules?

 

Options


Four sections below describe the options in detail, by functional groups: input file options, pattern-matching options, output options, and general options.

Input File Options

/A -- Include Hidden and System Files

      

Include hidden and system files when expanding wild cards (* and ?) in file specifications. Without this option, GREP will ignore hidden and system files while searching for files that match a wild card. However, if you explicitly specify a file on the command line, GREP will always read it even if it's a hidden or system file.

The /A option also modifies the action of the /S option (if present), determining whether subdirectories marked hidden or system will be searched.

/Rn -- Read and Display Input Files as Binary or Text

      

Process named input files as text or binary. (Please see Binary Files and Text Files for detailed information about the differences.) You can choose from

/R0
  Read all input files as text. (This is the default.)
 
/R1
  (reserved for future use)
 
/R2
  Read all input files as record-oriented binary. The fixed record length is given by the /W option.
 
/R3
  Read all input files as free-format binary. The /W option gives the buffer size. (To find all matches, make sure your buffer size is at least twice the longest string you expect to find.)
 
/R-2 (registered version only)
  Examine each input file to decide how to read and display it. If any of the characters ASCII 0-6 or 14-26 are found, GREP treats the file as free-format binary (like /R3); otherwise GREP treats the file as text (like /R0).
 
Please see additional comments after /R-1, immediately below.
 
/R-1 (registered version only)
  Examine the first 256 bytes of each input file to decide quickly how to read and display it. If any of the characters ASCII 0-6 or 14-26 are found, GREP treats the file as free-format binary (like /R3); otherwise GREP treats the file as text (like /R0).
 
Comparison of /R-1 and /R-2
 
/R-2 reads the entire file where /R-1 reads only the first 256 bytes to make a decision. Experiments show that 256 bytes is plenty for a correct decision for most file types, including picture files, executable programs, and MS Office files of all types. Adobe Acrobat PDF files are an exception, in that the first binary byte shows up well after byte 256; but the displayed text is encrypted in those files so you can't search for text in them anyway. (If anyone knows of another file type where binary bytes show up only after byte 256, I'd be grateful for information.)
 
Thus /R-2 is theoretically safer than /R-1, but by the same token /R-2 will be slower on a big file that is actually text. The difference may or may not be noticeable, depending on how fast your disk and your CPU are and how your operating system buffers file reads.
 
Which one should you use? My own choice is to put /R-1 in the environment variable. That way I am confident that GREP will correctly sense the type of non-PDF binary files, yet not take a long time to decide that a big text file is actually text.
 
When you use the /R-1 or /R-2 option, GREP will display the actual file type (text or binary) with the filename header, unless you use the /H option to suppress filename headers.
 
One caution: After GREP examines the bytes mentioned above, it either rewinds the file (if it's binary) or closes and reopens it (if it's text). Ordinarily that's not a problem, but if you specify a pseudo-file like COM1 or CON, the bytes that were used to decide whether it's a text file will be discarded. Use /R-1 or /R-2 only with real files.
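
As an illustration only, the detection rule for /R-1 and /R-2 can be sketched in Python. This is not GREP's source code: the name looks_binary and its limit parameter are invented for the example. Only the byte ranges (ASCII 0-6 and 14-26) and the 256-byte limit come from the description above.

```python
# Bytes that mark a file as binary, per the manual: ASCII 0-6 and 14-26.
BINARY_BYTES = frozenset(range(0, 7)) | frozenset(range(14, 27))


def looks_binary(data: bytes, limit=None) -> bool:
    """Return True if any telltale byte occurs in the examined part of data.

    limit=256 imitates /R-1 (first 256 bytes only);
    limit=None imitates /R-2 (the whole file).
    Hypothetical helper, not part of GREP.
    """
    sample = data if limit is None else data[:limit]
    return any(byte in BINARY_BYTES for byte in sample)


text = b"plain text\r\n" * 100          # CR (13) and LF (10) are not telltale bytes
late_binary = b"A" * 300 + b"\x00"      # first binary byte appears after byte 256

print(looks_binary(text, limit=256))         # False
print(looks_binary(late_binary, limit=256))  # False: the /R-1 style check misses it
print(looks_binary(late_binary))             # True: the /R-2 style check finds it
```

The last two calls show the trade-off discussed above: the quick check can be fooled by a file whose first binary byte comes late, while the full check has to read everything.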

Setting the /R option correctly lets you search for regexes in .EXE and .DLL files, word-processing files, and so forth. /R-1 or /R-2 can be particularly useful when you don't know whether files are text or binary. (For instance, Microsoft Word writes some .DOC files in a binary format and some in a text format. Or you might have some source files and some object files and want to search them all in one go.)

Up through release 6.0, GREP had a single mixed binary mode controlled with the single-letter /R option. The /R option must now be followed by a number to select a specific mode.

Only named input files are read in binary mode. Regardless of the /R option value, when you use the /F option to read regexes from a file, that file is read in normal text mode. Also, if you don't specify any input files, GREP always scans the standard input (possibly piped with | or redirected with <) in text mode.

/S -- Scan Subdirectories

      

Please see the section on subdirectory searches in the user guide.

/Wwidth or /Wtxwid,bnwid -- Specify Line Width or Binary Block Length

      

Expect text lines up to txwid characters long, or process binary files in records or buffers of bnwid bytes. (If you specify only one number, it's used for both txwid and bnwid.)

txwid and bnwid default to 4096 in GREP32, and you can specify anything from 2 to 2147483645; the default for GREP16 is 256 and you can specify 2 to 32765. (The widths are also limited to available memory, which will depend on your system configuration, what other programs you have running at the time, and what you specify with the /P option. With GREP32, available memory includes Windows virtual memory.)

(For full details of binary and text file modes, please see that section in the user guide.)

Text mode (/W option without /R or with /R0)

The CR/LF (ASCII 13 or 10 or both) at the end of a line doesn't count against the specified txwid. If GREP reads a long line from the input, it will break it after txwid+1 characters and treat the remainder as a separate line. The whole line gets scanned, but any match that starts before the break and ends after the break will be missed. Therefore, if possible you should set txwid large enough to hold the longest line in the file.
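
To see why a match that straddles the break is missed, here is a small Python sketch. It is a simplified model, not GREP's code: the helper scan_in_pieces is invented, and it cuts the line into pieces of exactly width characters, which may differ slightly from where GREP actually breaks a line.

```python
import re


def scan_in_pieces(line: str, regex: str, width: int):
    """Cut line into pieces of at most width characters and search each
    piece on its own. Returns the pieces that contain a match.
    Hypothetical helper for illustration only."""
    pieces = [line[i:i + width] for i in range(0, len(line), width)]
    return [piece for piece in pieces if re.search(regex, piece)]


# A 22-character line with "needle" at positions 8-13.
line = "x" * 8 + "needle" + "y" * 8

# Width 10 cuts the line inside "needle", so neither piece matches.
print(scan_in_pieces(line, "needle", 10))

# Width 32 holds the whole line, so the match is found.
print(scan_in_pieces(line, "needle", 32))
```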

If GREP does find any lines longer than the specified or default txwid, it will display a warning message at the end of execution, telling you the length of the longest line. (This warning is suppressed by the /Q3 option.) GREP will also log each such file in the debug output; look for "exceeds txwid".

Record-oriented binary mode (/W option with /R2)

Files are read in binary mode, in records of bnwid bytes.

Free-form binary mode (/W option with /R3)

Files are read in binary mode, in buffers of bnwid bytes. bnwid must be an even number. The recommended value of bnwid is at least twice the longest string you expect to find. For instance, if you're searching for a regex that might match up to 40 characters, you want to specify /R3 /W80, since 2×40=80. If you're not sure just how long a string in the file will match your regex, it's better to overestimate a bit than to underestimate.

An internal procedure ensures that if a match exists in the file it will be found, provided the match is not longer than half the buffer. (As always, if one buffer contains multiple matches only the first match in that buffer will be counted.)
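
The manual doesn't say how the internal procedure works. One way to get the same guarantee, shown here purely as an assumption, is to scan windows of bnwid bytes that overlap by half. Any match no longer than half a window then lies entirely inside at least one window. The function search_overlapping and its overlap parameter are invented for this sketch.

```python
import re


def search_overlapping(data: bytes, pattern: bytes, bnwid: int, overlap: bool = True):
    """Scan data in windows of bnwid bytes and return the sorted absolute
    offsets of the first match in each window.

    With overlap=True, windows advance by bnwid // 2 (an assumed scheme,
    not necessarily what GREP does). With overlap=False, they advance by
    a full bnwid, which can split a match between two windows.
    """
    if bnwid < 2 or bnwid % 2:
        raise ValueError("bnwid must be an even number of at least 2")
    step = bnwid // 2 if overlap else bnwid
    regex = re.compile(pattern)
    found = set()
    for start in range(0, max(len(data), 1), step):
        match = regex.search(data[start:start + bnwid])  # first match only
        if match:
            found.add(start + match.start())
    return sorted(found)


# 200 bytes with a 40-byte run of "A" at offsets 75-114; buffer is 2 x 40 = 80.
data = b"." * 75 + b"A" * 40 + b"." * 85

print(search_overlapping(data, rb"A{40}", 80))                 # [75]
print(search_overlapping(data, rb"A{40}", 80, overlap=False))  # []
```

Without the overlap, the 40-byte string is cut at offset 80 and neither window sees all of it.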

When GREP chooses file mode (/W option with /R-1 or /R-2)

txwid is used as a line width for any file that is treated as a text file, and bnwid is used as buffer width for any file that is treated as free-form binary. bnwid must be an even number. If you specify only one number with /W, it is used for both purposes and must be an even number.

Pattern-Matching Options

/Eregex_level -- Select Extended Regexes or Strings

      

This option tells GREP how to interpret the regex(es) you enter on the command line, from keyboard, or in a file.

/E0
  Don't use regular expressions at all. Treat the regex(es) as simple literal strings and search files for exact match with no special treatment of any characters.
 
/E1
  Treat regexes as basic regexes. This is how GREP always worked before release 6.0, and it is still the default.
 
/E2 (GREP32 only)
  Treat regexes as extended regexes.
 
/E3 (GREP32 only)
  This was valid in release 6.0 but has been replaced by /E2 /J. The /J option tells GREP to display just the part of each line that matches the regex, and can be used with any level of the /E option.

Basic and extended regexes are fully explained under Regular Expressions, later. An extended regex supports all the features of a basic regex plus the quantifiers ? and {...}, alternatives |, subexpressions (...), some special constructs with the backslash \, and more.

The /E option can have a backslash suffix to turn on the Special Rules for the Command Line (q.v.). The Special Rules let your regex contain certain characters that are normally reserved by DOS. The Special Rules are turned off when you specify the /E option without a trailing backslash.

If you never specify the /E option at all, the effect is the same as /E1\, which is basic regexes with the Special Rules enabled; this default was chosen to match GREP's behavior before release 6.0. /E with no number is the same as /E1, which specifies basic regexes without the Special Rules.
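
The defaults in the last two paragraphs are easy to mix up, so here they are as a small Python decision function. It is only a restatement of the rules above; parse_e_option is an invented name, and its argument is the text that follows /E on the command line (or None when no /E option is given at all).

```python
def parse_e_option(opt):
    """Return (regex_level, special_rules_enabled) for an /E option.

    opt is None when /E was never specified; otherwise it is the text
    after "/E", such as "", "1", "2" or "2\\" (with a trailing backslash).
    Hypothetical helper for illustration only.
    """
    if opt is None:
        return (1, True)             # no /E at all behaves like /E1\
    special = opt.endswith("\\")     # trailing backslash turns Special Rules on
    digits = opt.rstrip("\\")
    level = int(digits) if digits else 1   # plain /E is the same as /E1
    return (level, special)


print(parse_e_option(None))    # (1, True)
print(parse_e_option(""))      # (1, False)
print(parse_e_option("2"))     # (2, False)
print(parse_e_option("2\\"))   # (2, True)
```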

/Ffile or /F- -- Read Regexes from File

      

GREP reads one or more regexes from file instead of taking a single regex from the command line, and reports lines from the input file(s) that match any of the regexes read from file. You must enter the regexes one per line in the file; don't put quotes around them. An empty file contains no regexes, and therefore matches nothing.

file must follow the /F with no intervening space, and the filename ends at the next space. If you use a minus sign as the filename (/F- option), GREP will accept regexes from standard input. Don't do this if you are redirecting input from a file with the < character!

When you supply two or more regexes, GREP normally reports each line from the input file that matches any (at least one) of the regexes. If you set the /V option or /Y option or both, you modify that behavior according to the rules of logic. Specifically:

  • With /Y but not /V, GREP reports only the lines that match all of the regexes.
  • With /V but not /Y, GREP reports only the lines that match none of the regexes. (If the input line matches one or more of the regexes, GREP doesn't report it.)
  • With /V and /Y, GREP reports every line that matches fewer than all of the regexes, i.e. every line that matches 0 to N-1 of your N regexes. (If the input line matches all the regexes, GREP doesn't report it; if it matches some of the regexes but not all, or none of the regexes, GREP reports it.)
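
The four combinations of /V and /Y above can be written as one short rule. The following Python sketch is a model of that logic and not GREP's implementation: the name reports_line and its parameters v and y are invented here, and Python's re module stands in for GREP's regex engine. It assumes at least one regex is supplied.

```python
import re


def reports_line(line, regexes, v=False, y=False):
    """Return True if a line would be reported, given the /V and /Y settings.

    Without /Y the regexes are joined by OR; with /Y they are joined by AND.
    /V then inverts the result. Hypothetical helper for illustration only.
    """
    hits = [re.search(rx, line) is not None for rx in regexes]
    matched = all(hits) if y else any(hits)
    return not matched if v else matched


regexes = ["brown", "fox"]
for line in ("The quick brown fox", "I see a brown smudge", "Nothing here"):
    print(line,
          reports_line(line, regexes),                  # neither option
          reports_line(line, regexes, y=True),          # /Y
          reports_line(line, regexes, v=True),          # /V
          reports_line(line, regexes, v=True, y=True))  # /V and /Y
```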

(The /Ffile option is active only in the registered version. /F- works in the evaluation version and the registered version.)

/I -- Ignore Case in Matching

      

Ignore case, treating caps and lower case as matching each other.

Caution: By default, the /I option does not apply to 8-bit characters (characters 128-255). You can turn on 8-bit character support in GREP32 with the /M option.

In GREP16, the /I option does not apply to 8-bit characters (characters 128-255) because Microsoft C 16-bit code does not support setting the locale. Therefore, if you want case-blind comparisons in GREP16, you must explicitly code any 8-bit upper and lower case in your regex. For instance, to search for the French word "thé" in upper or lower case, code it as th[éÉE] since é can be upper-cased as É or as plain E. The "th", being 7-bit ASCII characters, will be found as upper or lower case by the /I option. (You may need to code 8-bit characters like éÉ in a special way if you enter them on the command line; either use the /F option or see Special Rules for the Command Line below.)
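
The th[éÉE] technique can be tried out in Python, whose re module folds only ASCII letters when it is given a bytes pattern with IGNORECASE. That makes it a reasonable stand-in for the GREP16 behavior described above, though it is not GREP itself. The sample words are encoded in code page 1252, where é is byte E9 and É is byte C9 (hex).

```python
import re

# Sample words as code page 1252 bytes.
samples = [word.encode("cp1252") for word in ("thé", "THÉ", "The", "thE")]

# Case-blind matching that knows only ASCII letters: é is not folded to É.
naive = re.compile(rb"th\xe9", re.IGNORECASE)

# The 8-bit variants are listed explicitly in a character class.
explicit = re.compile(rb"th[\xe9\xc9E]", re.IGNORECASE)

for sample in samples:
    print(sample, bool(naive.search(sample)), bool(explicit.search(sample)))
```

The naive pattern finds only the lower-case "thé"; the explicit character class finds all four spellings.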

/Mmapping -- Specify Character Mapping or Locale

      

Set the character mapping or locale. This option is available only in GREP32, because Microsoft 16-bit C does not support setting the locale. There are three issues with locale: binary output, case-blind matching, and character classes. Details about all three are given below, after the list of mappings.

While many mappings (locales) are supported in GREP32, for our purposes most are duplicates. The six unique locales are

  • /Mfr -- code page 1252, valid for most European languages including Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian (both), Portuguese, Spanish, and Swedish.
  • /Mcsy -- code page 1250, valid for Czech, Hungarian, Polish, and Slovak
  • /Mell -- code page 1253, valid for Greek
  • /Mrus -- code page 1251, valid for Russian
  • /Mtrk -- code page 1254, valid for Turkish
  • /Mc -- the default C locale, in which characters 128-255 are all considered non-letters

The recommended strategy is to put an /M option in your environment variable with the appropriate mapping and then forget about it. The mapping affects the following three issues.

First, when displaying file contents in binary mode, GREP displays each non-printing character as the four-byte sequence <nn>, where nn is the hexadecimal value of the character. GREP32 uses the current mapping to decide what is and is not a printing character.

Second, in case-blind matching (/I option), the default C locale knows only about the English alphabet A-Z and a-z, with no accent marks. If you're doing case-blind matching and your input files may contain accented characters like é (character 233) or É (201), or non-English letters like Å (197) or å (229), you should use the /M option with the appropriate mapping from the above list.

Finally, character types and character class names in extended regular expressions (/E2 option). The question of what is a letter, or a punctuation mark, or a word boundary, will be different in different locales. If you're using extended regexes with character types or named character classes, and your input files may contain accented letters or non-English letters, you need the /M option.

You can use the supplied file TEST255 to test the meaning of any character class in your selected locale; see examples in the supplied DEMO.BAT file.
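
As a rough picture of the binary display rule described above (each non-printing character shown as <nn>), here is a Python sketch. It makes two assumptions that the manual does not state: that the mapping is the default C locale, so only characters 32-126 count as printing, and that the hex digits are shown in upper case. The name render_binary is invented.

```python
def render_binary(data: bytes) -> str:
    """Show printing characters as themselves and every other byte as <nn>,
    where nn is the two-digit hex value. Assumes the C locale (32-126 are
    printing) and upper-case hex digits; both are assumptions."""
    parts = []
    for byte in data:
        if 32 <= byte <= 126:
            parts.append(chr(byte))
        else:
            parts.append(f"<{byte:02X}>")
    return "".join(parts)


print(render_binary(b"MZ\x00\x90 caf\xe9"))
```

Under a different mapping, for instance /Mfr, a byte such as E9 (é) would count as a printing character and would not be shown in the <nn> form.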

/V -- Display Lines That Don't Contain a Match

      

Show or count the lines that don't match instead of those that do.

For the effect of the /V option with two or more regexes, see the /F option.

The /V option is not allowed with the /J option: it doesn't make any sense to display only non-matches but display the part of each line that was a match.

/Y -- Multiple Regexes AND instead of OR

      

When multiple regexes are being sought (/F option), report as matching lines only the lines from the input files that match all the regexes in any order.

For example, if you use the /F option and enter the two regexes brown and fox, then all of these lines will match:

        The quick brown fox
        I see a brown smudge
        Crazy like a fox
        The fox's tail is brown

But if you also use the /Y option, then GREP will match only lines that contain both the regular expressions, namely the first and fourth lines in the example. In other words, multiple regexes are normally joined by OR, but with the /Y option they are joined by AND.

As you see from the example, with the /Y option, input lines must match all the regexes, but in any order. If you want to match all regexes in a specific order, specify them as a single regex connected with ".*". For instance, to match lines that contain "brown" somewhere before "fox", use the regex brown.*fox. This matches lines that contain "brown" with "fox" somewhere later on the line, including lines with "brownfox".
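
Here is the same example worked through in Python, using its re module as a stand-in for GREP's regex engine. The code only models the matching rules described above; it does not run GREP.

```python
import re

lines = [
    "The quick brown fox",
    "I see a brown smudge",
    "Crazy like a fox",
    "The fox's tail is brown",
    "brownfox",
]
regexes = ["brown", "fox"]

# Default behavior: a line matches if ANY regex matches (OR).
either = [ln for ln in lines if any(re.search(rx, ln) for rx in regexes)]

# /Y behavior: a line matches only if ALL regexes match, in any order (AND).
both = [ln for ln in lines if all(re.search(rx, ln) for rx in regexes)]

# Single regex brown.*fox: "brown" must come before "fox" on the line.
ordered = [ln for ln in lines if re.search("brown.*fox", ln)]

print(either)
print(both)
print(ordered)
```

Note that "The fox's tail is brown" passes the AND test but not the ordered regex, since "fox" comes first on that line.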

While not actually forbidden, the /Y option usually doesn't give useful results with the /R3 option.

For the effect of the /V option with the /Y option, see the /F option.

Output Options

Before going through the output options, let's take a moment to look at some of the possible output formats. By default, GREP's output is similar to that of DOS FIND:

        ---------- GREP.C
                op_showhead = ShowNoHeads;
                else if (op_showhead == ShowNoHeads)
                op_showhead = ShowNoHeads;

        ---------- GREP_MAT.C
                op_showhead == ShowNoHeads)

However, the /U option produces UNIX grep-style output like this:

        GREP.C:        op_showhead = ShowNoHeads;
        GREP.C:        else if (op_showhead == ShowNoHeads)
        GREP.C:        op_showhead = ShowNoHeads;
        GREP_MAT.C:        op_showhead == ShowNoHeads)

As you can see, the main difference is that DOS-style output has the filename as a header above the group of matching lines from that file, and UNIX-style output has the name of the file on every matching line.
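
If you post-process GREP output in a script, it may help to see the two layouts built from the same data. This Python sketch reproduces the layouts shown above from a list of (filename, line) pairs. The function names dos_style and unix_style are invented, and the exact spacing GREP uses may differ.

```python
from itertools import groupby


def dos_style(matches):
    """One header per file, followed by that file's matching lines.
    matches is a list of (filename, line) pairs in file order."""
    out = []
    for name, group in groupby(matches, key=lambda pair: pair[0]):
        if out:
            out.append("")            # blank line between files
        out.append(f"---------- {name}")
        out.extend(line for _, line in group)
    return out


def unix_style(matches):
    """The filename is repeated in front of every matching line."""
    return [f"{name}:{line}" for name, line in matches]


matches = [
    ("GREP.C", "        op_showhead = ShowNoHeads;"),
    ("GREP.C", "        else if (op_showhead == ShowNoHeads)"),
    ("GREP_MAT.C", "        op_showhead == ShowNoHeads)"),
]
print("\n".join(dos_style(matches)))
print("\n".join(unix_style(matches)))
```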

The output options give you a lot of control over what GREP produces, but they can be confusing. Here's the executive summary:

Now, in alphabetical order, here are the options that control what GREP outputs and how it is formatted.

/B -- Display a Header for Every File Scanned

      

Display a header for every file examined, even if the file contains no matches. (This option is meaningful only with DOS-style output, when the /U option is not set.)

/C -- Display Only the Count of Matches

      

Display only a count of the matching lines in each file, instead of the matching lines themselves.

Lines are counted, not matches. If a match occurs several times on a line, or several regexes match the same line, the line is counted only once. You cannot use the /C option to get a full count of the number of matches in the file, unless you know that the match doesn't occur more than once on any line.

(For binary files, read "record or buffer" for "line". For free-form binary, the buffer size may affect how many matching buffers are found, since multiple occurrences in one buffer are counted only once.)
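
The difference between counting lines and counting matches is shown in this short Python sketch. The two function names are invented, and Python's re module stands in for GREP's regex engine.

```python
import re


def count_matching_lines(lines, regexes):
    """What /C reports: each line counts once, however many matches it has."""
    return sum(1 for line in lines if any(re.search(rx, line) for rx in regexes))


def count_all_occurrences(lines, regex):
    """A full count of non-overlapping occurrences, which /C does not give you."""
    return sum(len(re.findall(regex, line)) for line in lines)


lines = ["fox fox fox", "brown fox", "no match here"]

print(count_matching_lines(lines, ["fox", "brown"]))  # 2
print(count_all_occurrences(lines, "fox"))            # 4
```

The first line contains three matches and the second line matches both regexes, yet each contributes just 1 to the line count.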

/H -- Don't Display Filenames in Output

      

Don't display any filenames as headers.

The /H option is most appropriate when you're using GREP as a filter to extract lines from a file for processing by another program, like this:

    grep /H "Directory" <inputfile | other program

If you want to keep the file name with each extracted line, use the /U option.

/J -- Display Just the Part of Each Line That Matches

      

Display just the portion of each line that matches the input regex, not the whole line containing a match. If a given line contains multiple occurrences of the regex, only the first occurrence will be displayed.

The /J option behaves similarly for binary files (/R2 or /R3 option): it displays only the portion of each binary record or buffer that matches the regex. If more than one match occurs in the record or buffer, GREP displays only the first.

If you specify multiple regexes (/F option), GREP displays the part of the line/record/buffer that matches the first matching regex; however, if you also specify the /Y option (all regexes must match), then GREP displays the part of the line/record/buffer that matches the last regex.
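
The rules for which part gets displayed can be modeled as follows. This is a Python sketch of the behavior described above, not GREP's code; matched_part and its y parameter are invented names.

```python
import re


def matched_part(line, regexes, y=False):
    """Return the text /J would display for this line, or None if the line
    is not reported.

    Without /Y: the first occurrence of the first regex that matches.
    With /Y: all regexes must match, and the part matching the last
    regex is shown. Hypothetical helper for illustration only.
    """
    if not regexes:
        return None
    found = [re.search(rx, line) for rx in regexes]   # first occurrence of each
    if y:
        return found[-1].group(0) if all(found) else None
    for match in found:
        if match:
            return match.group(0)
    return None


regexes = ["qu[a-z]+", "f.x"]
print(matched_part("The quick brown fox", regexes))          # quick
print(matched_part("The quick brown fox", regexes, y=True))  # fox
print(matched_part("Crazy like a fox", regexes))             # fox
print(matched_part("Crazy like a fox", regexes, y=True))     # None
```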

The /J option is not allowed with the /V option: it doesn't make any sense to display only non-matches but display the part of each line that was a match.

/L -- Report Only Names of Files That Contain Matches

      

Display only a bare list of the names of files that contain matches, not the actual lines that match.

The /L option and /V option together will display the names of files that don't contain any matches.

/N -- Prefix Line Numbers to Matching Lines

      

Show the line number before each matching line. DOS-style output with the /N option looks like this:

    ---------- GREP.C
    [ 144]        op_showhead = ShowNoHeads;
    [ 178]        else if (op_showhead == ShowNoHeads)
    [ 366]        op_showhead = ShowNoHeads;

    ---------- GREP_MAT.C
    [  98]        op_showhead == ShowNoHeads)

With /N and the /U option used together, the UNIX-style output looks like this:

    GREP.C:144:        op_showhead = ShowNoHeads;
    GREP.C:178:        else if (op_showhead == ShowNoHeads)
    GREP.C:366:        op_showhead = ShowNoHeads;
    GREP_MAT.C:98:        op_showhead == ShowNoHeads)

UNIX-style output is suitable for use with the excellent freeware editor Vim.

When displaying a buffer from a free-format binary file -- either under the /R3 option or because you specified the /R-1 or /R-2 option and GREP sensed that the file was binary -- the line number is replaced by a byte number, in hex, with a leading "b" for "byte". The first byte in the file is numbered 0.

/Pbefore,after -- Show Context Lines around Matching Lines

      

Show context lines before and after each match. If you omit after, GREP will show the same number of lines after each match as before. Plain /P is the same as /P2,2.

Either number can be 0. For instance, use /P0,4 if you want to show every match and the four lines that follow it.
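
The defaulting rules for the two numbers can be summed up in a few lines of Python. The function parse_p is invented for illustration; its argument is whatever follows /P on the command line.

```python
def parse_p(arg: str):
    """Return (before, after) for the text that follows /P.

    ""    -> (2, 2)   plain /P
    "3"   -> (3, 3)   after omitted, so it equals before
    "0,4" -> (0, 4)
    Hypothetical helper for illustration only.
    """
    if arg == "":
        return (2, 2)
    before_text, _, after_text = arg.partition(",")
    before = int(before_text)
    after = int(after_text) if after_text else before
    return (before, after)


for arg in ("", "3", "0,4", "1,1"):
    print(repr(arg), parse_p(arg))
```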

If you use the /P option, you probably want to use the /N option as well, to display line numbers. In that case, the punctuation of the line numbers will distinguish which lines are actual matches and which are displayed for context. Here is some DOS-style output from a run with the options /P1,1N set:

    ---------- GREP.C
      143     if (opcount >= argc)
    [ 144]        op_showhead = ShowNoHeads;
      145
      177             PRTDBG "with each matching line");
    [ 178]        else if (op_showhead == ShowNoHeads)
      179             PRTDBG "NO");
      365     if (myToggle('L') || myToggle('U') || myToggle('H'))
    [ 366]        op_showhead = ShowNoHeads;
      367     else if (myToggle('B'))

    ---------- GREP_MAT.C
       97         op_showwhat == ShowMatchCount ||
    [  98]        op_showhead == ShowNoHeads)
       99         headered = TRUE;

As you can see, the actual matches have square brackets around the line numbers, and the context lines do not. (In UNIX format, with the /U option in addition to /N and /P, GREP displays colons around the numbers of matching lines and spaces around the numbers of context lines.)
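
For readers who want to imitate this display in their own scripts, here is a Python sketch that selects context lines and marks matches with square brackets. It is an approximation only: with_context is an invented name, and the column widths are a guess at the layout shown above and not GREP's exact format.

```python
import re


def with_context(lines, regex, before=2, after=2):
    """Return numbered output lines: matches as "[ nnn]", context as "  nnn ".

    Overlapping context ranges are merged, so no line is shown twice.
    Hypothetical helper; spacing only approximates GREP's output.
    """
    hits = {i for i, line in enumerate(lines) if re.search(regex, line)}
    shown = set()
    for i in hits:
        shown.update(range(max(0, i - before), min(len(lines), i + after + 1)))
    out = []
    for i in sorted(shown):
        number = f"[{i + 1:4d}]" if i in hits else f" {i + 1:4d} "
        out.append(f"{number} {lines[i]}")
    return out


source = ["alpha", "beta", "op_showhead = 1;", "delta", "epsilon", "zeta"]
print("\n".join(with_context(source, "op_showhead", before=1, after=1)))
```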

Interactions between the /P option and the /R option:

  • With the /R2 option, GREP will display the indicated numbers of binary records before and after any record that contains a match.
  • With the /R3 option, the /P option is not allowed.
  • With the /R-1 or /R-2 option, GREP will honor the /P option when reading text files but ignore it when reading binary files.

GREP16 has to allocate space for the preview lines within the same 64 K data segment as all other data. Consequently, if you specify a moderately large value, particularly with a large line width (/W option), you may get a message that GREP can't allocate space for the lines. To resolve this, use GREP32 if possible; otherwise reduce either the line width or the first number after /P (the before number). The second number, after, has no effect on memory use.

/U -- UNIX-style Output

      

Show the filename with each matching line, instead of just once in a separate header. This UNIX-style output is useful with editors like Vim that can automatically jump to the file that contains a match. Some examples of UNIX-style output were given at the beginning of this section.

There's one small difference from UNIX grep output: UNIX grep suppresses the filename when there is only one input file, but GREP assumes that if you didn't want the filename you wouldn't have specified the /U option. Neither GREP nor UNIX grep displays a filename if input comes from a file via < redirection.

In addition to these options, under the /R2 or /R3 option GREP reads files in binary mode, and that has a side effect on the output format.

Some combinations of output options are logically incompatible. For instance, /H/L makes no sense (don't list filenames, and list the names of files that contain matches). In such cases, GREP will turn off one of the incompatible options and tell you what it did (unless you suppress such messages with the /Q2 or /Q3 option). The incompatibilities are just common sense, but are listed here for completeness:
       /B   overrides /H but is ignored with /L or /U
       /C   overrides /H, /J, /L, /N, /P
       /H   ignored with /B, /C, /L, /U
       /J   ignored with /C, /L
       /L   overrides /B, /H, /J, /N, /P, /U but is ignored with /C
       /N   ignored with /C or /L
       /P   ignored with /C or /L
       /U   overrides /B and /H but is ignored with /L

General Options

/Dfile or /D or /D- -- Display Debugging Output

      

Debugging information includes whether you're running GREP16 or GREP32, whether the program is registered, the contents of the environment variable, the values of all options specified or implied, the files specified, the raw and interpreted values of the regex(es), details of every file scanned, execution timings, and more. This information is normally suppressed, but you may find it helpful if GREP seems to behave in a way you don't expect or if you have a bug report.

Since the debugging information can be voluminous, if you want to see it at all you will usually want to specify an output file:

/Dfile
  Write all debug information to the named file. file must follow the D with no intervening space, and the filename ends at the next space. GREP will append to the file if it already exists.
 
/D
  Send debugging information to the standard error output (normally the screen). Be careful not to specify any other options between /D and the next space, or they'll be taken as a filename.
 
/D-
  Send debugging information to the standard output, which you can redirect (>) or pipe (|). This intersperses debug information with the normal output of GREP.

You can weed through the debugging output to some extent. GREP writes the following unique strings on most lines of output, so you can send debug output to a file and then grep the file for

/Qlevel -- Suppress the Logo and Unwanted Warnings

      

Set the quietness level, to suppress messages you may not want to see.

/Q0
  (default) Show all messages.
 
/Q1
  Suppress the program logo; all warnings will still appear.
 
/Q2
  Suppress the program logo, as well as warnings about invalid combinations of options. Warnings about missing files will still appear, as will the warning about lines that were broken in the middle, possibly missing matches (see the /W option).
 
/Q3
  Suppress the program logo and all warnings. This level is not recommended unless you definitely know what you're doing, because you might miss important error messages about your input files.

Fatal error messages (those that force GREP to stop execution) will always be displayed. Debug output will also be displayed, if you set the /D option, regardless of the /Q setting.

For compatibility with earlier releases of GREP, you can still specify a plain /Q option with no level number, and it means /Q3 (suppress all warnings), just as in earlier releases. A plain /Q after an earlier /Q or /Qlevel re-enables all messages.

(The /Q option is available only in the registered version.)

/Z -- Reset All Options

      

Reset all options to their default values.

If you use the /Z option on the command line, any options in the environment variable will be disregarded, and so will any preceding options on the command line. This can be useful in batch files, to make sure that the action of GREP is controlled only by the options on the command line, and not by any settings in the environment variable.

The /Z option is the only single-letter option whose effect can't be reversed. If you use /Z more than once, GREP disregards the environment variable and all command-line options up through the last /Z.

/0 or /1 -- Set ERRORLEVEL to Show Whether Matches Were Found

      

These options control the values that GREP returns in the DOS error level. /0 returns 0 if there are matches or 1 if there are no matches; /1 returns 1 for matches or 0 for no matches. For more details, see Return Values in the user guide.

/? -- Display Help

      

Display a help message and summary of options and regex forms, then exit with no further processing. The help message is longer than 25 lines, so you probably want to pipe it through more or a similar filter, like this:

        grep /? | more

You can also redirect this information. For instance,

        grep /? >prn

will send the help text to the printer.

Environment Variable

If you use certain options frequently, with the registered version of GREP you can put them in the ORS_GREP environment variable. You have the same freedom as on the command line: leading slashes or hyphens, space separation or options run together, caps or lower case.

Only options can be put in the environment variable. If you want to store a regex, put it in a file and put /Ffile in the environment variable.

If you have some options in the ORS_GREP environment variable but you don't want one of them for a particular run of GREP, you don't have to edit the environment variable. You can make most changes on the command line, like this:

Extended example: Suppose you have set the environment variable as

        set ORS_GREP=/UNI

because you usually run GREP with UNIX-style output (/U option) with line numbers (/N option), ignoring case of letters (/I option).

If you want to run case sensitive for one particular run of GREP, simply put the /I option on the command line to reverse the setting from the environment variable.

If you don't know what's in the environment variable, perhaps because you're on an unfamiliar machine, either put the /Z option on the command line followed by the options you want, or set them positively by specifying for instance /L+.

Finally, if you want to turn an option definitely off, without regard to the environment variable, turn it on and then toggle it. To turn off line numbers, /N+N will always work, whether N was set in the environment variable or not. (/N- might be more logical, but for historical reasons options with leading minus signs are allowed to run together, and such a usage would conflict.)
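The toggling rules in this section can be modeled in a few lines. The following Python sketch is a simplified model, not GREP's actual parser: the function apply_options is hypothetical, and it assumes only simple on/off single-letter options, "/" or space as separators, a trailing + to force an option on, and /Z to reset everything read so far.

```python
def apply_options(settings, text):
    """Return the set of options that are on after reading text.

    A bare letter reverses the current setting, a letter followed by "+"
    turns the option on, and Z discards everything read so far.
    """
    result = set(settings)
    text = text.upper()          # caps or lower case are equivalent
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch in "/ ":           # separators carry no meaning here
            continue
        if ch == "Z":            # /Z: forget all earlier settings
            result.clear()
            continue
        if i < len(text) and text[i] == "+":
            result.add(ch)       # explicit "on"
            i += 1
        else:
            result ^= {ch}       # toggle
    return result


environment = apply_options(set(), "/UNI")          # set ORS_GREP=/UNI
case_sensitive_run = apply_options(environment, "/I")
numbers_off = apply_options(environment, "/N+N")
clean_slate = apply_options(environment, "/Z /L+")
```

In this model /N+N leaves line numbers off whether or not N was already set, which is the behavior the paragraph above relies on.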

If you're ever in doubt about the interaction of options between the command line and the environment variable, simply add "/d- | more" to the end of your command line and GREP will tell you all the option settings in effect and how it interprets your regex.


 

Regular Expressions (Regexes)


A regular expression or regex is a pattern of characters that will be compared to lines from one or more input files. A line from an input file is a match if the line, or part of it, agrees with the pattern in the regex.

A regex can be a simple text string, like mother, or it can include a bunch of special characters to express possibilities like "repeated" and "any of these characters or substrings". (If you want to search only for simple strings, use the /E0 option and ignore all this regex stuff.)

The rest of this reference manual tells you how to construct regular expressions. So much detail can be overwhelming on a first or even a second reading; therefore, you may want to begin by ignoring everything about extended regexes. You may also want to refer to the examples below periodically. On the other hand, if you're already comfortable with regexes, you'll find additional material and tips in Mastering Regular Expressions by Jeffrey Friedl (O'Reilly & Associates).

Overview

A regex is a mix of normal characters and special characters. Here's an overview of the special characters, with hyperlinks to the places in this reference manual where they are discussed in detail.

The following characters are special if they occur outside of square brackets:

The following characters are special if they occur within square brackets:

Otherwise, every character is a normal character. Any of the above characters also becomes a normal character if preceded by a backslash, as will be shown below.

Examples

Basic and Extended Regexes

GREP offers two levels of regular expressions. Extended regexes have the greater power, but at a price: in some circumstances they can be slow in searches. Basic regexes offer a "core subset" of the regex capabilities. The discussion below will mark certain features as "extended regex"; all others are common to basic and extended regexes.

You'll see a full rundown on extended regexes below, but some of the features they offer are | alternatives, { } quantifiers, and ( ) subexpressions. If you want to use extended regexes, specify the /E2 option, available only in GREP32.

Normally, GREP treats your regexes as basic, since that's the only kind there was before release 6.0. Special characters listed below as "extended regex" are treated as normal characters in basic regexes.

Support for extended regular expressions, added to GREP release 6.0, is provided by the PCRE library package, release 3.5, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England. The primary site for PCRE is <ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/>. (GREP's price did not change upon the addition of this new capability.)

Compatibility Note

Different utilities define regexes differently; the following sections tell you how this GREP defines them. You can find fascinating tables of different interpretations in Jeffrey Friedl's book Mastering Regular Expressions (pages 63 and 182-183 of the 1997 edition).

A note to UNIX or Vim veterans: This GREP follows the Perl or egrep scheme, which uses | not \| for alternatives, ( ) not \( \) for subexpressions, \b not \< \> for word boundaries. Be alert to differences from the scheme you may know.

Normal Character (any regex)

Any normal character matches itself. Example: the regex abc matches input lines that contain the three consecutive characters a, b, and c.

You can use any character from space through character 255. When using 8-bit characters or certain special characters on the command line, see Special Rules for the Command Line below.

If you specify the /I option, any letter in your regex will match both the upper and lower case of that letter. (By default, only unaccented English letters A-Z and a-z are affected by the /I option. In GREP32, you can use the /M option to select a mapping that includes all letters.)

If you want to match a special character, you must precede it with a backslash \ in your regex. Example: to search for the string "^abc\def", you must put backslashes before the two special characters to make GREP treat them as normal characters and not give them special meanings: use \^abc\\def as your regex.

. for Any Character (any regex)

The period (full stop or dot) in a regex normally matches any character. Example: o.e matches lines that contain "ode", "one", "ope", "ore", and "owe". Of course it also matches lines that contain "oae", "o e", "o$e", "o´e", and so on.

If you want to match a literal period, for instance to search for "3.50", you need a backslash before the period in your regex to turn it into a normal character (3\.50).

In binary mode, the period matches any character including ASCII 0, Ctrl-Z, carriage return, and line feed. In text mode, ASCII 0 and the rest of the line are ignored; Ctrl-Z is end of file, and carriage return or line feed marks a line break.

A period between square brackets is just a normal character. For example, [.?!] matches any of the characters that end an unquoted sentence.

* or + for Repetition (any regex)

A plus sign (+) after a character, character class, subexpression, or back reference matches one or more occurrences; an asterisk (*) matches zero or more occurrences. In other words, the plus sign means "one or more" and the asterisk means "any number, including none at all".

(The note on greediness below applies to * and + in extended regexes.)

Example: Big.*night matches lines that contain "Big" followed by any number of any character followed by "night". Since "any number" includes "zero", that regex also matches lines that contain "Bignight".

Examples: snor+ing matches lines that contain "snoring", "snorring", "snorrring", and so on, but not "snoing". snor*ing matches "snoing", "snoring", and so on.

Used with a character class or character type, the plus sign and asterisk match any multiple characters in the class, not only multiple occurrences of the same character. For instance, sno[rw]+ing matches lines that contain "snowing", "snorwing", "snowrring", and so on.

Obligatory example: [A-Za-z_][A-Za-z0-9_]* matches a C or C++ identifier, which is an English letter or underscore, possibly followed by any number of letters, digits, and underscores. (The square brackets enclose character classes.)
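The repetition examples in this section can be tried outside GREP. The sketch below uses Python's re module, on the assumption that it treats +, *, and character classes the same way for these simple patterns; the helper matching and the word list are made up for the illustration.

```python
import re


def matching(pattern, lines):
    """Return the lines that contain a match for pattern."""
    return [line for line in lines if re.search(pattern, line)]


words = ["snoing", "snoring", "snorring", "snowing", "snorwing"]

plus_hits = matching(r"snor+ing", words)       # needs at least one r
star_hits = matching(r"snor*ing", words)       # the r may be absent
class_hits = matching(r"sno[rw]+ing", words)   # any mix of r and w

# The C identifier pattern picks the first identifier out of a line.
identifier = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
first_identifier = identifier.search("  total_2 = 7").group()
```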

Anything followed by * will always match. For example, the regex .* would match any number of characters including none, meaning that empty and non-empty lines would match. .* is more useful as part of a regex.

Between square brackets, + and * are normal characters. For instance, the regex [*+] will match either of the two common footnote characters.

? for Optional Match (extended regex)

In an extended regex only, the question mark after a character, character class, subexpression, or back reference indicates that the construct is optional. For example, the extended regex move?able matches lines containing "moveable" and "movable", but not "moveeable".

(The note on greediness below applies to ? in extended regexes.)

Anything followed by ? will always match. For example, the extended regex .? would match one character or none. Since every line contains a string of no characters (whether or not there are some additional characters on the line), every line would be a match. ? is more useful as part of an extended regex.

? is a normal character when it occurs within square brackets in an extended regex; it's always a normal character in a basic regex.
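A quick way to see the optional match at work is the following Python sketch. It assumes Python's re module handles ? the same way as the extended regexes described here; the variable names are illustrative only.

```python
import re

optional_e = re.compile(r"move?able")
spellings = ["movable", "moveable", "moveeable"]
found = {word: bool(optional_e.search(word)) for word in spellings}

# A pattern that is entirely optional matches even an empty line.
empty_line_matches = bool(re.search(r".?", ""))
```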

{...} for Repetition (extended regex)

In an extended regex only, you can use braces (also called curly braces) after a character, character class, subexpression, or back reference to specify repetition. The general form is {minimum,maximum} where both numbers are in the range 0 to 65535 and minimum <= maximum. Here are the three variations:

  {n}   exactly n occurrences
  {n,}   n or more occurrences
  {n,m}   at least n and at most m occurrences

Three special cases of quantifiers have already been discussed. The asterisk * is equivalent to {0,}; the plus sign + is equivalent to {1,}; and the question mark ? is equivalent to {0,1}.

The braces are normal characters in other contexts. For instance, {,3} is just four normal characters because it doesn't match any of the three variations listed above. The braces are always normal characters inside square brackets, and the right brace on its own is always a normal character. Both braces are normal characters anywhere in a basic regex.
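The equivalences between *, +, ? and their brace forms can be spot-checked with a short Python sketch. This assumes Python's re module agrees with the extended regexes on these forms; it deliberately avoids {,3}, which Python treats differently from the rule given above. The helper equivalent is hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa"]


def equivalent(first, second):
    """True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(first, s)) == bool(re.fullmatch(second, s))
        for s in samples
    )


star_same = equivalent(r"a*", r"a{0,}")
plus_same = equivalent(r"a+", r"a{1,}")
optional_same = equivalent(r"a?", r"a{0,1}")

# A bounded quantifier: two or three x's, nothing else.
bounded = [s for s in ["x", "xx", "xxx", "xxxx"] if re.fullmatch(r"x{2,3}", s)]
```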

Greedy Quantifiers (extended regex)

(This is an advanced topic, probably best skipped on the first few readings of this reference manual.)

In an extended regex, you can control the "greediness" of the quantifiers { }, ?, *, and +. Normally they are all greedy, but you can make them ungreedy by putting a question mark after them: { }?, ??, *?, and +?.

What does it mean to say that quantifiers are greedy? It means that they will match as much as possible of an input line without causing the regex to fail. For example, consider the extended regex <.*> when it tries to match against some HTML like this:

        <hr><p align=center>some text <b>in bold</b> and some not

Since the * is greedy, it consumes every character from the "h" in column 2 through the "b" just before the last ">" on the line. If the * were ungreedy, it would consume the minimum characters that would let the regex match, namely just the initial "hr". To make a quantifier ungreedy, put a question mark ? after it.

Example: You can illustrate the difference by copying the above line into file TEST and then matching it against the extended regexes <.*> and <.*?> which differ by just the question mark. Since < and > are special characters to DOS, you would either use the /F option to enter those extended regexes, or enter them on the command line with encoding sequences like this:

        grep /e2 \x3c.*\x3e  <test
        grep /e2 \x3c.*?\x3e <test

(\x3c and \x3e are the encoding sequences for the < and > characters.)

The first extended regex is greedy (by default), and the matching string is

        <hr><p align=center>some text <b>in bold</b>

In the second one, the * quantifier is made ungreedy by the following ?, and the matching string is

        <hr>

Unless you're using the /J option to display just the matching part of the line, why do you care about greediness? To be honest, you probably don't. In ordinary extended regexes, it doesn't make much difference whether a quantifier is greedy or not, since a greedy quantifier consumes as many characters as possible without causing the regex to fail and an ungreedy one consumes as few as possible but again without causing the regex to fail. Either way, if a match can be squeezed out it will be.
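The same comparison can be reproduced outside GREP. The following sketch uses Python's re module, on the assumption that its quantifiers follow the same Perl-style greedy and ungreedy rules; it is an illustration, not GREP's own code. Both patterns find a match in the sample line, and only the matched text differs.

```python
import re

line = "<hr><p align=center>some text <b>in bold</b> and some not"

# Greedy: .* runs to the last ">" that still lets the regex match.
greedy = re.search(r"<.*>", line).group()

# Ungreedy: .*? stops at the first ">" that lets the regex match.
ungreedy = re.search(r"<.*?>", line).group()

print(greedy)
print(ungreedy)
```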

Where greediness can make a difference is in more complex extended regexes with capturing subexpressions and back references.

[...] for Character Class (any regex)

To match any one of a group of characters, enclose them in square brackets [ ]. Examples: [aA] matches an upper- or lower-case letter A; sno[wr]ing matches lines that contain "snowing" or "snoring".

Immediately after the opening [ or [^, a right square bracket is just a normal character: []abc] matches the character ], a, b, or c. A right square bracket after a left square bracket and at least one other character ends the character class, though as always you can use a backslash to make it normal: [abc\]] is the same character class as the preceding. Finally, a right square bracket with no preceding left square bracket is a normal character.
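The bracket rules above can be checked with a small Python sketch. It assumes Python's re module follows the same convention for a right square bracket placed first in a class; the helper members is hypothetical and simply lists which printable ASCII characters a class accepts.

```python
import re


def members(char_class):
    """Printable ASCII characters matched by a one-character class."""
    return {chr(c) for c in range(32, 127) if re.fullmatch(char_class, chr(c))}


snow_or_snore = [
    word for word in ["snowing", "snoring", "snoking"]
    if re.search(r"sno[wr]ing", word)
]

bracket_first = members(r"[]abc]")     # ] right after [ is a normal character
bracket_escaped = members(r"[abc\]]")  # same class, written with a backslash
```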

In an extended regex, certain abbreviations and class names are available for commonly used classes.

        - for Character Range (any regex)

You can indicate a character range with the minus sign or hyphen (-, ASCII 45).

Examples: [0-9] will match any single digit, and [a-zA-Z] will match any English letter. To match any Western European letter (under most recent versions of Windows, in North America and Western Europe), use the basic regex

        [a-zA-ZÀ-ÖØ-öø-ÿ]
(Note 1. That regex will work fine on the command line with GREP16 or in a file [/F option] with either GREP. But to enter it on the command line with GREP32, you must use numeric sequences for the 8-bit characters; see Special Rules for the Command Line below.)

(Note 2. In GREP32, you're better off to set an appropriate character mapping with the /M option and use an extended regex. The named character class [[:alpha:]] can then replace the above mess.)

A character class can contain both ranges and single characters, mixed any way as long as each range within the class is written low-high: T-f is fine since they are ASCII 84 and 102, but f-T is invalid.

There is no difference to GREP between writing out all the characters in a range and using the minus sign to abbreviate: [pqrsty] and [ytsrpq] and [yp-t] and [yq-stp] are just some of the ways to write the same class.

The minus sign is a normal character outside square brackets. It is also a normal character if it occurs at the beginning or end of a class (immediately after the opening [ or [^ or immediately before the closing ] character).
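To confirm that the different spellings really describe one class, the sketch below enumerates the characters each class accepts. It is written in Python and assumes that re handles ranges the same way for these cases; note that the rejection of the reversed range f-T shown here is Python's behavior, offered only as a parallel to the rule above.

```python
import re


def members(char_class):
    """Characters 0-255 matched by a one-character class."""
    return {chr(c) for c in range(256) if re.fullmatch(char_class, chr(c))}


spellings = ["[pqrsty]", "[ytsrpq]", "[yp-t]", "[yq-stp]"]
all_same = all(members(s) == set("pqrsty") for s in spellings)

# A hyphen at the beginning or end of a class is a normal character.
hyphen_first = members("[-az]")
hyphen_last = members("[az-]")

# A range written high-low is rejected.
try:
    re.compile("[f-T]")
    reversed_range_accepted = True
except re.error:
    reversed_range_accepted = False
```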

        Negative Character Class (any regex)

To match any character that is not in a class, use square brackets with a caret or circumflex (^, ASCII 94) immediately after the opening bracket.

Examples: [^0-9 ] matches any character except a digit or a space, and the[^a-z] matches "the" followed by anything except a lower-case letter.

Note: The negative character class matches any character not within the square brackets, but it does match a character. (It might help to read it as "a character that isn't ..." rather than just "not ...".) For instance, the[^a-z] matches "the" followed by something other than a lower-case letter, but it does not match "the" at the end of a line because then "the" is not followed by any characters. For further explanation, please see the lengthy example under the rules for ^ and $, below.

The caret has a different meaning when it occurs outside square brackets. And when it occurs within square brackets but not immediately after the opening left square bracket, the caret is a normal character.
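The point that a negative class must still match some character is easy to verify. The sketch below uses Python's re module, assuming it treats negative classes as described here; the sample lines are invented.

```python
import re

pattern = re.compile(r"the[^a-z]")

followed_by_space = bool(pattern.search("the end"))
followed_by_letter = bool(pattern.search("then"))
at_end_of_line = bool(pattern.search("I saw the"))  # nothing follows "the"

# A caret that is not first in the class is just a normal character.
caret_is_literal = bool(re.fullmatch(r"[a^]", "^"))
```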

        Character Class and Case-Blind Matching (any regex)

If you use the /I option to specify case-blind matching, then the character class [abc] matches an upper-case or lower-case a, b, or c. With the /I option in effect, [^abc] matches any character except A, a, B, b, C, or c.
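A short Python sketch shows the same interaction between case-blind matching and character classes. Python's re.IGNORECASE flag stands in for the /I option here, which is an assumption about equivalent behavior rather than a statement about GREP's internals.

```python
import re

flag = re.IGNORECASE  # plays the role of the /I option in this sketch

upper_in_class = bool(re.fullmatch(r"[abc]", "B", flag))
upper_in_negative = bool(re.fullmatch(r"[^abc]", "B", flag))
other_in_negative = bool(re.fullmatch(r"[^abc]", "d", flag))
```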

        Character Class Names (extended regex)

Extended regexes support POSIX character class names, such as [:lower:] for any lower-case letter and [:^lower:] for any character except a lower-case letter. Notice that you can negate a character class name by putting a caret after the first colon.

These are not character classes, but special names that you can insert within square brackets as (part of) a character class. For instance, the extended regex [AB[:^alpha:]] matches any non-alphabetic character or a capital A or B.

Here is the complete list of POSIX character class names. Remember that they occur inside the normal square brackets for a character class. Also remember that they must be surrounded by [: :], or [:^ :] for negation.

  word   any "word" character (letters, digits and underscore, same as \w)
  alnum   any letter or digit
  alpha   any letter
  lower   any lower case letter
  upper   any upper case letter
  digit   any decimal digit (same as \d)
  xdigit   any hexadecimal digit, decimal digits plus A-F and a-f
  space   any white space character (same as \s)
  graph   any printing character, excluding space
  print   any printing character, including space
  punct   any printing character, excluding letters and digits and the space character
  ascii   any ASCII character (see note below)
  cntrl   any control character

The exact definitions of the above classes will depend on the character mapping in effect. In the default C locale, the above classes match only 7-bit characters (character positions 0-127); in other mappings, 8-bit characters also match. You can set the character mapping with the /M option. Use the supplied file TEST255 to test the meaning of any character class in your selected locale; see examples in the supplied DEMO.BAT file.

^ and $ for Start and End of Line (any regex)

A caret or circumflex (^, ASCII 94) at the start of a regex means that the pattern starts at the beginning of a line in the file(s) being searched. A dollar sign ($, ASCII 36) at the end of a regex means that the pattern ends at the end of a line in the file(s) being searched.

The caret and dollar are sometimes called anchors because they anchor a regex to the start or end of a line (or both). They're also the two best-known examples of assertions, constructs that match a condition rather than a character.

Examples:

You should probably use ^ and $ only in text mode or record-oriented binary mode. Also, they make sense only at the beginning and end of your regex. For those who prefer to live life on the edge, here are the full rules:

With line-oriented text or record-oriented binary (/R0 or /R2):
  Basic regex: ^ at the start of a basic regex matches the start of a line or record. $ at the end of a basic regex matches the end of a line or record. Both ^ and $ are normal characters everywhere else except within square brackets.
  Extended regex: ^ and $ outside square brackets always mean start and end of a line or record. If you misplace them, your extended regex won't match anything.

With free-form binary (/R3):
  Basic regex: ^ and $ outside square brackets don't match anything useful.
  Extended regex: ^ and $ outside square brackets match a newline (ASCII 10).

When GREP senses file format (/R-1 or /R-2):
  Basic or extended regex: Don't use ^ and $ in a regex with the /R-1 or /R-2 option. If you do use them, they work correctly in text files, but in binary files they match the start and end of every buffer, arbitrary file positions that are not likely to be useful.

It's a historical artifact that the rules for basic and extended regexes are not quite the same.

       Lengthy Example

Suppose you want to find the word "the" in a file, whether in caps or lower case. You can use the /I option to make the search case blind, and concentrate on constructing the regexes.

At first glance, [^a-z]the[^a-z] seems adequate: anything other than a letter, followed by "the", followed by anything but a letter. That lets in "the" and rules out "then" and "mother". But it also rules out "the" at the beginning or end of a line. (Remember that a negative character class does insist on matching some character. Read it as "any character other than ..." rather than as simply "not...".) The solution with basic regexes requires four of them, for "the" at the beginning, middle, or end of a line, or on a line by itself:

        ^the[^a-z]
        [^a-z]the[^a-z]
        [^a-z]the$
        ^the$
To search for just the occurrences of the word "the", put those four lines in a file and then use the /F option on GREP.

But this becomes much easier if you use the power of extended regular expressions (/E2 option). You can search for the word "the", not embedded in larger words, with one extended regex:

        \bthe\b

Read this as "a word boundary, followed by t-h-e, followed by a word boundary." As you would expect, start and end of line count as word boundaries.

(Technically, there could be a problem with the above regular expression: it would not match "the6" or "the_" since the underscore and the digits are considered "word characters". It's not likely you'd get such sequences in a text file, but if you want to be absolutely precise you should use alternatives with subexpressions in the extended regex

        (^|[^a-z])the($|[^a-z])

Again assuming the /I option, you would read this as "either start of line or a non-letter, followed by t-h-e, followed by either end of line or a non-letter". It's a little bit nasty, but still it's probably easier than typing in four regexes.)
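The three approaches in this example can be compared side by side. The following Python sketch assumes re behaves like the regexes described here for these patterns, with re.IGNORECASE standing in for the /I option; the sample lines and the helper hits are invented for the illustration.

```python
import re

lines = ["the end", "then again", "mother", "over the hill",
         "so says the", "the", "the6 key"]

four_basic = [r"^the[^a-z]", r"[^a-z]the[^a-z]", r"[^a-z]the$", r"^the$"]
word_boundary = [r"\bthe\b"]
one_extended = [r"(^|[^a-z])the($|[^a-z])"]


def hits(patterns):
    """Lines matched by at least one pattern (OR logic, case blind)."""
    return [
        line for line in lines
        if any(re.search(p, line, re.IGNORECASE) for p in patterns)
    ]


basic_hits = hits(four_basic)
boundary_hits = hits(word_boundary)
extended_hits = hits(one_extended)
```

Only "the6 key" separates the results: the word-boundary form skips it because the digit counts as a word character, while the other two forms accept it.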

| for Alternatives (extended regex)

In an extended regex only, the vertical bar (|, ASCII 124) separates two or more alternatives. The extended regex will match lines that contain any of the alternatives. It is legal for an alternative to be empty, and this can be useful in subexpressions.

Example: the extended regex cat|dog will match any input line that contains the string "cat" or "dog".

If you want alternatives for part of an extended regex, use parentheses or round brackets to form a subexpression. See the Lengthy Example immediately above, and more examples in the section on subexpressions.

If you are matching alternatives that must occur at the start or end of a line, the anchor needs to be in each alternative. Example: to match lines that start with "cat" or "dog", use ^cat|^dog as your extended regex. Another way to express that is with a subexpression, ^(cat|dog).

Efficiency note: Alternatives can be slower than character classes. The extended regex bar|bat or ba(r|t) is logically equivalent to the basic or extended regex ba[rt], but the latter will generally execute faster. You may or may not notice any time difference, depending on the speed of your computer and the size of the files that you're searching.
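The anchoring rule and the character-class equivalence can both be checked with a short Python sketch. It assumes Python's re module treats | and ^ as the extended regexes do; the sample lines are made up.

```python
import re

lines = ["cat nap", "hot dog", "dog days", "bobcat"]


def hits(pattern, candidates):
    return [line for line in candidates if re.search(pattern, line)]


each_anchored = hits(r"^cat|^dog", lines)
grouped = hits(r"^(cat|dog)", lines)
half_anchored = hits(r"^cat|dog", lines)   # the anchor applies only to "cat"

# Alternatives versus a character class: same lines selected.
words = ["bar", "bat", "bad", "rebate"]
same_selection = (
    hits(r"bar|bat", words) == hits(r"ba(r|t)", words) == hits(r"ba[rt]", words)
)
```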

Caution: The vertical bar | has special meaning on the DOS command line. If your operating system doesn't let you override that meaning, use the /F- option to enter it from the keyboard, or see Backslash for Character Encoding below.

(...) for Subexpressions (extended regex)

In an extended regex only, the parentheses or round brackets have several uses, but only two will be discussed in this reference manual.

The first use is straightforward: to set up alternatives as part of an extended regex. For example, the extended regex

        the quick (brown fox|white rabbit)
matches lines containing either "the quick brown fox" or "the quick white rabbit". Here's another example, adapted from the PCRE manual page: cat(aract|erpillar|)s matches lines containing "cataracts", "caterpillars", or "cats".
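Both subexpression examples, including the empty alternative, can be tried in Python. This sketch assumes re handles parentheses and empty alternatives the same way as the extended regexes; the test strings are illustrative.

```python
import re

animals = re.compile(r"the quick (brown fox|white rabbit)")
plurals = re.compile(r"cat(aract|erpillar|)s")

quick_hits = [
    text for text in ["the quick brown fox", "the quick white rabbit",
                      "the quick red fox"]
    if animals.search(text)
]

# The empty third alternative is what lets plain "cats" match.
plural_hits = [
    word for word in ["cataracts", "caterpillars", "cats", "cat"]
    if plurals.search(word)
]
```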

The second use of parentheses is to set up a "capturing subpattern", which can be referred to with a "back reference" using a backslash; see below.

For advanced topics not covered in this reference manual, please see Final Thoughts, below.

Parentheses are not special inside square brackets, or anywhere in a basic regex.

The Backslash \

The backslash (\) has quite a number of uses.

        Backslash as Escape (any regex)

First and simplest, when the backslash precedes any non-alphanumeric character it makes that character normal. For example, the regex 2+2 normally matches a string of two or more 2s. (The 2+ construct means "one or more occurrences of the character 2".) If you want to match that middle character as an actual plus sign, you must "escape" it with a backslash: 2\+2.

If you want to match a backslash itself, you escape it in the same way. For example, the regex ^c:\\ matches every line that begins with "c:\".

The backslash functions as an escape both inside and outside of square brackets. If you are not sure when a non-alphanumeric character like ] or $ is special and when it is not, just precede it with a backslash and it will be a normal character, even if it would have been normal anyway.

Example: To match any of the four signs of arithmetic, you might write the regex [+-*/]. But that minus sign has a special meaning inside square brackets. To treat it as a normal character you must escape it with the backslash, like this: [+\-*/].
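The escaping examples above behave the same way in Python's re module, which is used below only as a stand-in. One caveat: the outright rejection of the unescaped class [+-*/] is what Python does; this section does not say how GREP reacts to it.

```python
import re

unescaped_plus = bool(re.fullmatch(r"2+2", "2222"))   # "one or more 2s, then 2"
plus_as_text = bool(re.fullmatch(r"2+2", "2+2"))      # the + is not literal
escaped_plus = bool(re.fullmatch(r"2\+2", "2+2"))     # now it is

drive_root = bool(re.search(r"^c:\\", "c:\\dos\\grep.exe"))

# The four signs of arithmetic, with the minus sign escaped.
signs = {ch for ch in "+-*/.," if re.fullmatch(r"[+\-*/]", ch)}

# Without the backslash the minus sign tries to form a range.
try:
    re.compile(r"[+-*/]")
    unescaped_class_accepted = True
except re.error:
    unescaped_class_accepted = False
```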

This is the only use of the backslash in basic regexes; the rest all relate to extended regexes.

        Backslash for Character Types (extended regex)

Many regexes involve a type of character: digit (or not), blank (or not). While you can always use ordinary character classes, in an extended regex you can also use these shortcuts on their own or as part of a character class:

  \w   any "word" character, meaning any letter or decimal digit or an underscore
  \W   any character except a "word" character
  \d   any of the decimal digits
  \D   any character except a decimal digit
  \s   any whitespace character: tab, space, and so on
  \S   any character except a whitespace character

The exact definitions of the above types will depend on the character mapping in effect. In the default C locale, no 8-bit characters (characters 128-255) are considered as possible "word" characters, digits, or whitespace; in other mappings, some 8-bit characters also match. You can set the character mapping with the /M option. Use the supplied file TEST255 to test the meaning of any character type in your selected locale; see examples in the supplied DEMO.BAT file.

Example: To scan a file for four-digit numbers, your regex could repeat the \d four times or use curly braces: \d\d\d\d or \d{4}.

Did you spot the problem with the preceding example? Yes, either of those extended regexes matches lines containing four-digit numbers. But it also matches lines containing five-digit numbers, since a five-digit number contains four consecutive digits. One way to match numbers of exactly four digits is to mark them as being preceded by start of line or a non-digit, and followed by end of line or a non-digit:

        (^|\D)\d{4}($|\D)
Of course, if you know something about the files you're scanning you may not need to get so elaborate.

Example: To scan for four hexadecimal digits, use the extended regex

        [\da-fA-F]{4}
(This one has the same problem as the previous example: it also matches five or more hex digits. Fixing it is left as an exercise for the reader!)
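The difference between the loose and the exact four-digit patterns shows up clearly on a few sample lines. The sketch below is in Python, assuming re treats \d, \D, and braces like the extended regexes; the sample data is invented.

```python
import re

lines = ["year 2001", "zip 90210", "1234", "call 5551212 now", "room 12"]

loose = re.compile(r"\d{4}")
exact = re.compile(r"(^|\D)\d{4}($|\D)")

loose_hits = [line for line in lines if loose.search(line)]
exact_hits = [line for line in lines if exact.search(line)]

# Character types also work inside a character class.
first_hex_run = re.search(r"[\da-fA-F]{4}", "0x1F4a").group()
```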

        Backslash for Assertions (extended regex)

The assertions in this section look like the above character types, but there's an important difference: a character type matches a character of the specified type, while an assertion matches a position in the line and doesn't "consume" a character. (You already know two examples of assertions, namely the anchors ^ and $.)

  \b   word boundary, namely the transition between a word and a non-word character or vice versa, or the beginning or end of line if the adjacent character is a word character
  \B   not a word boundary
  \A   same as ^: start of input line (text mode) or record (binary mode)
  \z, \Z   same as $: end of input line (text mode) or record (binary mode)

These assertions are not valid inside square brackets, and in fact \b has a different meaning inside a character class; see Backslash for Character Encoding, below.
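That an assertion matches a position rather than a character can be seen in a brief Python sketch. It assumes re gives \b, \B, and \A the same meaning as the extended regexes; \z is left out because Python versions differ on it.

```python
import re

whole_word = bool(re.search(r"\bcat\b", "a cat sat"))
inside_word = bool(re.search(r"\Bcat", "bobcat"))     # "cat" is mid-word
at_word_start = bool(re.search(r"\Bcat", "cat nap"))  # "cat" starts a word

# An assertion consumes nothing: the match is empty.
boundary_span = re.search(r"\b", "word").span()

starts_line = bool(re.search(r"\Athe", "the end"))
```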

        Backslash for Back References (extended regex)

Outside square brackets, a backslash followed by a number > 0 is interpreted as a back reference to a capturing subpattern in the regex. For example, \6 refers to the sixth capturing subpattern in the extended regex.

Example (from the PCRE man page): the extended regex

        (sens|respons)e and \1ibility
matches "sense and sensibility" or "response and responsibility" but not "sense and responsibility". A back reference always refers to the actual matching subpattern in this particular instance, not to just any alternative.

Example: U.S. toll-free area codes are 800, 888, 877, 866 (and soon 855). The regex 8[08765]{2} would be wrong because it would match strings like "867" and "808". You need a back reference to ensure that the third digit is the same as the second: 8([08765])\1 is your extended regex. That says you must have an 8, followed by 0, 8, 7, 6, or 5, followed by a second occurrence of the same digit.
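Both back-reference examples can be exercised in Python. The sketch assumes re handles \1 like the extended regexes, and it writes the capturing group with plain parentheses, as extended regexes require; the list of candidate codes is illustrative.

```python
import re

austen = re.compile(r"(sens|respons)e and \1ibility")
titles = ["sense and sensibility", "response and responsibility",
          "sense and responsibility"]
title_hits = [t for t in titles if austen.search(t)]

codes = ["800", "888", "877", "866", "855", "867", "808"]

# Too loose: the second and third digits may differ.
loose_hits = [c for c in codes if re.fullmatch(r"8[08765]{2}", c)]

# The back reference forces the third digit to repeat the second.
strict_hits = [c for c in codes if re.fullmatch(r"8([08765])\1", c)]
```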

A "back reference" can actually be a forward reference: any of \1 through \9 refers to the first through ninth capturing subpattern in the extended regex, even if that subpattern comes after the "back reference" in the regex. But \10 and greater can refer only to subpatterns that precede the back reference. If something looks like a back reference but the number is greater than 9 and greater than the number of capturing subexpressions to the left of it, it is read as an encoded character in octal.

        Backslash for Character Encoding (extended regex)

The last use of backslash documented in this reference manual is to encode certain characters, either non-printing characters or those that DOS doesn't allow on the command line. These rules are nasty, and I recommend you use the /F option, if possible, to enter a messy regex from the keyboard or in a file.

(Please note that these rules for extended regexes are quite different from the Special Rules for the Command Line that apply to basic regexes. It's an unfortunate incompatibility, but neither can be changed because PCRE is a supplied library for extended regexes and users rely on existing behavior of basic regexes.)

Except as noted, each of these sequences has the indicated meaning anywhere in an extended regex:

  \a   "alarm", the BEL character, ASCII 7
  \b   backspace, ASCII 8, but only inside square brackets. Outside square brackets it is an assertion.
  \cx   a control character. If x is a letter, it's straightforward: \cb and \cB are both Control-B, ASCII 2. If x is not a letter, it is XORed with 64 (hex 40).
  \e   escape, ASCII 27
  \f   form feed, ASCII 12
  \n   "newline", line feed, ASCII 10. This character will never be seen in a text file, since it marks a line break. It can occur in a binary file.
  \r   carriage return, ASCII 13. This character will never be seen in a text file, since it marks a line break. It can occur in a binary file.
  \t   tab, ASCII 9
  \xhh   character with the given hex code hh (zero, one, or two digits). Examples: \x7c or \x7C is hex 7C (ASCII 124), the | character. \x or \x0 or \x00 is the NUL character, ASCII 0.
  \0dd   octal number of one to three digits. \032 is Control-Z, ASCII 26.
  \ddd   This sequence, one to three digits where the first one is not zero, is complicated. Outside square brackets, it's read as a decimal number and is interpreted as a back reference (above) if possible. Otherwise, or always inside square brackets, it's read as an octal number and the least significant 8 bits are taken as its value. Examples: \7 is a back reference. \11 is a back reference if there have already been eleven capturing subpatterns; otherwise it's octal 11, ASCII 9, the tab character.
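Several of these encodings can be verified against their ASCII values in Python. The sketch covers only the sequences that Python's re module shares with the list above; \e, \cx, and the back-reference-or-octal rule for \ddd are not handled the same way by Python and are left out.

```python
import re

checks = {
    r"\x7c": "|",        # hex 7C, ASCII 124
    r"\032": chr(26),    # octal 32, Control-Z
    r"\t": chr(9),       # tab
    r"\a": chr(7),       # BEL
    r"\f": chr(12),      # form feed
    r"[\b]": chr(8),     # backspace, only inside square brackets
}

results = {
    pattern: bool(re.fullmatch(pattern, char))
    for pattern, char in checks.items()
}
```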

Some Final Thoughts (extended regex)

An extended regex can have further special constructs beyond those documented here. If you need more, please refer to the full documentation at <http://www.pcre.org/man.html>. That document contains information about writing programs using PCRE, as well as complete specification of the PCRE (extended) regexes. For completeness, a copy of it with just the information relevant to GREP users is provided as document PCRE.HTM.

Advanced users may want to consult that document for the following topics:

Special Rules for the Command Line

The special rules below get around three problems:

Therefore GREP defines some special sequences starting with a backslash \ to let you get problem characters into your regex.

These rules date back to a much earlier release of GREP, and while there are better ways available now, the rules are maintained for users who have come to rely on them. Beginning with release 6.0, you can turn the special rules on or off.

When Do You Need the Special Rules?

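You need them only when you enter a regex on the command line and that regex contains characters that DOS or GREP would otherwise misinterpret.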

When you select extended regexes (/E2 option), you probably don't want the special rules given below. Extended regexes come with their own ways of using a backslash for character encoding.

When you have a messy regex, I recommend you use the /F- or /Ffile option to enter it from the keyboard or store it in a file. This gets around the whole problem.

How Do You Turn the Special Rules on or Off?

When you specify the /En option, you also turn the special rules on by following the number n with a backslash (\), or off by omitting the backslash.

Example: /E0 specifies simple string search and turns the special rules off; /E0\ specifies simple string search and turns the special rules on.

If you never specify an /E option, GREP takes your regex as a basic regex and turns on the special rules. This is the same as the /E1\ option, and it was the only choice until GREP release 6.0. If you want GREP to work with basic regexes but without the special rules, specify the /E1 option.
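
To make the defaults concrete, here is a minimal Python sketch of how the /E option and its trailing backslash could be read. It is an illustration only, not GREP's own parsing. The function name is hypothetical, and the sketch assumes that only levels 0, 1, and 2 exist, that option letters are case-insensitive, and that a later /E option overrides an earlier one.

```python
import re


def parse_e_option(args):
    """Return (regex_level, special_rules_on) for a list of arguments."""
    # With no /E option at all, GREP behaves as if /E1\ were given:
    # basic regexes with the special rules turned on.
    level, special = 1, True
    for arg in args:
        # Options may start with a slash or a minus; a trailing
        # backslash turns the special rules on.
        m = re.fullmatch(r"[/-][Ee]([0-2])(\\?)", arg)
        if m:
            level = int(m.group(1))
            special = m.group(2) == "\\"
    return level, special


print(parse_e_option([]))          # default: basic regex, rules on
print(parse_e_option(["/E0"]))     # string search, rules off
print(parse_e_option(["/E0\\"]))   # string search, rules on
```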

What Exactly Are the Special Rules?

If your regex begins with a minus (-) or slash (/), GREP will try to interpret it as an option. Example: if you're searching for the string "-in-law", GREP will think you're trying to turn on the options /I, /N, and so on. To avoid this problem, use a leading backslash (\-in-law).

If your regex contains certain special characters like <, =, and |, DOS will give those characters their special DOS meaning and GREP will never see them. So you must use special "escape sequences" to represent those characters in a regex on the command line, as follows:

instead of           you can use any of
< (less)             \l    \60    \0x3C   \074
> (greater)          \g    \62    \0x3E   \076
| (vertical bar)     \v    \124   \0x7C   \0174
" (double quote)     \"    \34    \0x22   \042
, (comma)            \c    \44    \0x2C   \054
; (semicolon)        \i    \59    \0x3B   \073
= (equal)            \q    \61    \0x3D   \075
(space)              \s    \32    \0x20   \040
(tab)                \t    \9     \0x09   \011
(escape)             \e    \27    \0x1B   \033

You can enter any character as a numeric sequence, not just the special characters in the above list. Use decimal, hex (leading 0x), or octal (leading zero). Example: capital A would be \65, \0x41, or \0101.
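
The substitutions in the table and the numeric forms can be pictured with a short Python sketch. This is an approximation of the behavior described here, not GREP's actual code. The names are hypothetical, and the limits on how many digits are consumed (three decimal digits, two hex digits, a leading zero plus three octal digits) are assumptions. The leading-backslash rule for regexes that begin with - or / is not modeled.

```python
import re

# Letter codes from the table above.
LETTER_CODES = {
    "l": "<", "g": ">", "v": "|", '"': '"', "c": ",",
    "i": ";", "q": "=", "s": " ", "t": "\t", "e": "\x1b",
}

# Hex is tried first, then octal (leading zero), then decimal, and
# finally any single character after the backslash.
_SEQUENCE = re.compile(
    r"\\(0[xX][0-9A-Fa-f]{1,2}|0[0-7]{0,3}|[1-9][0-9]{0,2}|.)"
)


def decode_special(text):
    """Replace the special backslash sequences with the characters they name."""
    def replace(match):
        body = match.group(1)
        if body[:2].lower() == "0x":
            return chr(int(body[2:], 16))   # hex, e.g. \0x41
        if body[0] == "0":
            return chr(int(body, 8))        # octal, e.g. \0101
        if body[0].isdigit():
            return chr(int(body, 10))       # decimal, e.g. \65
        # Unknown sequences are left untouched in this sketch.
        return LETTER_CODES.get(body, match.group(0))
    return _SEQUENCE.sub(replace, text)


print(decode_special(r"\65 \0x41 \0101"))  # three spellings of capital A
print(decode_special(r"a\lb\gc"))          # a<b>c
```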

ASCII 0 is not legal in a basic regex: it causes the rest of the regex to be silently ignored. Either code something like [^\1-\255] ("any character except ASCII 1 to 255") in your basic regex, or use an extended regex.

Finally, if your regex contains 8-bit characters, Microsoft's 32-bit startup code (not DOS) will translate these characters from a DOS character set to a Windows character set, which is probably not what you want. To avoid this problem, you can use the numeric sequences to enter characters. Example: In a regex on the command line, instead of actually typing the character é, enter it as \233 or \0xE9 or \0351. This problem affects GREP32, not GREP16.
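
If you need to work out the numeric sequences for a character, a few lines of Python can do the arithmetic. This sketch is a convenience, not part of GREP. The function name is hypothetical, and it assumes the Windows character set in question is code page 1252; pass a different encoding name if your system uses another one.

```python
def numeric_sequences(ch, encoding="cp1252"):
    """Return the decimal, hex, and octal sequences for one character."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError("character does not fit in a single byte")
    code = encoded[0]
    # Decimal has no prefix, hex uses a leading 0x, octal a leading zero.
    return (f"\\{code}", f"\\0x{code:02X}", f"\\0{code:o}")


print(numeric_sequences("é"))  # the three forms given in the example above
print(numeric_sequences("A"))
```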

Remember, the rules in this section are required only to get around problems with special characters in basic regexes on the command line. These workarounds are not needed, and don't work, when you use the /F option to enter regexes in a file or from the keyboard.

Extended regexes have their own rules; see Backslash for Character Encoding, above. Yes, you can have the special rules with an extended regex (/E2\ option), if you really like using lots of backslashes.

When the special rules are in effect, you can find out how GREP applied them by using the /D option and looking for the "massaged" string or regex.

