https://github.com/sirwumpus/ioccc-ag
Best Utility (grep) - IOCCC 1993 Winner
https://github.com/sirwumpus/ioccc-ag
c grep ioccc posix regular-expressions
Last synced: about 1 year ago
JSON representation
Best Utility (grep) - IOCCC 1993 Winner
- Host: GitHub
- URL: https://github.com/sirwumpus/ioccc-ag
- Owner: SirWumpus
- License: 0bsd
- Created: 2017-05-10T22:40:44.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2023-06-25T11:21:30.000Z (about 3 years ago)
- Last Synced: 2025-04-28T11:24:46.721Z (about 1 year ago)
- Topics: c, grep, ioccc, posix, regular-expressions
- Language: C
- Homepage:
- Size: 21.5 KB
- Stars: 10
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
README
ag - Anthony's Grep
------------------
## 1. SYNOPSIS
ag pattern [file...]
## 2. DESCRIPTION
The AG utility shall search input files, selecting lines matching a
pattern. An input line shall be selected if the pattern, treated as a
subset of Extended Regular Expressions as described by POSIX.2
Standard section 2.8.4, matches any part of the line; a null pattern
shall match every line. Each selected input line shall be written to
the standard output.
Regular expression matching shall be based on text lines, which are
separated by a `` character. A regular expression cannot
contain a `` and there is no way for a pattern to match a
`` found in the input.
## 3. OPTIONS
None.
## 4. OPERANDS
* `pattern` Specify one pattern to be used during the search for
input. See Extended Regular Exprssions below on what
constitutes a valid pattern.
* `file` A pathname of a file to be searched for the pattern.
If no file operands are specified or the file operand
is dash (`-`), the standard input shall be used.
## 5. EXTERNINAL INFLUENCES
### 5.1. STANDARD INPUT
The standard input shall be used only if no file operands are
specifed. See Input Files.
### 5.2. INPUT FILES
The input files shall be text files.
### 5.3. ENVIRONMENT VARIABLES
No environment variables are used.
### 5.4. ASYNCHRONOUS EVENTS
Default.
## 6. EXTERNAL EFFECTS
### 6.1. STANDARD OUTPUT
For each selected input line from a file, a single output line shall
be written:
"%s:%s\n", file, line
Standard input will have the filename dash (`-`).
### 6.2. STANDARD ERROR
Standard error is used for diagnostic messages only.
### 6.3. OUTPUT FILES
None.
## 7. EXTENDED DESCRIPTION
### 7.1. EXTENDED REGULAR EXPRESSIONS
#### 7.1.1. MATCHING A SINGLE CHARACTER
An ERE ordinary character, a special character preceded by a
backslash, or a period shall match a single character. A bracket
expression shall match a single character. An ERE matching a single
character enclosed in parentheses shall match the same as the ERE
without parentheses would have matched.
##### 7.1.1.1. ERE ORDINARY CHARACTERS
An oridinary character is an ERE that matches itself. An ordinary
character is any character in the supported character set, except for
the ERE special characters listed in 7.1.1.2. The interpretation of
an ordinary character preceded by a backslash (`\`) is the ordinary
character.
##### 7.1.1.2. ERE SPECIAL CHARACTERS
An ERE special character has special properties in certain contexts.
Outside of those contexts, or when preceded by a backslash, such a
character shall be an ERE that matches the special character itself.
The extended regular expression special characters and the contexts in
which they have their special meaning are:
`. [ \ (` The period, left-bracket, backslash, and left-
parenthesis are special except when used in a
bracket expression. Outside a bracket
expression, a left-parenthesis immediately
followed by a right-parenthesis will be
treated as a null expression.
`)` The right-parenthesis is special when matched
with a preceding left-parenthesis, both
outside of a bracket expression.
`* ?` The asterisk and question-mark are special
except when used in a bracket expression.
Results are undefined if this character
appears first in an ERE, or immediately
following a vertial-line, circumflex, or
left-parenthesis.
`|` The vertical-line is special except when used
in a bracket expression. A vertical-line
appearing first or last in an ERE or
immediately following a vertical-line or
left-parenthesis, or preceding a
right-parenthesis, produces a null expression.
`^` The circumflex shall be special when used as
an anchor or, as the first character of a
bracket expression.
`$` The dollar-sign shall be special when used as
an anchor.
##### 7.1.1.3 PERIODS IN ERE
A period (`.`), when used outside of a bracket expression, is an ERE
that shall match any character in the supported character set except
NUL.
#### 7.1.2. ERE BRACKET EXPRESSION
A bracket expression (an expression enclosed in square brackets, []),
is an ERE that matches a single character contained in the nonempty
set of characters represented by the bracket expression.
The following rules and definitions apply to bracket expressions:
1) A bracket expression is either a matching list expression or a
nonmatching list expression. It consists of one or more
characters and/or range expressions. The right-bracket (`]`)
shall lose its special meaning and represent itself in a bracket
expression if it occurs first in the list [after an initial
circumflex (`^`), if any]. Otherwise, it shall terminate the
bracket expression. The special characters period (`.`),
asterisk (`*`), left-bracket (`[`), and backslash (`\`) shall
lose their special meaning within a bracket expression.
2) A matching list expression specifies a list that shall match any
one of the expressions represented in the list. The first
character of the list shall not be the circumflex. For example,
[abc] is an ERE that matches any of a, b, or c.
3) A nonmatching list expression begins with a circumflex (`^`),
and specifies a list that shall match any character except for
the expressions represented in the list after the leading
circumflex. For example, `[^abc]` is an ERE that matches any
character except a, b, or c. The circumflex shall have this
special meaning only when it occurs first in the list,
immediately following the left-bracket.
4) A range expression represents the set of characters that fall
between two elements in the current collation sequence,
inclusive. It shall be expressed as the starting point and the
ending point separated by a hyphen (`-`).
In the following, all examples assume the collation sequence
specified for the POSIX Locale (a.k.a ASCII Character Set),
unless another collation sequence is sepecifically defined by
the hardware.
The starting range point and the ending range point shall be a
collating element (character). The ending range point shall
collate higher than the starting range point; otherwise the
expression shall be treated as invalid.
The interpretation of a range expression where the ending range
point is the starting range point of a subsequent range
expression is undefined.
The hyphen character shall be treated as itself if it occurs
first (after an initial `^`, if any) or last in the list, or as
an ending range point in a range expression. As examples, the
expressions `[-ac]` and `[ac-]` are equivalent and match any
of the characters a, c, or -; the expressions `[^-ac]` and
`[^ac-]` are equivalent and match ant character except a, c, -;
the expression `[%--]` matches any of the characters between
`%` and `-` inclusive; the expression `[--@]` matches any of
the characters between `-` and `@` inclusive; and the expression
`[a--@`] is invalid, because the letter `a` follows the symbol
`-` in the POSIX Locale.
#### 7.1.3. ERE MATCHING MULTIPLE CHARACTERS
The following rules shall be used to construct EREs matching multiple
characters from EREs matching a single character:
1) A concatenation of EREs shall match the concatenation of the
character sequences matched by each component of the ERE. A
concatenation of EREs enclosed in parentheses shall match
whatever the concatenation without parentheses matches. For
example, both the ERE cd and the ERE (cd) are matched by the
third and forth character of the string abcdefabcdef.
2) When an ERE matching a single character, or a concatention of
EREs enclosed in parentheses is followed by the special
character asterisk (`*`), together with that asterisk it shall
match what zero or more consecutive occurences of the ERE
would match. For example, the ERE `b*c` matches the first
character in the string cabbbcde, and the ERE b*cd matches the
third through seventh characters in the string
cabbbcdebbbbbbcdbc. And, `[ab]*` and `[ab][ab]` are equivalent
when matching the string ab.
3) When an ERE matching a single character or an ERE enclosed in
parentheses is followed by the special character question-mark
(`?`), together with that question mark it shall match what zero
or one consecutive occurences of the ERE would match. For
example, the ERE `b?c` matches the second character in the
string acabbbcde.
The behaviour of multiple adjacent duplication symbols (*, ?) produces
undefined results.
#### 7.1.4. ERE ALTERATION
Two EREs separated by the special character vertical-line (`|`) shall
match a string that is matched by either. For example, the ERE
`a((bc)|d)` matches the string abc and the string ad. Single
characters, or expressions matching single characters, separated by
the vertical-line and enclosed in parentheses, shall be treated as an
ERE matching a single character.
#### 7.1.5. ERE PRECEDENCE
The order of precedence shall be as shown in the table below, from
high to low:
```
escaped characters \
bracket expression [ ]
grouping ( )
single-character-ERE duplication * ?
concatenation
anchoring ^ $
alteration |
```
For example, the ERE abba|cde matches either the string abba or the
string cde (because concatenation has a higher order of precedence
than alteration).
#### 7.1.6. ERE EXPRESSION ANCHORING
An ERE can be limited to matching strings that begin or end a line;
this is called anchoring. The circumflex and dollar-sign special
characters shall be considered ERE anchors in the following contexts:
1) A circumflex (`^`) outside a bracket expression shall anchor
the (sub)expression it begins to the beginning of a string;
such a (sub)expression can match only a sequence starting at
the first character of a string. For example, the EREs ^ab
and (^ab) match ab in the string abcdef, but fail to match in
the string cdefab, and the ERE a^b is valid, but can never
match because the character a prevents the expression ^b from
matching the first character of a string.
2) A dollar-sign (`$`) outside a bracket expression shall anchor
the (sub)expression it ends to the end of a string; such a
(sub)expression can match only a sequence ending at the last
character of a string. For example, the EREs ef$ and (ef$)
match ef in the string abcdef, but fail to match in the string
cdefab, and the ERE e$f is valid, but can never match because
the character f prevents the expression e$ from matching the
last character of a string.
## 8. EXIT STATUS
The following exit codes will be returned:
- 0 Successful match.
- 1 No match.
- 2 Usage error.
- 3 General error.
## 9. INSTALLATION
AG can be built on any system providing at least K&R C. It has been
tested on
* SunOS with GCC
* ATARI Mega ST with Sozobon C
* PC clone with Turbo C and WatCom C
* Interactive UNIX System V/386 release 3.2
For most machines, the compile command line should be
cc -O -o ag ag.c
PC class machines may require that medium or large memory model be used
and that ARRAY be set to 256. For Watcom C the command line would be
wcl -ml -DARRAY=256 ag.c
The value ARRAY represents the size of the compiled pattern array and
the size of the array used to record matched states. The default
value choosen should handle most patterns. ARRAY can be overridden on
the command line with -DARRAY=nn, where 255 < nn < INT_MAX.
## 10. REFERENCES
* Aho, Sethi, and Ullman, "Compilers - Principles, Techniques,
and Tools", Addison-Wesley, 86, ISBN 0-201-10088-6, chapter 3
* Kernighan and Plauger, "Software Tools in Pascal".
Addison-Wesley, 81, ISBN 0-201-10342-7, chapter 5
* Webb Miller, "A Software Tools Sampler",
Prentice Hall, 87, ISBN 0-13-822305-X, chaper 4
* POSIX.2 Standard, 2.8.4 Extended Regular Expressions, 4.28 Grep
## 11. BUGS
The expression `(a*)*` compiles but loops forever.
There is no check for trailing backslash (`\`) in the pattern.
There is no check for unbalanced brackets. Omitting a closing bracket
will generate a "Patern too long" error, which is not the real error.
## 12. NOTICES
Public Domain 1992, 1993 by Anthony Howe. No warranty.