https://github.com/bokic/textparser
TextParser is a high-performance C library that parses text (CFML and JSON for now) into Abstract Syntax Trees using regex grammars. It is designed for building syntax highlighters, language servers, and other code-related tools.
ast ast-tree json parser pcre2 tokenization
- Host: GitHub
- URL: https://github.com/bokic/textparser
- Owner: bokic
- License: mit
- Created: 2023-11-04T17:44:04.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2026-04-04T23:06:02.000Z (3 months ago)
- Last Synced: 2026-04-05T01:17:38.953Z (3 months ago)
- Topics: ast, ast-tree, json, parser, pcre2, tokenization
- Language: C
- Homepage:
- Size: 503 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Roadmap: ROADMAP.md
README
# TextParser [DeepWiki](https://deepwiki.com/bokic/textparser)
TextParser is a high-performance, extensible text parsing library written in C. It uses regular expressions to define language grammars and generates a hierarchical Abstract Syntax Tree (AST) for parsed documents.
The project currently supports CFML (ColdFusion Markup Language) and JSON, and its flexible architecture makes it easy to add new language definitions.
## Features
- **High Performance**: Written in optimized C for fast parsing of large codebases.
- **Small Footprint**: The library is designed to be small and easy to integrate into other projects.
- **Minimal Dependencies**: The only dependency is the PCRE2 library, used for regex matching.
- **Regex-Based Grammars**: Define language syntax using flexible regular expressions.
- **Hierarchical AST**: Generates a structured tree of tokens (`textparser_token_item`) representing the code structure.
- **Syntax Highlighting Support**: Tokens track metadata like color, background, and flags, making it suitable for building syntax highlighters and editors.
- **Extensibility**: Language definitions are decoupled from the core parsing logic and written in JSON. They can be compiled in (via a generated header file) or loaded at runtime (from a JSON file).
- **Python Tooling**: Includes Python scripts for prototyping and validating the core algorithm, generating C header files, and other parser verification tasks.
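
To make the "Hierarchical AST" feature concrete, the following minimal sketch models a token tree and walks it recursively. The `demo_token` struct and both helper functions are hypothetical names invented for this illustration. The real `textparser_token_item` declared in `textparser.h` may use different fields and layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node layout for illustration only; the real
 * textparser_token_item in textparser.h may differ. */
typedef struct demo_token {
    const char *name;          /* grammar rule that produced the token */
    size_t position;           /* byte offset in the source text */
    size_t length;             /* length of the matched text in bytes */
    uint32_t text_color;       /* 0xRRGGBB foreground color */
    struct demo_token *child;  /* first nested token, or NULL */
    struct demo_token *next;   /* next sibling, or NULL */
} demo_token;

/* Count every token reachable from `item`: the item itself,
 * its following siblings, and all of their descendants. */
size_t demo_count_tokens(const demo_token *item)
{
    size_t count = 0;
    for (; item != NULL; item = item->next) {
        count += 1 + demo_count_tokens(item->child);
    }
    return count;
}

/* Return the deepest nesting level (0 for an empty list). */
size_t demo_max_depth(const demo_token *item)
{
    size_t deepest = 0;
    for (; item != NULL; item = item->next) {
        size_t depth = 1 + demo_max_depth(item->child);
        if (depth > deepest) {
            deepest = depth;
        }
    }
    return deepest;
}
```

A sibling-plus-child layout like this keeps each node small and lets a highlighter visit tokens in document order with a simple depth-first walk.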
## Project Structure
- **`src/`**: Core C library implementation (`textparser.c`, `adv_regex.c`).
- **`include/`**: Public header files (`textparser.h`).
- **`cli/`**: Command-line tool mainly for testing and demonstrating the library.
- **`definitions/`**: Language definitions (e.g., CFML, JSON).
- **`python/`**: Python bindings, prototypes, and validation tools (`validate_cfml.py`).
- **`tests/`**: Unit and integration tests.
- **`ccat/`**: Utilities for text processing (e.g., color cat).
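
Because tokens carry a text color, a color-cat style tool could print each token wrapped in an ANSI 24-bit color escape sequence. The sketch below shows one way to do that. `demo_colorize` is a hypothetical helper written for this example and is not taken from the `ccat/` sources; it assumes colors are stored as `0xRRGGBB` integers and that the terminal supports truecolor escapes.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format `text` into `out`, wrapped in an ANSI
 * 24-bit foreground color sequence followed by a reset. `rgb` is
 * assumed to be 0xRRGGBB. Returns what snprintf returns: the number
 * of characters the full result needs, excluding the terminator. */
int demo_colorize(char *out, size_t out_size, uint32_t rgb, const char *text)
{
    unsigned r = (rgb >> 16) & 0xFFu;  /* red component */
    unsigned g = (rgb >> 8) & 0xFFu;   /* green component */
    unsigned b = rgb & 0xFFu;          /* blue component */

    /* ESC[38;2;R;G;Bm selects the foreground color, ESC[0m resets it. */
    return snprintf(out, out_size, "\x1b[38;2;%u;%u;%um%s\x1b[0m",
                    r, g, b, text);
}
```

For example, calling `demo_colorize(buf, sizeof buf, 0xffd700, "{")` fills `buf` with an opening brace that a truecolor terminal renders in gold.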
## Build Instructions
### Prerequisites
- CMake (version 3.15 or higher)
- Ninja build system
- A C compiler (GCC/Clang)
### Building
You can use the provided build script for a quick start:
```bash
./build.sh
```
Alternatively, you can build using standard CMake commands:
```bash
cmake -B build -G Ninja
cmake --build build
```
Artifacts (libraries and executables) will be output to the `bin/` directory.
## Installation
### Arch Linux
`textparser` is available on the Arch User Repository (AUR). You can install it using an AUR helper like `yay`:
```bash
yay -S textparser
```
Or view the package details at [https://aur.archlinux.org/packages/textparser](https://aur.archlinux.org/packages/textparser).
## Usage
### CLI Tool
The `textparser` CLI tool can be used to parse files and visualize the resulting token tree.
```bash
bin/textparser path/to/file.cfm
```
### C Library Integration
To use TextParser in your C project, include `textparser.h` and link against `libtextparser`.
**Basic Example:**
```c
#include <stdio.h>
#include <textparser.h>

// Assume 'my_lang_definition' is defined elsewhere
extern const textparser_language_definition my_lang_definition;

int main(void) {
    textparser_defer(handle); // Auto-cleanup

    // Open a file
    int err = textparser_openfile("example.txt", TEXTPARSER_ENCODING_LATIN1, &handle);
    if (err) {
        fprintf(stderr, "Failed to open file\n");
        return 1;
    }

    // Parse using the language definition
    err = textparser_parse(handle, &my_lang_definition);
    if (err) {
        fprintf(stderr, "Parse error\n");
        return 1;
    }

    // Iterate through tokens
    for (textparser_token_item *item = textparser_get_first_token(handle); item != NULL; item = item->next) {
        // ... process item ...
    }

    return 0;
}
```
## Language Definition Example
TextParser uses a JSON-based format to define language grammars. This allows for defining complex syntax rules using regular expressions and hierarchical token structures.
Here is a simplified example of what a JSON definition might look like (based on `definitions/json_definition.json`):
```json
{
  "name": "json",
  "version": 1.0,
  "startTokens": ["Object", "Array"],
  "tokens": {
    "Object": {
      "type": "StartStop",
      "startRegex": "{",
      "endRegex": "}",
      "textColor": "0xffd700",
      "nestedTokens": ["Key", "String", "Number", "ValueSeparator"]
    },
    "String": {
      "type": "StartStop",
      "startRegex": "\"",
      "endRegex": "\"",
      "textColor": "0xce9178",
      "nestedTokens": ["StringEscape"]
    },
    "Number": {
      "type": "SimpleToken",
      "startRegex": "-?\\d+(?:\\.\\d+)?",
      "textColor": "0xb5cea8"
    }
  }
}
```
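
One plausible reading of a `StartStop` token is a span that begins where `startRegex` matches and ends where `endRegex` matches, with nested tokens recognized in between. The sketch below illustrates that idea under a loud simplification: it swaps the regular expressions for single literal delimiter characters and does not skip over string contents. `demo_find_span_end` is a hypothetical function and is not part of the TextParser API.

```c
#include <stddef.h>

/* Hypothetical helper: return the index of the delimiter that closes
 * the span opened at text[start], or -1 if the span is not opened at
 * `start` or never closes. Literal characters stand in for the
 * startRegex/endRegex patterns of a real definition. */
long demo_find_span_end(const char *text, size_t start, char open, char close)
{
    if (text[start] != open) {
        return -1;
    }

    size_t depth = 1;  /* the span opened at `start` */
    for (size_t i = start + 1; text[i] != '\0'; i++) {
        /* Test `close` first so that open == close (as for strings)
         * ends the span at the next occurrence. */
        if (text[i] == close) {
            if (--depth == 0) {
                return (long)i;
            }
        } else if (text[i] == open) {
            depth++;  /* a nested span of the same kind */
        }
    }
    return -1;  /* reached the end of the text without closing */
}
```

With the input `{"a": {"b": 1}}`, the outer object opened at index 0 closes at index 14 and the inner object opened at index 6 closes at index 13. A real grammar would also need the `String` token to consume braces that appear inside string literals, which this sketch ignores.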
## Development and Verification
The `python/` directory contains tools for verifying the parser's correctness, particularly for CFML.
- **`validate_cfml.py`**: A validation script that compares the AST generated by this project against reference parsers (e.g., a Java-based CFML parser) to ensure high fidelity and correctness.
```bash
python3 python/validate_cfml.py /path/to/cfml/files
```
## License
See `LICENSE` file for details.