https://github.com/open-technology-foundation/mail-tools
Fast email parsing utilities for extracting headers, message bodies, and cleaning bloat headers. Available as both standalone binaries and bash loadable builtins.
https://github.com/open-technology-foundation/mail-tools
bash bash-scripting mail mail-header mdir
Last synced: about 1 month ago
JSON representation
Fast email parsing utilities for extracting headers, message bodies, and cleaning bloat headers. Available as both standalone binaries and bash loadable builtins.
- Host: GitHub
- URL: https://github.com/open-technology-foundation/mail-tools
- Owner: Open-Technology-Foundation
- License: gpl-3.0
- Created: 2025-10-05T05:10:01.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-10-18T07:00:18.000Z (8 months ago)
- Last Synced: 2026-04-14T18:38:17.432Z (2 months ago)
- Topics: bash, bash-scripting, mail, mail-header, mdir
- Language: Shell
- Homepage: https://yatti.id/
- Size: 3.29 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
README
# Mail Tools
[](https://www.gnu.org/licenses/gpl-3.0)
[](https://github.com/Open-Technology-Foundation/mail-tools/actions)
[](https://github.com/Open-Technology-Foundation/mail-tools/issues)
Fast email parsing utilities for extracting headers, message bodies, and cleaning bloat headers. Available as both standalone binaries and bash loadable builtins.
**Features:**
- Professional directory structure (src/, scripts/, man/, examples/, tools/, build/)
- Clean separation of source code and build artifacts
- Comprehensive test suite with 632 real-world email files
- Automated installation with dependency management
- Bash completions for all utilities (tab completion for options and files)
## Purpose and Use Cases
These utilities solve common email processing challenges in scripts and automation:
**Use Cases:**
- **Email archival**: Strip bloat headers before archiving to reduce storage by ~20% (mailheaderclean)
- **Email parsing**: Extract sender, subject, dates from mailbox files (mailheader + parsing)
- **Mailbox processing**: Process thousands of emails efficiently with bash builtins (10-20x faster)
- **Email splitting**: Separate headers from body for independent processing
- **Privacy**: Remove tracking headers and metadata before forwarding emails
- **Thunderbird integration**: Clean emails while preserving client-specific headers
- **Batch operations**: Clean entire directories of emails in-place
**Why these tools?**
- **Performance**: Bash builtins eliminate fork/exec overhead (~1-2ms → ~0.1ms per call)
- **Simplicity**: Single-purpose utilities that do one thing well
- **RFC 822 compliance**: Proper handling of header continuation lines
- **Flexibility**: Dual implementation (binaries + builtins) for different contexts
- **No dependencies**: Pure C, minimal external requirements
**Comparison with alternatives:**
- **vs formail/reformail**: Focused specifically on header/body extraction and cleaning
- **vs python/perl**: No interpreter overhead, much faster for batch processing
- **vs awk/sed**: Built-in RFC 822 support, easier to use correctly
## Utilities
### mailheader
Extracts email headers (everything up to the first blank line).
- Handles RFC 822 continuation lines (joins lines starting with whitespace)
- Normalizes formatting (removes `\r`, converts tabs to spaces)
- Available as binary and builtin
```bash
mailheader email.eml
mailheader -h # Show help
```
### mailmessage
Extracts email message body (everything after the first blank line).
- Skips the header section entirely
- Preserves message formatting
- Available as binary and builtin
```bash
mailmessage email.eml
mailmessage -h # Show help
```
### mailheaderclean
Filters non-essential email headers from complete email files.
- Removes ~207 bloat headers (Microsoft Exchange, security vendors, tracking, etc.)
- Keeps only the first "Received" header
- Preserves essential routing headers and complete message body
- Supports flexible header filtering via environment variables
- Available as binary and builtin
```bash
mailheaderclean email.eml > cleaned.eml
mailheaderclean -l # List active removal headers
mailheaderclean -h # Show help
```
**Environment variables** (processed in this order):
1. `MAILHEADERCLEAN`: Replace entire removal list with custom headers (or use built-in if not set)
2. `MAILHEADERCLEAN_PRESERVE`: Exclude specific headers from removal (e.g., for Thunderbird features)
3. `MAILHEADERCLEAN_EXTRA`: Add additional headers to the removal list
**Formula:** `(MAILHEADERCLEAN or built-in) - PRESERVE + EXTRA`
**Wildcard patterns supported** (shell glob syntax):
- `X-*` - Match any header starting with X-
- `*-Status` - Match any header ending with -Status
- `X-MS-*` - Match any header starting with X-MS-
- `X-*-Status` - Match X- followed by anything, ending in -Status
### mailgetaddresses
Bash script that extracts email addresses from From, To, and Cc headers in email files.
- Handles various email formats (with/without names, quoted strings, etc.)
- Multiple recipients per header
- Optional output formatting (with names, separated by header type)
- Selective header extraction
- Accepts multiple files and/or directories as arguments
- Processes all files in given directories
- Directory exclusions (default: .Junk, .Trash, .Sent) with override support
- **Name cleaning**: Decodes RFC 2047 encoded names, removes quotes and parenthetical notation
- Removes redundant names (when name equals email address)
```bash
mailgetaddresses email.eml # Extract from single file
mailgetaddresses -n email.eml # Include names with addresses
mailgetaddresses -s email.eml # Separate by header type
mailgetaddresses -H from,to email.eml # Extract only From and To
mailgetaddresses file1.eml file2.eml file3.eml # Multiple files
mailgetaddresses /path/to/maildir/ # Process directory (excludes .Junk,.Trash,.Sent)
mailgetaddresses -x .spam,.Junk /path/to/mail/ # Custom directory exclusions
mailgetaddresses --exclude '' /path/to/mail/ # No exclusions (process all subdirs)
mailgetaddresses email.eml /path/to/maildir/ # Combine files and dirs
mailgetaddresses /path/to/maildir/ | sort -fu # Deduplicate (case-insensitive)
mailgetaddresses -n /path/to/maildir/ | sort -fu > contacts.txt # Build contact list
mailgetaddresses --help # Show help
```
**Name Cleaning Examples:**
```bash
# Input: =?utf-8?Q?Undangan_Pelatihan?=
# Output: Undangan Pelatihan
# Input: 'John Doe'
# Output: John Doe
# Input: Jane Smith (jane@example.com)
# Output: Jane Smith
# Input: bob@test.com
# Output: bob@test.com
```
**Post-Processing: Filtering with Exclude Patterns:**
For excluding specific domains or patterns from the results, use `grep -v -F -f`:
```bash
# Create exclude pattern file
cat > /tmp/exclude-patterns.list </dev/null || true
enable -f mailmessage.so mailmessage 2>/dev/null || true
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true
# Extract headers
mailheader /path/to/email.eml
# Extract message body
mailmessage /path/to/email.eml
# Clean bloat headers from email
mailheaderclean /path/to/email.eml > cleaned.eml
# Split an email into parts
mailheader email.eml > headers.txt
mailmessage email.eml > body.txt
# Remove custom headers
MAILHEADERCLEAN_EXTRA="X-Custom,X-Internal" mailheaderclean email.eml
```
### Cron Jobs
Cron requires explicit setup:
```bash
# Method 1: Source profile
*/15 * * * * bash -c 'source /etc/profile.d/mail-tools.sh; mailheader /path/to/email.eml'
# Method 2: Explicit enable
*/15 * * * * bash -c 'enable -f mailheader.so mailheader; mailheader /path/to/email.eml'
# Method 3: Use standalone binary
*/15 * * * * /usr/local/bin/mailheader /path/to/email.eml
# Method 4: Use production script
0 2 * * * /usr/local/bin/mailheaderclean-batch -d 30 /var/mail/archive 2>/var/log/email-clean.log
```
## Examples
### Basic Usage
Given `test.eml`:
```
From: sender@example.com
To: recipient@example.com
Subject: Test email with
continuation line
Date: Mon, 1 Jan 2025 12:00:00 +0000
This is the message body.
It can span multiple lines.
```
Extract headers:
```bash
$ mailheader test.eml
From: sender@example.com
To: recipient@example.com
Subject: Test email with continuation line
Date: Mon, 1 Jan 2025 12:00:00 +0000
```
Extract message body:
```bash
$ mailmessage test.eml
This is the message body.
It can span multiple lines.
```
Split email into components:
```bash
$ mailheader email.eml > headers.txt
$ mailmessage email.eml > body.txt
$ cat headers.txt body.txt # Reconstruct email
```
### Cleaning Bloat Headers
List currently active removal headers:
```bash
$ mailheaderclean -l
x-ms-exchange-antispam-messagedata-0
x-ms-exchange-antispam-messagedata-chunkcount
...
X-Spam-Score
X-Proofpoint-NotVirusScanned
# With environment variables applied
$ MAILHEADERCLEAN_PRESERVE="List-Unsubscribe,X-Priority" mailheaderclean -l
# Shows built-in list minus preserved headers
$ MAILHEADERCLEAN="X-Spam-Status,DKIM-Signature" mailheaderclean -l
X-Spam-Status
DKIM-Signature
```
Clean single file:
```bash
$ mailheaderclean email.eml > cleaned.eml
```
Use custom removal list entirely:
```bash
$ MAILHEADERCLEAN="X-Spam-Status,Delivered-To" mailheaderclean email.eml
```
Preserve specific headers (e.g., for Thunderbird):
```bash
$ MAILHEADERCLEAN_PRESERVE="List-Unsubscribe,X-Priority" mailheaderclean email.eml
```
Add custom headers to built-in list:
```bash
$ MAILHEADERCLEAN_EXTRA="X-Custom-Header,X-Internal" mailheaderclean email.eml
```
Complex combination:
```bash
$ MAILHEADERCLEAN="DKIM-Signature,List-Unsubscribe" \
MAILHEADERCLEAN_PRESERVE="List-Unsubscribe" \
MAILHEADERCLEAN_EXTRA="X-Custom" mailheaderclean email.eml
# Result: Removes DKIM-Signature and X-Custom, preserves List-Unsubscribe
```
Use wildcard patterns:
```bash
# Remove all X- headers
$ MAILHEADERCLEAN="X-*" mailheaderclean email.eml
# Remove all Microsoft headers
$ MAILHEADERCLEAN_EXTRA="X-Microsoft-*,X-MS-*" mailheaderclean email.eml
# Remove all spam and status headers
$ MAILHEADERCLEAN="X-Spam-*,*-Status" mailheaderclean email.eml
# Complex: remove all X-MS- headers except X-MS-TNEF-Correlator
$ MAILHEADERCLEAN_EXTRA="X-MS-*" \
MAILHEADERCLEAN_PRESERVE="X-MS-TNEF-Correlator" mailheaderclean email.eml
```
### Production Script Examples
Clean single file in-place:
```bash
mailheaderclean-batch email.eml
```
Clean entire directory:
```bash
mailheaderclean-batch /path/to/maildir
```
Clean only recent files (last 7 days):
```bash
mailheaderclean-batch -d 7 /path/to/maildir
```
Clean with custom depth and verbosity:
```bash
mailheaderclean-batch -m 3 -v /path/to/maildir
```
Quiet mode for cron:
```bash
mailheaderclean-batch -q /path/to/maildir
```
### Advanced Examples
Parse headers into associative array:
```bash
#!/bin/bash
declare -A Headers
eval "$(mailgetheaders email.eml)"
echo "From: ${Headers[From]}"
echo "Subject: ${Headers[Subject]}"
echo "Date: ${Headers[Date]}"
echo "File: ${Headers[file]}"
```
Bulk email cleaning with progress:
```bash
#!/bin/bash
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true
export MAILHEADERCLEAN_PRESERVE="List-Unsubscribe,List-Post,X-Priority,Importance"
for email in ~/Maildir/cur/*; do
mailheaderclean "$email" > "/tmp/cleaned/$(basename "$email")"
done
```
Email archive with storage savings:
```bash
#!/bin/bash
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true
total_saved=0
for email in /var/mail/archive/*.eml; do
original_size=$(stat -c%s "$email")
mailheaderclean "$email" > "/archive/clean/$(basename "$email")"
cleaned_size=$(stat -c%s "/archive/clean/$(basename "$email")")
saved=$((original_size - cleaned_size))
total_saved=$((total_saved + saved))
echo "Saved $saved bytes on $(basename "$email")"
done
echo "Total saved: $total_saved bytes"
```
## Exit Codes
All utilities follow standard Unix exit code conventions:
- **0** - Success
- **1** - File error (cannot open/read file)
- **2** - Usage error (incorrect arguments)
Examples:
```bash
mailheader email.eml
echo $? # 0 (success)
mailheader /nonexistent/file
echo $? # 1 (file error)
mailheader
echo $? # 2 (usage error - missing argument)
```
## Performance
The bash builtins eliminate fork/exec overhead:
- **Standalone binaries**: ~1-2ms per call (fork + exec + startup)
- **Bash builtins**: ~0.1ms per call (in-process execution)
For scripts processing many emails, this provides **10-20x speedup**.
### Benchmarking
The project includes comprehensive benchmarking tools to measure performance:
```bash
# Basic performance comparison
tools/benchmark.sh
# Detailed scaling analysis with multiple file counts
tools/benchmark_detailed.sh
```
Both scripts compare builtin vs standalone performance across different file counts. Results typically show:
- **Small files (1-2KB)**: 15-20x speedup with builtins
- **Large files (>100KB)**: 8-12x speedup with builtins
- **Overall**: 10-20x average speedup for typical email processing
### Real-World Performance
Testing with 632 real email files (8.3MB total):
**mailheaderclean cleaning operation:**
- **Standalone**: ~2.5 seconds
- **Builtin**: ~0.15 seconds
- **Speedup**: ~16x faster
**Storage savings:**
- Original size: 8.3MB
- Cleaned size: 6.6MB
- **Savings**: 1.7MB (~20% reduction)
## Testing
The project includes a comprehensive test suite with **632 real email files** from various sources.
### Running Tests
```bash
cd tests
# Master test runner (runs all tests)
./test_master.sh
# Repository structure and build tests
./test_structure.sh # Validate directory structure
./test_build_system.sh # Test Makefile and build process
./test_installation.sh # Test install.sh functionality
# Comprehensive tests (all 632 files)
./test_all_mailheader.sh # Test header extraction
./test_all_mailmessage.sh # Test message body extraction
./test_all_mailheaderclean.sh # Test header cleaning
# Quick functionality tests
./test_simple.sh # Basic functionality
./test_builtin_vs_standalone.sh # Verify identical output
./validate_email_format.sh # RFC 822 compliance
# Script tests
./test_mailgetaddresses.sh # Test address extraction
./test_mailgetheaders.sh # Test header parsing
# Environment variable tests
./test_env_vars.sh
./test_mailheaderclean.sh
```
### Test Results
All tests pass with 100% success rate on critical functionality:
- ✓ **632/632** files produce valid header extraction
- ✓ **632/632** files produce valid message extraction
- ✓ **632/632** files produce valid cleaned output
- ✓ **632/632** files maintain valid RFC 822 format after cleaning
- ✓ **100%** identical output between standalone and builtin versions
- ✓ All environment variable combinations work correctly
See `tests/README.md` for detailed test documentation.
## Build Targets
The build system uses organized directories for clean separation of source and artifacts:
```bash
make # Build all utilities (output to build/)
make all-mailheader # Build mailheader only
make all-mailmessage # Build mailmessage only
make all-mailheaderclean # Build mailheaderclean only
make standalone # Build all standalone binaries
make loadable # Build all builtins
make clean # Remove build/ directory
sudo make install # Install all utilities system-wide
sudo make uninstall # Remove all installed files
make help # Show all available targets
```
Build artifacts are organized in `build/`:
- `build/bin/` - Standalone executables
- `build/lib/` - Loadable builtins (.so files)
- `build/obj/` - Object files
## How the Builtins Work
The builtins are automatically available in interactive shells via `/etc/profile.d/mail-tools.sh`:
1. Sets `BASH_LOADABLES_PATH=/usr/local/lib/bash/loadables`
2. Auto-loads builtins in interactive shells:
- `enable -f mailheader.so mailheader`
- `enable -f mailmessage.so mailmessage`
- `enable -f mailheaderclean.so mailheaderclean`
3. Non-interactive contexts (scripts, cron) must explicitly enable them
The builtins seamlessly integrate with bash, appearing identical to native commands while providing significant performance benefits.
## Bash Completions
All mail-tools utilities include intelligent bash completions for improved usability. Completions are automatically installed and become available in new bash sessions.
**Features:**
- Tab completion for command options (e.g., `-h`, `--help`, `-l`, `-n`, `-s`)
- File and directory path completion
- Smart option-specific suggestions:
- `mailgetaddresses -H ` suggests common header names (from, to, cc, all, etc.)
- `mailgetaddresses -x ` suggests common exclusion patterns (.Junk, .Trash, .Sent)
- `mailheaderclean ` completes with `-l` and `-h` options
- `mailheaderclean-batch ` completes with `-d`, `-m`, `-v`, `-q` options
**Usage:**
```bash
# Type command and press TAB to see available options
mailheaderclean -
# Shows: -l -h --help
# Complete header options
mailgetaddresses -H
# Shows: from to cc bcc reply-to from,to from,to,cc from,cc all
# Complete exclusion patterns
mailgetaddresses -x
# Shows: .Junk .Trash .Sent .Drafts .Spam .Junk,.Trash .Junk,.Trash,.Sent
# File path completion works for all utilities
mailheader /path/to/
# Shows available files and directories
```
**Activation:**
- **New sessions**: Completions are automatically available after installation
- **Current session**: Run `source /usr/local/share/bash-completion/completions/mail-tools`
- **Verify**: Type `mailheaderclean -` to test completion
**Supported utilities:**
- `mailheader` - Options: `-h`, `--help`
- `mailmessage` - Options: `-h`, `--help`
- `mailheaderclean` - Options: `-l`, `-h`, `--help`
- `mailgetaddresses` - Options: `-n`, `-s`, `-H`, `-x`, `-h`, `--help` (with smart suggestions)
- `mailgetheaders` - Options: `-h`, `--help`, `-V`, `--version`
- `mailheaderclean-batch` - Options: `-d`, `-m`, `-v`, `-q`, `-V`, `--version`, `-h`, `--help`
- `clean-email-headers` - Same as mailheaderclean-batch (symlink support)
## Project Structure
```
.
├── src/ # Source code
│ ├── mailheader.c # mailheader standalone binary
│ ├── mailheader_loadable.c # mailheader bash builtin
│ ├── mailmessage.c # mailmessage standalone binary
│ ├── mailmessage_loadable.c # mailmessage bash builtin
│ ├── mailheaderclean.c # mailheaderclean standalone binary
│ ├── mailheaderclean_loadable.c # mailheaderclean bash builtin
│ └── mailheaderclean_headers.h # Shared header removal list (~207 headers)
├── scripts/ # Bash scripts
│ ├── mailgetaddresses # Address extraction script
│ ├── mailgetheaders # Header parsing script
│ ├── mailheaderclean-batch # Production batch cleaning script
│ ├── mail-tools.sh # Profile script for auto-loading builtins
│ └── check-installation.sh # Diagnostic script for verifying installation
├── man/ # Manual pages
│ ├── mailheader.1
│ ├── mailmessage.1
│ ├── mailheaderclean.1
│ └── mailgetaddresses.1
├── examples/ # Sample email files
│ ├── test.eml
│ └── test-bloat.eml
├── tools/ # Benchmarking utilities
│ ├── benchmark.sh
│ └── benchmark_detailed.sh
├── build/ # Build artifacts (generated)
│ ├── bin/ # Compiled binaries
│ │ ├── mailheader
│ │ ├── mailmessage
│ │ └── mailheaderclean
│ ├── lib/ # Loadable builtins
│ │ ├── mailheader.so
│ │ ├── mailmessage.so
│ │ └── mailheaderclean.so
│ └── obj/ # Object files
│ ├── mailheader.o
│ ├── mailmessage.o
│ └── mailheaderclean.o
├── tests/ # Test suite (632 email files)
│ ├── test_master.sh # Master test runner
│ ├── test_structure.sh # Repository structure validation
│ ├── test_build_system.sh # Build system validation
│ ├── test_installation.sh # Installation script testing
│ ├── test_mailgetheaders.sh # Header parsing script tests
│ ├── test_all_mailheader.sh # Comprehensive header tests
│ ├── test_all_mailmessage.sh # Comprehensive message tests
│ ├── test_all_mailheaderclean.sh # Comprehensive cleaning tests
│ ├── test_*.sh # Additional functionality tests
│ └── test-data/ # 632 real email files
├── mail-tools.bash_completions # Bash completion definitions
├── Makefile # Build system
├── install.sh # Installation script
├── README.md # User documentation
└── LICENSE # GPL v3.0
```
## FAQ
**Q: Do I need both the binaries and builtins?**
A: The installation script installs both by default. Use binaries for general scripts and builtins for high-performance batch processing.
**Q: Will this work with my maildir?**
A: Yes, these tools work with any RFC 822 compliant email format, including Maildir, mbox, and individual .eml files.
**Q: How do I customize which headers to remove?**
A: Use the `MAILHEADERCLEAN_*` environment variables to control header filtering. See examples above.
**Q: Are timestamps preserved when cleaning emails?**
A: Yes, when using `mailheaderclean-batch` script (or the backwards-compatible `clean-email-headers` symlink). For manual operations, use `touch -r` to preserve timestamps.
**Q: Can I use this in production?**
A: Yes, all utilities are production-ready with comprehensive testing and error handling. The test suite validates against 632 real-world emails.
## License
GNU General Public License v3.0 or later. See LICENSE file for details.
## Contributing
Issues and pull requests welcome at [github.com/Open-Technology-Foundation/mail-tools](https://github.com/Open-Technology-Foundation/mail-tools)
## Credits
Developed by the Open Technology Foundation for efficient email processing in bash environments.