An open API service indexing awesome lists of open source software.

https://github.com/open-technology-foundation/mailheader

Very fast email parsing utilities for extracting headers and message bodies. Available as both standalone binaries and bash loadable builtins.
https://github.com/open-technology-foundation/mailheader

bash bash-scripting mail mail-header

Last synced: 9 months ago
JSON representation

Very fast email parsing utilities for extracting headers and message bodies. Available as both standalone binaries and bash loadable builtins.

Awesome Lists containing this project

README

          

# Mail Tools

Fast email parsing utilities for extracting headers, message bodies, and cleaning bloat headers. Available as both standalone binaries and bash loadable builtins.

## Purpose and Use Cases

These utilities solve common email processing challenges in scripts and automation:

**Use Cases:**
- **Email archival**: Strip bloat headers before archiving to reduce storage by ~20% (mailheaderclean)
- **Email parsing**: Extract sender, subject, dates from mailbox files (mailheader + parsing)
- **Mailbox processing**: Process thousands of emails efficiently with bash builtins (10-20x faster)
- **Email splitting**: Separate headers from body for independent processing
- **Privacy**: Remove tracking headers and metadata before forwarding emails
- **Thunderbird integration**: Clean emails while preserving client-specific headers
- **Batch operations**: Clean entire directories of emails in-place

**Why these tools?**
- **Performance**: Bash builtins eliminate fork/exec overhead (~1-2ms → ~0.1ms per call)
- **Simplicity**: Single-purpose utilities that do one thing well
- **RFC 822 compliance**: Proper handling of header continuation lines
- **Flexibility**: Dual implementation (binaries + builtins) for different contexts
- **No dependencies**: Pure C, minimal external requirements

**Comparison with alternatives:**
- **vs formail/reformail**: Focused specifically on header/body extraction and cleaning
- **vs python/perl**: No interpreter overhead, much faster for batch processing
- **vs awk/sed**: Built-in RFC 822 support, easier to use correctly

## Utilities

### mailheader
Extracts email headers (everything up to the first blank line).

- Handles RFC 822 continuation lines (joins lines starting with whitespace)
- Normalizes formatting (removes `\r`, converts tabs to spaces)
- Available as binary and builtin

```bash
mailheader email.eml
mailheader -h # Show help
```

### mailmessage
Extracts email message body (everything after the first blank line).

- Skips the header section entirely
- Preserves message formatting
- Available as binary and builtin

```bash
mailmessage email.eml
mailmessage -h # Show help
```

### mailheaderclean
Filters non-essential email headers from complete email files.

- Removes ~207 bloat headers (Microsoft Exchange, security vendors, tracking, etc.)
- Keeps only the first "Received" header
- Preserves essential routing headers and complete message body
- Supports flexible header filtering via environment variables
- Available as binary and builtin

```bash
mailheaderclean email.eml > cleaned.eml
mailheaderclean -h # Show help
```

**Environment variables:**
- `MAILHEADERCLEAN`: Replace entire removal list with custom headers
- `MAILHEADERCLEAN_PRESERVE`: Exclude specific headers from removal (e.g., for Thunderbird features)
- `MAILHEADERCLEAN_EXTRA`: Add additional headers to the built-in removal list

### mailgetaddresses
Extracts email addresses from From, To, and Cc headers in email files.

- Handles various email formats (with/without names, quoted strings, etc.)
- Multiple recipients per header
- Optional output formatting (with names, separated by header type)
- Selective header extraction
- Accepts multiple files and/or directories as arguments
- Processes all files in given directories
- Directory exclusions (default: .Junk, .Trash, .Sent) with override support
- **Name cleaning**: Decodes RFC 2047 encoded names, removes quotes and parenthetical notation
- Removes redundant names (when name equals email address)

```bash
mailgetaddresses email.eml # Extract from single file
mailgetaddresses -n email.eml # Include names with addresses
mailgetaddresses -s email.eml # Separate by header type
mailgetaddresses -H from,to email.eml # Extract only From and To
mailgetaddresses file1.eml file2.eml file3.eml # Multiple files
mailgetaddresses /path/to/maildir/ # Process directory (excludes .Junk,.Trash,.Sent)
mailgetaddresses -x .spam,.Junk /path/to/mail/ # Custom directory exclusions
mailgetaddresses --exclude '' /path/to/mail/ # No exclusions (process all subdirs)
mailgetaddresses email.eml /path/to/maildir/ # Combine files and dirs
mailgetaddresses /path/to/maildir/ | sort -fu # Deduplicate (case-insensitive)
mailgetaddresses -n /path/to/maildir/ | sort -fu > contacts.txt # Build contact list
mailgetaddresses --help # Show help
```

**Name Cleaning Examples:**
```bash
# Input: =?utf-8?Q?Undangan_Pelatihan?=
# Output: Undangan Pelatihan

# Input: 'John Doe'
# Output: John Doe

# Input: Jane Smith (jane@example.com)
# Output: Jane Smith

# Input: bob@test.com
# Output: bob@test.com
```

### clean-email-headers
Production script for batch cleaning of email files or directories in-place.

- Process single files or entire directories
- Age filtering with `-d/--days` option
- Configurable directory traversal depth
- Preserves timestamps and permissions
- Progress reporting and error handling

```bash
clean-email-headers email.eml # Clean single file
clean-email-headers /path/to/maildir # Clean all files in directory
clean-email-headers -d 7 /path/to/maildir # Only files from last 7 days
clean-email-headers -m 2 /path/to/maildir # Traverse 2 levels deep
clean-email-headers -h # Show help
```

All utilities support:
- **Help options**: `-h` or `--help` for usage information
- **Consistent exit codes**: 0 (success), 1 (file error), 2 (usage error)
- **Dual implementation**:
- Standalone binaries for general use
- Bash loadable builtins for high-performance scripting (10-20x faster)

## Installation

### Prerequisites

For the bash loadable builtin:
```bash
sudo apt-get install bash-builtins # Ubuntu/Debian
```

### One-Liner Install

Quick installation without keeping the source:

```bash
git clone https://github.com/Open-Technology-Foundation/mailheader.git && cd mailheader && sudo ./install.sh --builtin && cd .. && rm -rf mailheader
```

This will clone, install, and clean up the source in one command.

### Getting the Source

For manual installation or to keep the source:

```bash
git clone https://github.com/Open-Technology-Foundation/mailheader.git
cd mailheader
```

**Optional**: Keep a system-wide copy of the source for future updates or rebuilds:

```bash
# Move to traditional source location (optional)
sudo mv mailheader /usr/local/src/
cd /usr/local/src/mailheader
```

**Note**: The source code is not required after installation. You can delete the cloned directory after running `install.sh` and re-clone from GitHub if needed later.

### Quick Install (Recommended)

Use the installation script for an interactive, automated installation:

```bash
sudo ./install.sh
```

The script will:
- Check prerequisites and build all utilities
- Prompt whether to install the bash builtins (optional)
- Install all files to the correct locations
- Update the man database

**Options:**
```bash
sudo ./install.sh --help # Show all options
sudo ./install.sh --builtin # Force builtins (auto-installs dependencies)
sudo ./install.sh --no-builtin # Skip builtin installation
sudo ./install.sh --uninstall # Remove installation
sudo ./install.sh --prefix=/opt # Custom install location
sudo ./install.sh --dry-run # Preview without installing
```

**Note:** Using `--builtin` will automatically install the `bash-builtins` package if it's not already present (Debian/Ubuntu).

### Manual Build and Install

Alternatively, use make directly:

```bash
# Build all utilities
make

# Install system-wide (requires sudo)
sudo make install
```

This installs:
- Standalone binaries: `/usr/local/bin/{mailheader,mailmessage,mailheaderclean}`
- Production script: `/usr/local/bin/clean-email-headers`
- Loadable builtins: `/usr/local/lib/bash/loadables/{mailheader,mailmessage,mailheaderclean}.so`
- Auto-load script: `/etc/profile.d/mail-tools.sh`
- Manpages: `/usr/local/share/man/man1/{mailheader,mailmessage,mailheaderclean}.1`
- Documentation: `/usr/local/share/doc/mailheader/`

### Verify Installation

```bash
# Check standalone binaries
which mailheader mailmessage mailheaderclean clean-email-headers

# Check builtins (after opening new shell or sourcing profile)
enable -a | grep mail

# View help
mailheader -h
mailmessage -h
mailheaderclean -h
clean-email-headers -h

# View manpages
man mailheader
man mailmessage
man mailheaderclean

# Get help for builtins
help mailheader
help mailmessage
help mailheaderclean
```

## Usage

### Interactive Shell

The builtins are automatically loaded in new bash sessions:
```bash
mailheader email.eml
mailmessage email.eml
mailheaderclean email.eml
```

### Scripts

Scripts must explicitly enable the builtins:
```bash
#!/bin/bash
enable -f mailheader.so mailheader 2>/dev/null || true
enable -f mailmessage.so mailmessage 2>/dev/null || true
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true

# Extract headers
mailheader /path/to/email.eml

# Extract message body
mailmessage /path/to/email.eml

# Clean bloat headers from email
mailheaderclean /path/to/email.eml > cleaned.eml

# Split an email into parts
mailheader email.eml > headers.txt
mailmessage email.eml > body.txt

# Remove custom headers
MAILHEADERCLEAN_EXTRA="X-Custom,X-Internal" mailheaderclean email.eml
```

### Cron Jobs

Cron requires explicit setup:
```bash
# Method 1: Source profile
*/15 * * * * bash -c 'source /etc/profile.d/mail-tools.sh; mailheader /path/to/email.eml'

# Method 2: Explicit enable
*/15 * * * * bash -c 'enable -f mailheader.so mailheader; mailheader /path/to/email.eml'

# Method 3: Use standalone binary
*/15 * * * * /usr/local/bin/mailheader /path/to/email.eml

# Method 4: Use production script
0 2 * * * /usr/local/bin/clean-email-headers -d 30 /var/mail/archive 2>/var/log/email-clean.log
```

## Examples

### Basic Usage

Given `test.eml`:
```
From: sender@example.com
To: recipient@example.com
Subject: Test email with
continuation line
Date: Mon, 1 Jan 2025 12:00:00 +0000

This is the message body.
It can span multiple lines.
```

Extract headers:
```bash
$ mailheader test.eml
From: sender@example.com
To: recipient@example.com
Subject: Test email with continuation line
Date: Mon, 1 Jan 2025 12:00:00 +0000
```

Extract message body:
```bash
$ mailmessage test.eml
This is the message body.
It can span multiple lines.
```

Split email into components:
```bash
$ mailheader email.eml > headers.txt
$ mailmessage email.eml > body.txt
$ cat headers.txt body.txt # Reconstruct email
```

### Cleaning Bloat Headers

Clean single file:
```bash
$ mailheaderclean email.eml > cleaned.eml
```

Use custom removal list entirely:
```bash
$ MAILHEADERCLEAN="X-Spam-Status,Delivered-To" mailheaderclean email.eml
```

Preserve specific headers (e.g., for Thunderbird):
```bash
$ MAILHEADERCLEAN_PRESERVE="List-Unsubscribe,X-Priority" mailheaderclean email.eml
```

Add custom headers to built-in list:
```bash
$ MAILHEADERCLEAN_EXTRA="X-Custom-Header,X-Internal" mailheaderclean email.eml
```

Complex combination:
```bash
$ MAILHEADERCLEAN="DKIM-Signature,List-Unsubscribe" \
MAILHEADERCLEAN_PRESERVE="List-Unsubscribe" \
MAILHEADERCLEAN_EXTRA="X-Custom" mailheaderclean email.eml
# Result: Removes DKIM-Signature and X-Custom, preserves List-Unsubscribe
```

### Production Script Examples

Clean single file in-place:
```bash
clean-email-headers email.eml
```

Clean entire directory:
```bash
clean-email-headers /path/to/maildir
```

Clean only recent files (last 7 days):
```bash
clean-email-headers -d 7 /path/to/maildir
```

Clean with custom depth and verbosity:
```bash
clean-email-headers -m 3 -v /path/to/maildir
```

Quiet mode for cron:
```bash
clean-email-headers -q /path/to/maildir
```

### Advanced Examples

Parse headers into associative array:
```bash
#!/bin/bash
source /usr/local/share/doc/mailheader/mailgetheaders

declare -A Headers
mailgetheaders Headers email.eml

declare -p Headers
echo "From: ${Headers[From]}"
echo "Subject: ${Headers[Subject]}"
echo "Date: ${Headers[Date]}"
```

Bulk email cleaning with progress:
```bash
#!/bin/bash
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true

export MAILHEADERCLEAN_PRESERVE="List-Unsubscribe,List-Post,X-Priority,Importance"

for email in ~/Maildir/cur/*; do
mailheaderclean "$email" > "/tmp/cleaned/$(basename "$email")"
done
```

Email archive with storage savings:
```bash
#!/bin/bash
enable -f mailheaderclean.so mailheaderclean 2>/dev/null || true

total_saved=0
for email in /var/mail/archive/*.eml; do
original_size=$(stat -c%s "$email")
mailheaderclean "$email" > "/archive/clean/$(basename "$email")"
cleaned_size=$(stat -c%s "/archive/clean/$(basename "$email")")
saved=$((original_size - cleaned_size))
total_saved=$((total_saved + saved))
echo "Saved $saved bytes on $(basename "$email")"
done
echo "Total saved: $total_saved bytes"
```

## Exit Codes

All utilities follow standard Unix exit code conventions:

- **0** - Success
- **1** - File error (cannot open/read file)
- **2** - Usage error (incorrect arguments)

Examples:
```bash
mailheader email.eml
echo $? # 0 (success)

mailheader /nonexistent/file
echo $? # 1 (file error)

mailheader
echo $? # 2 (usage error - missing argument)
```

## Performance

The bash builtins eliminate fork/exec overhead:

- **Standalone binaries**: ~1-2ms per call (fork + exec + startup)
- **Bash builtins**: ~0.1ms per call (in-process execution)

For scripts processing many emails, this provides **10-20x speedup**.

### Benchmarking

The project includes comprehensive benchmarking tools to measure performance:

```bash
# Basic performance comparison
./benchmark.sh

# Detailed scaling analysis with multiple file counts
./benchmark_detailed.sh
```

Both scripts compare builtin vs standalone performance across different file counts. Results typically show:
- **Small files (1-2KB)**: 15-20x speedup with builtins
- **Large files (>100KB)**: 8-12x speedup with builtins
- **Overall**: 10-20x average speedup for typical email processing

### Real-World Performance

Testing with 632 real email files (8.3MB total):

**mailheaderclean cleaning operation:**
- **Standalone**: ~2.5 seconds
- **Builtin**: ~0.15 seconds
- **Speedup**: ~16x faster

**Storage savings:**
- Original size: 8.3MB
- Cleaned size: 6.6MB
- **Savings**: 1.7MB (~20% reduction)

## Testing

The project includes a comprehensive test suite with **632 real email files** from various sources.

### Running Tests

```bash
cd tests

# Comprehensive tests (all 632 files)
./test_all_mailheader.sh # Test header extraction
./test_all_mailmessage.sh # Test message body extraction
./test_all_mailheaderclean.sh # Test header cleaning

# Quick functionality tests
./test_simple.sh # Basic functionality
./test_builtin_vs_standalone.sh # Verify identical output
./validate_email_format.sh # RFC 822 compliance

# Environment variable tests
./test_env_vars.sh
./test_mailheaderclean.sh
```

### Test Results

All tests pass with 100% success rate on critical functionality:

- ✓ **632/633** files produce valid header extraction
- ✓ **632/633** files produce valid message extraction
- ✓ **632/633** files produce valid cleaned output
- ✓ **632/632** files maintain valid RFC 822 format after cleaning
- ✓ **100%** identical output between standalone and builtin versions
- ✓ All environment variable combinations work correctly

See `tests/README.md` for detailed test documentation.

## Build Targets

```bash
make # Build all utilities (both versions)
make all-mailheader # Build mailheader only
make all-mailmessage # Build mailmessage only
make all-mailheaderclean # Build mailheaderclean only
make standalone # Build all standalone binaries
make loadable # Build all builtins
make clean # Remove build artifacts
sudo make install # Install all utilities
sudo make uninstall # Remove all installed files
make help # Show all available targets
```

## How the Builtins Work

The builtins are automatically available in interactive shells via `/etc/profile.d/mail-tools.sh`:

1. Sets `BASH_LOADABLES_PATH=/usr/local/lib/bash/loadables`
2. Auto-loads builtins in interactive shells:
- `enable -f mailheader.so mailheader`
- `enable -f mailmessage.so mailmessage`
- `enable -f mailheaderclean.so mailheaderclean`
3. Non-interactive contexts (scripts, cron) must explicitly enable them

The builtins seamlessly integrate with bash, appearing identical to native commands while providing significant performance benefits.

## Project Structure

```
.
├── mailheader.c # Standalone binary
├── mailheader_loadable.c # Bash builtin
├── mailmessage.c # Standalone binary
├── mailmessage_loadable.c # Bash builtin
├── mailheaderclean.c # Standalone binary
├── mailheaderclean_loadable.c # Bash builtin
├── mailheaderclean_headers.h # Shared header removal list (~207 headers)
├── clean-email-headers # Production batch cleaning script
├── mail-tools.sh # Profile script for auto-loading
├── mailgetheaders # Example bash function
├── Makefile # Build system
├── install.sh # Installation script
├── benchmark.sh # Basic benchmarking
├── benchmark_detailed.sh # Detailed benchmarking
├── *.1 # Man pages
└── tests/ # Test suite (632 email files)
├── test_all_*.sh # Comprehensive tests
├── test_*.sh # Functionality tests
└── test-data/ # 632 real email files
```

## FAQ

**Q: Do I need both the binaries and builtins?**
A: The installation script installs both by default. Use binaries for general scripts and builtins for high-performance batch processing.

**Q: Will this work with my maildir?**
A: Yes, these tools work with any RFC 822 compliant email format, including Maildir, mbox, and individual .eml files.

**Q: How do I customize which headers to remove?**
A: Use the `MAILHEADERCLEAN_*` environment variables to control header filtering. See examples above.

**Q: Are timestamps preserved when cleaning emails?**
A: Yes, when using `clean-email-headers` script. For manual operations, use `touch -r` to preserve timestamps.

**Q: Can I use this in production?**
A: Yes, all utilities are production-ready with comprehensive testing and error handling. The test suite validates against 632 real-world emails.

## License

GNU General Public License v3.0 or later. See LICENSE file for details.

## Contributing

Issues and pull requests welcome at [github.com/Open-Technology-Foundation/mailheader](https://github.com/Open-Technology-Foundation/mailheader)

## Credits

Developed by the Open Technology Foundation for efficient email processing in bash environments.