Fast and robust e-mail parsing library for Rust
https://github.com/stalwartlabs/mail-parser

# mail-parser

[![crates.io](https://img.shields.io/crates/v/mail-parser)](https://crates.io/crates/mail-parser)
[![build](https://github.com/stalwartlabs/mail-parser/actions/workflows/rust.yml/badge.svg)](https://github.com/stalwartlabs/mail-parser/actions/workflows/rust.yml)
[![docs.rs](https://img.shields.io/docsrs/mail-parser)](https://docs.rs/mail-parser)
[![crates.io](https://img.shields.io/crates/l/mail-parser)](http://www.apache.org/licenses/LICENSE-2.0)

_mail-parser_ is an **e-mail parsing library** written in Rust that fully conforms to the Internet Message Format standard (_RFC 5322_), the
Multipurpose Internet Mail Extensions (MIME; _RFC 2045 - 2049_) as well as many other [internet messaging RFCs](#conformed-rfcs).

It also supports decoding messages in [41 different character sets](#supported-character-sets) including obsolete formats such as UTF-7.
All Unicode (UTF-*) and single-byte character sets are handled internally by the library while support for legacy multi-byte encodings of Chinese
and Japanese languages such as BIG5 or ISO-2022-JP is provided by the optional dependency [encoding_rs](https://crates.io/crates/encoding_rs).
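
As a rough illustration of why single-byte character sets are cheap to handle without an external dependency, the sketch below decodes ISO-8859-1 and ISO-8859-15 with a plain byte-to-code-point mapping. The `Charset` enum and the `decode_single_byte` function are hypothetical names chosen for this example; they are not part of the _mail-parser_ API, and the library's own decoders may be structured differently.

```rust
/// Hypothetical charset selector used only in this sketch.
#[derive(Clone, Copy)]
enum Charset {
    Latin1, // ISO-8859-1
    Latin9, // ISO-8859-15
}

/// Decodes a single-byte encoded buffer into a UTF-8 `String`.
fn decode_single_byte(bytes: &[u8], charset: Charset) -> String {
    bytes
        .iter()
        .map(|&b| match (charset, b) {
            // ISO-8859-15 replaces eight ISO-8859-1 positions.
            (Charset::Latin9, 0xA4) => '€',
            (Charset::Latin9, 0xA6) => 'Š',
            (Charset::Latin9, 0xA8) => 'š',
            (Charset::Latin9, 0xB4) => 'Ž',
            (Charset::Latin9, 0xB8) => 'ž',
            (Charset::Latin9, 0xBC) => 'Œ',
            (Charset::Latin9, 0xBD) => 'œ',
            (Charset::Latin9, 0xBE) => 'Ÿ',
            // Every other byte maps to the Unicode code point of the same value.
            _ => char::from(b),
        })
        .collect()
}
```

Calling `decode_single_byte(b"Caf\xe9", Charset::Latin1)` yields `"Café"`. Multi-byte encodings such as BIG5 need state and large tables, which is presumably why they are delegated to `encoding_rs`.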

In general, this library abides by Postel's law, also known as the [Robustness Principle](https://en.wikipedia.org/wiki/Robustness_principle), which
states that an implementation should be conservative in its sending behavior and liberal in its receiving behavior. This means that
_mail-parser_ will make a best effort to parse non-conformant e-mail messages as long as they do not deviate too much from the standard.

Unlike other e-mail parsing libraries that return nested representations of the different MIME parts in a message, this library
conforms to [RFC 8621, Section 4.1.4](https://datatracker.ietf.org/doc/html/rfc8621#section-4.1.4) and provides a more human-friendly
representation of the message contents consisting of just text body parts, HTML body parts and attachments. Additionally, inline body parts
are converted between HTML and plain text automatically when the _alternative_ version is missing.
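
To make the flattened representation more concrete, here is a heavily simplified sketch of the idea behind RFC 8621, Section 4.1.4. It is not the library's implementation: the `Part` and `Flat` types and the `flatten` function are hypothetical, and the sketch ignores dispositions, `multipart/related` and alternatives nested inside other multiparts, all of which the RFC algorithm takes into account.

```rust
/// A deliberately tiny MIME tree; real messages carry headers, encodings and more.
enum Part {
    Text(&'static str),       // inline text/plain
    Html(&'static str),       // inline text/html
    Attachment(&'static str), // anything treated as an attachment
    Multipart { alternative: bool, children: Vec<Part> },
}

/// The three flat lists exposed instead of the nested tree.
#[derive(Default)]
struct Flat {
    text_body: Vec<&'static str>,
    html_body: Vec<&'static str>,
    attachments: Vec<&'static str>,
}

/// Walks `parts` and sorts every leaf into one or more of the flat lists.
fn flatten(parts: &[Part], in_alternative: bool, out: &mut Flat) {
    let text_start = out.text_body.len();
    let html_start = out.html_body.len();

    for part in parts {
        match part {
            Part::Multipart { alternative, children } => flatten(children, *alternative, out),
            Part::Attachment(name) => out.attachments.push(*name),
            Part::Text(body) => {
                out.text_body.push(*body);
                // Outside multipart/alternative an inline part belongs to both views.
                if !in_alternative {
                    out.html_body.push(*body);
                }
            }
            Part::Html(body) => {
                out.html_body.push(*body);
                if !in_alternative {
                    out.text_body.push(*body);
                }
            }
        }
    }

    // If an alternative offered only one flavour, reuse it for the other view.
    if in_alternative {
        let found_text = out.text_body.len() > text_start;
        let found_html = out.html_body.len() > html_start;
        if found_html && !found_text {
            let html_only = out.html_body[html_start..].to_vec();
            out.text_body.extend(html_only);
        } else if found_text && !found_html {
            let text_only = out.text_body[text_start..].to_vec();
            out.html_body.extend(text_only);
        }
    }
}
```

In this sketch a part that is reused for the other view is listed unchanged; the conversion between HTML and plain text described above would happen when such a part is read.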

Performance and memory safety were two important factors while designing _mail-parser_:

- **Zero-copy**: Practically all strings returned by this library are `Cow` references to the input raw message.
- **High performance Base64 decoding** based on Chromium's decoder ([the fastest non-SIMD decoder](https://github.com/lemire/fastbase64)).
- **Fast parsing** of message header fields, character set names and HTML entities using [perfect hashing](https://en.wikipedia.org/wiki/Perfect_hash_function).
- Written in **100% safe** Rust with no external dependencies.
- Every function in the library has been [fuzzed](#testing-fuzzing--benchmarking) and thoroughly [tested with MIRI](#testing-fuzzing--benchmarking).
- **Battle-tested** with millions of real-world e-mail messages dating from 1995 until today.
- Used in production environments worldwide by [Stalwart Mail Server](https://github.com/stalwartlabs/mail-server).
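
The zero-copy point can be illustrated with a minimal sketch of the `Cow` pattern, using only the standard library. The `unfold` function below is a hypothetical helper, not a _mail-parser_ API: it borrows from the input when a header value needs no changes and allocates only when line breaks have to be removed.

```rust
use std::borrow::Cow;

/// Removes the CR and LF characters of a folded header value.
/// Returns a borrowed slice of the input when nothing has to change.
fn unfold(value: &str) -> Cow<'_, str> {
    if !value.contains('\n') && !value.contains('\r') {
        // Fast path: no allocation, the result points into the raw message.
        return Cow::Borrowed(value);
    }
    // Slow path: copy everything except the line-break characters.
    let unfolded: String = value.chars().filter(|c| *c != '\r' && *c != '\n').collect();
    Cow::Owned(unfolded)
}
```

With this shape, callers treat both cases uniformly as a string, and the common case of an unfolded header costs no copy at all.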

## Usage Example

```rust
use mail_parser::*;

let input = br#"From: Art Vandelay (Vandelay Industries)
To: "Colleagues": "James Smythe" ; Friends:
    [email protected], =?UTF-8?Q?John_Sm=C3=AEth?= ;
Date: Sat, 20 Nov 2021 14:22:01 -0800
Subject: Why not both importing AND exporting? =?utf-8?b?4pi6?=
Content-Type: multipart/mixed; boundary="festivus";

--festivus
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: base64

PGh0bWw+PHA+SSB3YXMgdGhpbmtpbmcgYWJvdXQgcXVpdHRpbmcgdGhlICZsZHF1bztle
HBvcnRpbmcmcmRxdW87IHRvIGZvY3VzIGp1c3Qgb24gdGhlICZsZHF1bztpbXBvcnRpbm
cmcmRxdW87LDwvcD48cD5idXQgdGhlbiBJIHRob3VnaHQsIHdoeSBub3QgZG8gYm90aD8
gJiN4MjYzQTs8L3A+PC9odG1sPg==
--festivus
Content-Type: message/rfc822

From: "Cosmo Kramer"
Subject: Exporting my book about coffee tables
Content-Type: multipart/mixed; boundary="giddyup";

--giddyup
Content-Type: text/plain; charset="utf-16"
Content-Transfer-Encoding: quoted-printable

=FF=FE=0C!5=D8"=DD5=D8)=DD5=D8-=DD =005=D8*=DD5=D8"=DD =005=D8"=
=DD5=D85=DD5=D8-=DD5=D8,=DD5=D8/=DD5=D81=DD =005=D8*=DD5=D86=DD =
=005=D8=1F=DD5=D8,=DD5=D8,=DD5=D8(=DD =005=D8-=DD5=D8)=DD5=D8"=
=DD5=D8=1E=DD5=D80=DD5=D8"=DD!=00
--giddyup
Content-Type: image/gif; name*1="about "; name*0="Book ";
    name*2*=utf-8''%e2%98%95 tables.gif
Content-Transfer-Encoding: Base64
Content-Disposition: attachment

R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7
--giddyup--
--festivus--
"#;

let message = MessageParser::default().parse(input).unwrap();

// Parses addresses (including comments), lists and groups
assert_eq!(
    message.from().unwrap().first().unwrap(),
    &Addr::new(
        "Art Vandelay (Vandelay Industries)".into(),
        "[email protected]"
    )
);

assert_eq!(
    message.to().unwrap().as_group().unwrap(),
    &[
        Group::new(
            "Colleagues",
            vec![Addr::new("James Smythe".into(), "[email protected]")]
        ),
        Group::new(
            "Friends",
            vec![
                Addr::new(None, "[email protected]"),
                Addr::new("John Smîth".into(), "[email protected]"),
            ]
        )
    ]
);

assert_eq!(
    message.date().unwrap().to_rfc3339(),
    "2021-11-20T14:22:01-08:00"
);

// RFC2047 support for encoded text in message headers
assert_eq!(
    message.subject().unwrap(),
    "Why not both importing AND exporting? ☺"
);

// HTML and text body parts are returned conforming to RFC8621, Section 4.1.4
assert_eq!(
    message.body_html(0).unwrap(),
    concat!(
        "<html><p>I was thinking about quitting the &ldquo;exporting&rdquo; to ",
        "focus just on the &ldquo;importing&rdquo;,</p><p>but then I thought,",
        " why not do both? &#x263A;</p></html>"
    )
);

// HTML parts are converted to plain text (and vice versa) when missing
assert_eq!(
    message.body_text(0).unwrap(),
    concat!(
        "I was thinking about quitting the “exporting” to focus just on the",
        " “importing”,\nbut then I thought, why not do both? ☺\n"
    )
);

// Supports nested messages as well as multipart/digest
let nested_message = message
    .attachment(0)
    .unwrap()
    .message()
    .unwrap();

assert_eq!(
    nested_message.subject().unwrap(),
    "Exporting my book about coffee tables"
);

// Handles UTF-* as well as many legacy encodings
assert_eq!(
    nested_message.body_text(0).unwrap(),
    "ℌ𝔢𝔩𝔭 𝔪𝔢 𝔢𝔵𝔭𝔬𝔯𝔱 𝔪𝔶 𝔟𝔬𝔬𝔨 𝔭𝔩𝔢𝔞𝔰𝔢!"
);
assert_eq!(
    nested_message.body_html(0).unwrap(),
    "<html><body>ℌ𝔢𝔩𝔭 𝔪𝔢 𝔢𝔵𝔭𝔬𝔯𝔱 𝔪𝔶 𝔟𝔬𝔬𝔨 𝔭𝔩𝔢𝔞𝔰𝔢!</body></html>"
);

let nested_attachment = nested_message.attachment(0).unwrap();

assert_eq!(nested_attachment.len(), 42);

// Full RFC2231 support for continuations and character sets
assert_eq!(
    nested_attachment.attachment_name().unwrap(),
    "Book about ☕ tables.gif"
);

// Integrates with Serde
println!("{}", serde_json::to_string_pretty(&message).unwrap());
```

More examples are available under the [examples](examples) directory. Please note that this library does not support building e-mail messages; that functionality is provided separately by the [`mail-builder`](https://crates.io/crates/mail-builder) crate.
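
The RFC 2231 continuation handling exercised by the `name*0`, `name*1` and `name*2*` parameters in the example above can be sketched with the standard library alone. This is an illustration of the technique, not the library's code: `percent_decode` and `reassemble` are hypothetical helpers, the parameters are assumed to be already split into key/value pairs, and the charset is assumed to be UTF-8.

```rust
/// Decodes `%XX` escapes into raw bytes; malformed escapes are copied through unchanged.
fn percent_decode(input: &str) -> Vec<u8> {
    let hex = |b: u8| (b as char).to_digit(16).map(|d| d as u8);
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let (Some(hi), Some(lo)) = (hex(bytes[i + 1]), hex(bytes[i + 2])) {
                out.push(hi * 16 + lo);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    out
}

/// Joins the `name*N` / `name*N*` segments of one parameter in index order.
fn reassemble(params: &[(&str, &str)], name: &str) -> Option<String> {
    let mut segments: Vec<(u32, Vec<u8>)> = Vec::new();
    for &(key, value) in params {
        // Keys look like `name*0` (plain) or `name*2*` (percent-encoded).
        let rest = match key.strip_prefix(name).and_then(|r| r.strip_prefix('*')) {
            Some(rest) => rest,
            None => continue,
        };
        let (index, extended) = match rest.strip_suffix('*') {
            Some(index) => (index, true),
            None => (rest, false),
        };
        let index: u32 = match index.parse() {
            Ok(index) => index,
            Err(_) => continue,
        };
        let bytes = if extended {
            // Drop an optional `charset'language'` prefix; UTF-8 is assumed here.
            let encoded = value.rsplit_once('\'').map_or(value, |(_, data)| data);
            percent_decode(encoded)
        } else {
            value.as_bytes().to_vec()
        };
        segments.push((index, bytes));
    }
    if segments.is_empty() {
        return None;
    }
    segments.sort_by_key(|(index, _)| *index);
    let joined: Vec<u8> = segments.into_iter().flat_map(|(_, bytes)| bytes).collect();
    Some(String::from_utf8_lossy(&joined).into_owned())
}
```

Feeding it the three `name` segments from the example, in the order they appear in the header, produces `Book about ☕ tables.gif`, because the segments are sorted by index before being joined.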

## Testing, Fuzzing & Benchmarking

To run the test suite:

```bash
$ cargo test --all-features
```

or, to run the test suite with MIRI:

```bash
$ cargo +nightly miri test --all-features
```

To fuzz the library with `cargo-fuzz`:

```bash
$ cargo +nightly fuzz run mail_parser
```

and, to run the benchmarks:

```bash
$ cargo +nightly bench --all-features
```

## Conformed RFCs

- [RFC 822 - Standard for ARPA Internet Text Messages](https://datatracker.ietf.org/doc/html/rfc822)
- [RFC 5322 - Internet Message Format](https://datatracker.ietf.org/doc/html/rfc5322)
- [RFC 2045 - Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies](https://datatracker.ietf.org/doc/html/rfc2045)
- [RFC 2046 - Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types](https://datatracker.ietf.org/doc/html/rfc2046)
- [RFC 2047 - MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text](https://datatracker.ietf.org/doc/html/rfc2047)
- [RFC 2048 - Multipurpose Internet Mail Extensions (MIME) Part Four: Registration Procedures](https://datatracker.ietf.org/doc/html/rfc2048)
- [RFC 2049 - Multipurpose Internet Mail Extensions (MIME) Part Five: Conformance Criteria and Examples](https://datatracker.ietf.org/doc/html/rfc2049)
- [RFC 2231 - MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations](https://datatracker.ietf.org/doc/html/rfc2231)
- [RFC 2557 - MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)](https://datatracker.ietf.org/doc/html/rfc2557)
- [RFC 2183 - Communicating Presentation Information in Internet Messages: The Content-Disposition Header Field](https://datatracker.ietf.org/doc/html/rfc2183)
- [RFC 2392 - Content-ID and Message-ID Uniform Resource Locators](https://datatracker.ietf.org/doc/html/rfc2392)
- [RFC 3282 - Content Language Headers](https://datatracker.ietf.org/doc/html/rfc3282)
- [RFC 6532 - Internationalized Email Headers](https://datatracker.ietf.org/doc/html/rfc6532)
- [RFC 2152 - UTF-7 - A Mail-Safe Transformation Format of Unicode](https://datatracker.ietf.org/doc/html/rfc2152)
- [RFC 2369 - The Use of URLs as Meta-Syntax for Core Mail List Commands and their Transport through Message Header Fields](https://datatracker.ietf.org/doc/html/rfc2369)
- [RFC 2919 - List-Id: A Structured Field and Namespace for the Identification of Mailing Lists](https://datatracker.ietf.org/doc/html/rfc2919)
- [RFC 3339 - Date and Time on the Internet: Timestamps](https://datatracker.ietf.org/doc/html/rfc3339)
- [RFC 8621 - The JSON Meta Application Protocol (JMAP) for Mail (Section 4.1.4)](https://datatracker.ietf.org/doc/html/rfc8621#section-4.1.4)
- [RFC 5256 - Internet Message Access Protocol - SORT and THREAD Extensions (Section 2.1)](https://datatracker.ietf.org/doc/html/rfc5256#section-2.1)

## Supported Character Sets

- UTF-8
- UTF-16, UTF-16BE, UTF-16LE
- UTF-7
- US-ASCII
- ISO-8859-1
- ISO-8859-2
- ISO-8859-3
- ISO-8859-4
- ISO-8859-5
- ISO-8859-6
- ISO-8859-7
- ISO-8859-8
- ISO-8859-9
- ISO-8859-10
- ISO-8859-13
- ISO-8859-14
- ISO-8859-15
- ISO-8859-16
- CP1250
- CP1251
- CP1252
- CP1253
- CP1254
- CP1255
- CP1256
- CP1257
- CP1258
- KOI8-R
- KOI8-U
- MACINTOSH
- IBM850
- TIS-620

Supported character sets via the optional dependency [encoding_rs](https://crates.io/crates/encoding_rs):

- SHIFT_JIS
- BIG5
- EUC-JP
- EUC-KR
- GB18030
- GBK
- ISO-2022-JP
- WINDOWS-874
- IBM-866

## License

Licensed under either of

* Apache License, Version 2.0 ([LICENSE-APACHE](LICENSE-APACHE) or http://www.apache.org/licenses/LICENSE-2.0)
* MIT license ([LICENSE-MIT](LICENSE-MIT) or http://opensource.org/licenses/MIT)

at your option.

## Copyright

Copyright (C) 2020-2022, Stalwart Labs Ltd.