Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vmarkovtsev/ggmbox
Google Groups raw email crawler and parser
- Host: GitHub
- URL: https://github.com/vmarkovtsev/ggmbox
- Owner: vmarkovtsev
- License: mit
- Created: 2018-02-24T10:41:56.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2018-12-10T07:27:10.000Z (almost 6 years ago)
- Last Synced: 2024-06-20T10:22:13.109Z (5 months ago)
- Language: Python
- Homepage:
- Size: 14.6 KB
- Stars: 8
- Watchers: 3
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
ggmbox [![Build Status](https://travis-ci.org/vmarkovtsev/ggmbox.svg?branch=master)](https://travis-ci.org/vmarkovtsev/ggmbox) [![Build status](https://ci.appveyor.com/api/projects/status/x57poug9apd0bs2h?svg=true)](https://ci.appveyor.com/project/vmarkovtsev/ggmbox) [![Docker Build Status](https://img.shields.io/docker/build/vmarkovtsev/ggmbox.svg)](https://hub.docker.com/r/vmarkovtsev/ggmbox)
======

Google Groups raw email crawler and parser. Fast and reliable!
The downloaded messages are in [RFC 822](https://www.ietf.org/rfc/rfc822.txt) format, taken verbatim
from the Google servers.

### Installation
#### Docker
Docker is the simplest option. Go to [![DockerHub](https://img.shields.io/docker/build/vmarkovtsev/ggmbox.svg)](https://hub.docker.com/r/vmarkovtsev/ggmbox)
Prepend `docker run -it --rm vmarkovtsev/ggmbox` to all the commands in the "Usage" section.

#### Crawler
Requirements: [Python 3](https://www.python.org/) and [Scrapy](https://scrapy.org/). Download
the [`ggmbox.py`](ggmbox.py) file.

#### Parser
Requirements: [Go](https://golang.org/).
```
go get -v github.com/vmarkovtsev/ggmbox
```

### Usage
#### Crawler
```
scrapy runspider -a name=golang-nuts -o result.json -t json ggmbox.py
```

Replace "golang-nuts" with the actual group name. The raw emails will be saved by default to the
corresponding directory.

```
scrapy runspider -a name=chromium-dev -a prefix=a/chromium.org -o result.json -t json ggmbox.py
```

Note the `prefix` argument: it sets the name of the parent (here `a/chromium.org`). Some groups require it.
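
When crawling several groups, the two invocations above can be generated from one helper. The following is a minimal sketch, not part of ggmbox: the function name `build_crawl_command` and its parameters are hypothetical, and it only assembles the same arguments shown above (`-a name=...`, optional `-a prefix=...`, `-o result.json -t json ggmbox.py`).

```python
import shlex
from typing import List, Optional


def build_crawl_command(group: str,
                        prefix: Optional[str] = None,
                        output: str = "result.json",
                        spider: str = "ggmbox.py") -> List[str]:
    """Assemble the `scrapy runspider` argument list for one group.

    Hypothetical helper: it mirrors the commands from the README and
    adds the `prefix` spider argument only when one is given.
    """
    cmd = ["scrapy", "runspider", "-a", f"name={group}"]
    if prefix:
        # Some groups live under a parent, e.g. "a/chromium.org".
        cmd += ["-a", f"prefix={prefix}"]
    cmd += ["-o", output, "-t", "json", spider]
    return cmd


# Print the shell-quoted commands instead of running them.
print(shlex.join(build_crawl_command("golang-nuts")))
print(shlex.join(build_crawl_command("chromium-dev", prefix="a/chromium.org")))
```

To actually start a crawl, the returned list can be passed to `subprocess.run(cmd, check=True)` from a directory that contains `ggmbox.py`, with Scrapy installed.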
#### Parser
```
./parse golang-nuts > dataset.csv
```

Replace "golang-nuts" with the actual directory name with raw emails. The plain text threads will
be written to `dataset.csv`, one thread per line. Special characters are escaped.

### Performance
#### Crawler
The [golang-nuts](https://groups.google.com/d/forum/golang-nuts) group was fully fetched on 24/02/2018, with
30043 topics and 192654 messages, **in 3 hours** on a 1 Gbps connection.
The raw emails occupied 1.6 GB on disk.

For comparison, [icy/google-group-crawler](https://github.com/icy/google-group-crawler) ran for 1 day,
fetched only 63%, and then stopped without reporting any errors;
[henryk/gggd](https://github.com/henryk/gggd) fetched only 3% within one hour and then
unexpectedly stopped too.

#### Parser
It takes **7 seconds** to parse 1.6 GB of raw emails on a 32-core machine.
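
The figures in this section translate into rough throughput numbers. The sketch below is only back-of-the-envelope arithmetic on the values quoted above; it assumes the "3 hours" is exact and that 1 GB means 1000 MB, neither of which the measurements state, so treat the results as approximate.

```python
# Figures quoted in the Performance section.
MESSAGES = 192654
CRAWL_HOURS = 3
RAW_SIZE_GB = 1.6
PARSE_SECONDS = 7

# Assumption: 1 GB = 1000 MB (the README does not say GB vs GiB).
raw_size_mb = RAW_SIZE_GB * 1000
crawl_seconds = CRAWL_HOURS * 3600

messages_per_second = MESSAGES / crawl_seconds      # crawler, messages/s
crawl_mb_per_second = raw_size_mb / crawl_seconds   # crawler, MB/s
parse_mb_per_second = raw_size_mb / PARSE_SECONDS   # parser, MB/s (all cores)
avg_message_kb = raw_size_mb * 1000 / MESSAGES      # mean raw email size, KB

print(f"crawler: {messages_per_second:.1f} messages/s, "
      f"{crawl_mb_per_second:.2f} MB/s")
print(f"parser:  {parse_mb_per_second:.0f} MB/s")
print(f"average raw message: {avg_message_kb:.1f} KB")
```

Under these assumptions the crawler averages about 18 messages per second, far below what a 1 Gbps link can carry, while the parser processes on the order of 230 MB per second across the 32 cores.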
### Contributions
...are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) and [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md).
### License
[MIT](LICENSE).