https://github.com/philgyford/mailman-archive-scraper

Python script that scrapes public and private Mailman archive pages and republishes them to local files, and generates an RSS feed of recent emails.
https://github.com/philgyford/mailman-archive-scraper

mailman mailman-archive python rss rss-feed

Last synced: about 1 month ago
JSON representation

Python script that scrapes public and private Mailman archive pages and republishes them to local files, and generates an RSS feed of recent emails.

Host: GitHub
URL: https://github.com/philgyford/mailman-archive-scraper
Owner: philgyford
Created: 2009-03-27T18:28:59.000Z (about 16 years ago)
Default Branch: master
Last Pushed: 2021-08-23T14:34:23.000Z (over 3 years ago)
Last Synced: 2024-10-12T07:28:01.098Z (7 months ago)
Topics: mailman, mailman-archive, python, rss, rss-feed
Language: Python
Homepage: http://www.gyford.com/phil/writing/2009/05/13/mailman_archive_scraper.php
Size: 38.1 KB
Stars: 53
Watchers: 5
Forks: 18
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Mailman Archive Scraper

By Phil Gyford
v1.4.1, 2015-09-12

Latest version is available from

These scripts will scrape the archive pages generated by the [Mailman mailing list manager](http://www.gnu.org/software/mailman/index.html) and republish them as files on the local file system. In addition it can optionally do a number of things:

* Create an RSS feed of recent messages.
* Scrape private Mailman archives (if you have a valid email address and password).
* Remove all email addresses from the files (both those in '[email protected]' and 'phil at gyford dot com' format).
* Replace the URL for the 'more info on this list' links with another.
* Remove one or more levels of quoted emails.
* Search and replace any custom strings you specify.
* Add custom HTML into the `` section of the re-published pages.

Why would you want to do this? Three reasons:

1. You want to create your own HTML archive of a mailing list hosted elsewhere.

2. You want to create a public version of a private archive. We hope you have permission to do this of course. The tools mentioned above allow you to do things like anonymise names and phone numbers, remove email addresses, etc.

3. To have an RSS feed of recent messages.

There may be more efficient ways to do this if you have access to the database in which the Mailman archive is stored. If you don't, and can only access the web pages, this script is for you.

This script doesn't store any state locally between sessions so every time it's run it will have to scrape several pages, even if nothing's changed (particularly if you want an RSS feed of n recent messages). There is a half second delay between each fetch of a remote page, which slows things up but will hopefully prevent hammering web servers.

**There are caveats.** This seems to work with the few Mailman archives tried. I'm sure that some people will find problems with different installations -- unscrapeable HTML, different URLs and filepaths, etc. Feel free to suggest fixes.

## Installation

1. Put the directory containing the MailmanArchiveScraper.py script somewhere you want to run it from.

2. Make a copy of the `MailmanArchiveScraper-example.cfg` file and name it `MailmanArchiveScraper.cfg`.

3. Set the configuration options in that file (see below).

4. Install the required python modules, best done using [pip](https://pypi.python.org/pypi/pip):

$ pip install -r requirements.txt

5. Make sure the `MailmanArchiveScraper.py` script is executable (`chmod +x`). And the `MailmanGzTextScraper.py` script if you need that too.

## Configuration

There is help in the configuration file for each setting. The minimum things you'll need to set are:

1. `domain` -- The domain name that your Mailman pages are on.
2. `list_name` -- Name of your mailing list.
3. `email` and `password` -- Required if your Mailman archive is password protected.
4. `publish_dir` -- The path to the local directory the files should be republished to.
5. `publish_url` -- If you're going to publish the messages to a website.

By default the script uses a single `MailmanArchiveScraper.cfg` file in the same directory as the script, but you can specify multiple files, in different locations, instead (see Usage).

## Usage

Once configuration is done, run the script:

$ python ./MailmanArchiveScraper.py

All being well, the HTML archive files will be downloaded. Set the `verbose` setting in the configuration file to see a list of which files are being fetched.

If you want to download the plaintext files that Mailman saves for each month's messages (which may be gzipped), then run this script:

$ python ./MailmanGzTextScraper.py

After an initial run, you can run the script via cron to keep an updated copy of the HTML and/or text files. Note the `hours_to_go_back` setting in the config file, which wil probably need to be different for the first run compared to subsequent, regular runs.

By default the script runs once, using `MailmanArchiveScraper.cfg`. You can specify multiple config files and the script will run once for each file. For example, run the script with:

$ python ./MailmanArchiveScraper.py ~/lists/*.cfg

and the script will run once for each of the `*.cfg` files found in the `~/lists/` directory.

## What would also be nice:

* Sending each message on as an email. I can't see how to do this simply, given that we retain no state between times the script is run, so can't tell which emails haven't previously been sent.

## Contributors

Many thanks to:

* [CyberRodent](https://github.com/cyberrodent) for the text/gzip file archiving.
* [Danny O'Brien](https://github.com/dannyob) for https support.
* [Andrew Bibby](https://github.com/bibby) for supporting multiple config files, and better handling missing gzip archives.

## Version history

See all versions: https://github.com/philgyford/mailman-archive-scraper/releases

* **v1.4.1** 2015-09-12
Better handle missing gzip archives.

* **v1.4** 2015-03-15
Multiple config files can be specified.

* **v1.3** 2014-10-26
Add support for non-English language installations.

* **v1.2** 2013-10-12
The monthly text files, possibly gzipped, can be archived using a new script.

* **v1.13b** 2013-10-12
The script can now archive files served over https.

* **v1.13** 2013-10-12
Various improvements to the RSS files generated.

* **v1.12** 2013-10-12
The script can now generate an RSS feed of the most recent posts to the mailing list.

* **v1.0** 2013-10-12
Initial release

## Similar projects
- python - [doc_curation](https://github.com/sanskrit-coders/doc_curation/blob/master/doc_curation/mail_stream/mailman.py) dumps emails to markdown files organized by year/month/subject.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/philgyford/mailman-archive-scraper

Awesome Lists containing this project

README