# Minke
**Graph extraction and NLP analysis for Baleen Corpora**

[![Build Status][travis_img]][travis_href]
[![Coverage Status][coveralls_img]][coverals_href]
[![Code Health][health_img]][health_href]
[![Stories in Ready][waffle_img]][waffle_href]

[![Minke Whale](docs/images/minke.jpg)][minkewhale.jpg]

## Quickstart

Minke provides a command-line script called `sei` for working with the Minke library and Baleen corpora. For example, to sample a corpus down to a smaller subset for testing or development, run:

```
$ ./sei sample path/to/corpus path/to/sample
```
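To illustrate what sampling a corpus involves, here is a minimal sketch in plain Python. It is not Minke's implementation: the function name `sample_corpus`, the `fraction` and `seed` parameters, and the assumption that a corpus is a directory tree of document files are all hypothetical.

```python
import random
import shutil
from pathlib import Path


def sample_corpus(src, dst, fraction=0.1, seed=42):
    """Copy a random subset of the files under src into dst.

    Hypothetical helper: assumes the corpus is a directory tree of files.
    Relative paths are preserved so any category folders survive the copy.
    """
    src, dst = Path(src), Path(dst)
    # Sort first so the same seed always selects the same files.
    files = sorted(p for p in src.rglob("*") if p.is_file())
    if not files:
        return []
    # Keep at least one file, and never ask for more than exist.
    k = min(len(files), max(1, round(len(files) * fraction)))
    chosen = sorted(random.Random(seed).sample(files, k))
    for path in chosen:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return [p.relative_to(src) for p in chosen]
```

Fixing the seed makes the sample reproducible, which helps when the subset backs a test suite.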

You can describe corpora using the `describe` command as follows:

```
$ ./sei describe path/to/corpus
```
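As a rough idea of the kind of summary a describe step could produce, the sketch below counts files and bytes per category. It is an assumption-laden illustration, not the output or code of `sei describe`: the function `describe_corpus` is hypothetical, and it assumes each top-level directory of the corpus is a category.

```python
from collections import Counter
from pathlib import Path


def describe_corpus(root):
    """Summarise a corpus directory (hypothetical layout: root/category/file)."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file()]
    per_category = Counter()
    for path in files:
        parts = path.relative_to(root).parts
        # Files directly under the root have no category folder.
        per_category[parts[0] if len(parts) > 1 else "(root)"] += 1
    return {
        "files": len(files),
        "bytes": sum(p.stat().st_size for p in files),
        "categories": dict(per_category),
    }
```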

You can also preprocess an HTML corpus into a pickled corpus:

```
$ ./sei preprocess path/to/html/corpus path/to/pickled/corpus
```
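The general idea of turning an HTML corpus into a pickled one can be sketched with the standard library alone. This is an illustration under stated assumptions, not what `sei preprocess` actually does: the names `ParagraphExtractor` and `preprocess` are hypothetical, and the sketch assumes each document is an `.html` file whose content lives in `<p>` elements. Minke's real preprocessing may extract and store something different.

```python
import pickle
from html.parser import HTMLParser
from pathlib import Path


class ParagraphExtractor(HTMLParser):
    """Collect the whitespace-normalised text of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buffer = []

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep text from a trailing unclosed <p>


def preprocess(html_dir, pickle_dir):
    """Write one pickle (a list of paragraph strings) per HTML file."""
    html_dir, pickle_dir = Path(html_dir), Path(pickle_dir)
    written = []
    for path in sorted(html_dir.rglob("*.html")):
        parser = ParagraphExtractor()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        target = (pickle_dir / path.relative_to(html_dir)).with_suffix(".pickle")
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("wb") as f:
            pickle.dump(parser.paragraphs, f, pickle.HIGHEST_PROTOCOL)
        written.append(target)
    return written
```

Pickling the parsed form once means later analysis can load documents without re-parsing the HTML. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted data.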

Many more options and configurations are available; use `./sei --help` for more information and refer to the `conf/minke-example.conf` configuration file.

## About

The [Baleen](https://github.com/bbengfort/baleen) ingestion tool is used to create a corpus of web articles and blogs from RSS feeds. Minke extends Baleen with a library for text analysis and graph extraction on the exported corpora.

Baleen means “whale bone” and refers specifically to the straining plates that whales of the Mysticeti suborder have. These plates filter food from water, much as the Baleen ingestion engine filters content from the web. [Minke whales](https://en.wikipedia.org/wiki/Minke_whale) are [rorqual whales](https://seaworld.org/Animal-Info/Animal-InfoBooks/Baleen-Whales/Scientific-Classification), and among the smallest of them. This library is named to indicate that it is a short version of the larger Baleen codebase.

### Throughput

[![Throughput Graph](https://graphs.waffle.io/bbengfort/minke/throughput.svg)](https://waffle.io/bbengfort/minke/metrics)

### Attribution

The image used in this README, ["Minke whale 1"][minkewhale.jpg] by [Len2040](https://www.flickr.com/photos/lenjoh/), is licensed under [CC BY-ND 2.0](https://creativecommons.org/licenses/by-nd/2.0/).

[travis_img]: https://travis-ci.org/bbengfort/minke.svg?branch=master
[travis_href]: https://travis-ci.org/bbengfort/minke/
[coveralls_img]: https://coveralls.io/repos/github/bbengfort/minke/badge.svg?branch=master
[coverals_href]: https://coveralls.io/github/bbengfort/minke?branch=master
[health_img]: https://landscape.io/github/bbengfort/minke/master/landscape.svg?style=flat
[health_href]: https://landscape.io/github/bbengfort/minke/master
[waffle_img]: https://badge.waffle.io/bbengfort/minke.png?label=ready&title=Ready
[waffle_href]: https://waffle.io/bbengfort/minke
[minkewhale.jpg]: https://flic.kr/p/e9s7Z3