https://github.com/districtdatalabs/minke
Graph extraction and NLP analysis for Baleen Corpora
- Host: GitHub
- URL: https://github.com/districtdatalabs/minke
- Owner: DistrictDataLabs
- License: mit
- Created: 2016-04-19T12:33:24.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-09-08T15:11:10.000Z (over 9 years ago)
- Last Synced: 2025-04-05T02:01:37.500Z (9 months ago)
- Language: Python
- Size: 10.3 MB
- Stars: 18
- Watchers: 4
- Forks: 11
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Minke
**Graph extraction and NLP analysis for Baleen Corpora**
[![Build Status][travis_img]][travis_href]
[![Coverage Status][coveralls_img]][coverals_href]
[![Code Health][health_img]][health_href]
[![Stories in Ready][waffle_img]][waffle_href]
[Minke whale 1][minkewhale.jpg]
## Quickstart
Minke provides a command-line script called `sei` for working with the Minke library and Baleen corpora. For example, to sample a corpus down to a smaller subset for testing or development, run:
```
$ ./sei sample path/to/corpus path/to/sample
```
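To give a feel for what sampling a corpus involves, here is a minimal sketch in plain Python. It is not Minke's implementation: the function name `sample_corpus`, the `fraction` and `seed` parameters, and the assumption that a corpus is simply a directory tree of files are all hypothetical and chosen for illustration.

```python
import random
import shutil
from pathlib import Path


def sample_corpus(source, target, fraction=0.1, seed=42):
    """Copy a random fraction of the files under `source` into `target`.

    Hypothetical helper for illustration; not part of Minke's API.
    """
    source, target = Path(source), Path(target)
    # Sort first so a given seed selects the same files on every run.
    files = sorted(p for p in source.rglob("*") if p.is_file())
    if not files:
        return []

    # Keep at least one file, and never ask for more than exist.
    count = min(len(files), max(1, round(len(files) * fraction)))
    chosen = random.Random(seed).sample(files, count)

    copied = []
    for path in chosen:
        # Preserve the relative layout (for example, one directory per category).
        dest = target / path.relative_to(source)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)
        copied.append(dest)
    return copied
```

Seeding the generator makes the sample reproducible, which is useful when a test suite depends on a fixed subset of documents.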
You can describe corpora using the `describe` command as follows:
```
$ ./sei describe path/to/corpus
```
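As a rough illustration of the kind of summary a describe step might produce, the sketch below counts files and bytes per category. It does not reproduce the output of `sei describe`: the function `describe_corpus`, the returned dictionary layout, and the assumption of one subdirectory per category are hypothetical.

```python
from collections import Counter
from pathlib import Path


def describe_corpus(root):
    """Summarize a corpus directory: file count and total bytes per category.

    Hypothetical helper; assumes one subdirectory per category.
    """
    root = Path(root)
    files = Counter()
    sizes = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            # Files directly under the root get a placeholder category.
            category = rel.parts[0] if len(rel.parts) > 1 else "(root)"
            files[category] += 1
            sizes[category] += path.stat().st_size
    return {
        "files": sum(files.values()),
        "bytes": sum(sizes.values()),
        "categories": {
            c: {"files": files[c], "bytes": sizes[c]} for c in sorted(files)
        },
    }
```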
And you can preprocess a corpus into a pickled corpus:
```
$ ./sei preprocess path/to/html/corpus path/to/pickled/corpus
```
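The idea of turning an HTML corpus into a pickled one can be sketched with the standard library alone. This is an assumption-laden illustration, not what `sei preprocess` actually does: the `ParagraphExtractor` class, the `preprocess_file` function, the naive regular-expression tokenizer, and the pickled structure (a list of paragraphs, each a list of tokens) are all hypothetical.

```python
import pickle
import re
from html.parser import HTMLParser
from pathlib import Path


class ParagraphExtractor(HTMLParser):
    """Collect the text inside each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def preprocess_file(html_path, pickle_path):
    """Tokenize one HTML file and pickle it as a list of token lists."""
    parser = ParagraphExtractor()
    parser.feed(Path(html_path).read_text(encoding="utf-8"))
    parser.close()
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    document = [re.findall(r"\w+|[^\w\s]", p) for p in parser.paragraphs]
    pickle_path = Path(pickle_path)
    pickle_path.parent.mkdir(parents=True, exist_ok=True)
    with open(pickle_path, "wb") as f:
        pickle.dump(document, f, pickle.HIGHEST_PROTOCOL)
    return document
```

Doing the parsing and tokenization once and storing the result means later analysis can load documents with `pickle.load` instead of re-parsing the HTML each time.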
Many more options and configurations are available; use `./sei --help` for more information and refer to the `conf/minke-example.conf` configuration file.
## About
The [Baleen](https://github.com/bbengfort/baleen) ingestion tool is used to create a corpus of web articles and blogs from RSS feeds. Minke extends Baleen with a library for text analysis and graph extraction on the exported corpora.
Baleen means “whale bone” and refers specifically to the filtering plates in the mouths of whales of the suborder Mysticeti. These plates filter food from water, just as the Baleen ingestion engine filters content from the web. [Minke whales](https://en.wikipedia.org/wiki/Minke_whale) are a species of [rorqual whale](https://seaworld.org/Animal-Info/Animal-InfoBooks/Baleen-Whales/Scientific-Classification), one of the shortest in fact. This library is so named because it is a short version of the larger Baleen codebase.
### Throughput
[Throughput graph](https://waffle.io/bbengfort/minke/metrics)
### Attribution
The image used in this README, ["Minke whale 1"][minkewhale.jpg] by [Len2040](https://www.flickr.com/photos/lenjoh/), is licensed under [CC BY-ND 2.0](https://creativecommons.org/licenses/by-nd/2.0/).
[travis_img]: https://travis-ci.org/bbengfort/minke.svg?branch=master
[travis_href]: https://travis-ci.org/bbengfort/minke/
[coveralls_img]: https://coveralls.io/repos/github/bbengfort/minke/badge.svg?branch=master
[coverals_href]: https://coveralls.io/github/bbengfort/minke?branch=master
[health_img]: https://landscape.io/github/bbengfort/minke/master/landscape.svg?style=flat
[health_href]: https://landscape.io/github/bbengfort/minke/master
[waffle_img]: https://badge.waffle.io/bbengfort/minke.png?label=ready&title=Ready
[waffle_href]: https://waffle.io/bbengfort/minke
[minkewhale.jpg]: https://flic.kr/p/e9s7Z3