https://github.com/sckott/textminer
text mine via Crossref's TDM
crossref literature text-mining
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/sckott/textminer
- Owner: sckott
- Created: 2015-08-22T02:48:40.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2017-08-17T16:52:19.000Z (about 8 years ago)
- Last Synced: 2025-04-18T18:29:28.967Z (6 months ago)
- Topics: crossref, literature, text-mining
- Language: Ruby
- Homepage:
- Size: 40 KB
- Stars: 4
- Watchers: 1
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
textminer
=========

__`textminer` helps you text mine through Crossref's TDM (Text & Data Mining) services.__

## Changes
For changes see the [CHANGELOG][changelog]
## gem API
* `Textminer.search` - search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full text links. This method essentially wraps `Serrano.works()`, but exposes only a subset of its parameters. This interface may change depending on feedback.
* `Textminer.fetch` - fetch full text given a URL; supports Crossref's Text and Data Mining service
* `Textminer.extract` - extract text from a PDF

## Install

### Release version
```
gem install textminer
```

### Development version

```
git clone git@github.com:sckott/textminer.git
cd textminer
rake install
```

## Examples

### Within Ruby
#### Search
Search by DOI
```ruby
require 'textminer'
# link to full text available
Textminer.search(doi: '10.7554/elife.06430')
# no link to full text available
Textminer.search(doi: "10.1371/journal.pone.0000308")
```

Many DOIs at once

```ruby
require 'serrano'
dois = Serrano.random_dois(sample: 6)
Textminer.search(doi: dois)
```

Search with filters

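
Crossref's REST API takes filters as comma-separated `name:value` pairs with hyphenated names (for example `has-full-text:true`). The sketch below shows one way a Ruby filter hash could be turned into that form. The helper name `crossref_filter` is hypothetical and is not part of textminer or Serrano; it only illustrates the mapping.

```ruby
# Hypothetical helper: convert a Ruby filter hash into the comma-separated
# "name:value" string used by Crossref's `filter` query parameter.
# Underscores in keys become hyphens, as in Crossref's filter names.
def crossref_filter(filters)
  filters.map { |name, value| "#{name.to_s.tr('_', '-')}:#{value}" }.join(",")
end

filter_string = crossref_filter({ has_full_text: true, from_pub_date: "2015-01" })
puts filter_string
```

With the gem, the filter is passed as a plain hash: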
```ruby
Textminer.search(filter: {has_full_text: true})
```

#### Get full text links

The object returned from `Textminer.search` has methods for pulling out all links, XML links only, PDF links only, or plain text links only.
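
To make the idea concrete, here is a minimal sketch of selecting links by content type from a Crossref-style record. The record is made-up sample data shaped like the `link` array in Crossref metadata, and `links_of_type` is a hypothetical helper, not the gem's implementation.

```ruby
# Made-up work record; the "link" entries follow the shape of Crossref
# metadata ("URL" and "content-type" keys), but the values are invented.
work = {
  "DOI" => "10.5555/example",
  "link" => [
    { "URL" => "https://example.org/article.xml", "content-type" => "application/xml" },
    { "URL" => "https://example.org/article.pdf", "content-type" => "application/pdf" },
    { "URL" => "https://example.org/article.txt", "content-type" => "text/plain" }
  ]
}

# Hypothetical helper: return the URLs of all links with the given content type.
# Records without a "link" key yield an empty array.
def links_of_type(work, type)
  work.fetch("link", [])
      .select { |link| link["content-type"] == type }
      .map { |link| link["URL"] }
end

xml_links   = links_of_type(work, "application/xml")
pdf_links   = links_of_type(work, "application/pdf")
plain_links = links_of_type(work, "text/plain")
puts xml_links, pdf_links, plain_links
```

With the gem, the corresponding accessors are called on the search result: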
```ruby
x = Textminer.search(filter: {has_full_text: true})
x.links_xml
x.links_pdf
x.links_plain
```

#### Fetch full text

`Textminer.fetch()` gets full text based on URL input. We determine how to pull down and parse the content based on content type.
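
As a rough illustration of dispatching on content type, the sketch below maps a content-type header value to a parsing strategy. The method name `parser_for` and the strategy symbols are assumptions made for this example; they are not taken from textminer's source.

```ruby
# Hypothetical helper: choose a parsing strategy from a content-type string.
# The returned symbols are illustrative only.
def parser_for(content_type)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  mime = content_type.to_s.split(";").first.to_s.strip.downcase
  case mime
  when "application/xml", "text/xml" then :xml
  when "application/pdf"             then :pdf
  when "text/plain"                  then :plain
  else :unknown
  end
end

puts parser_for("text/xml; charset=utf-8")
puts parser_for("application/pdf")
puts parser_for("image/png")
```

In the gem, the content type is available on the fetched object, as the following example shows: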
```ruby
# get some metadata
res = Textminer.search(member: 2258, filter: {has_full_text: true})
# get links
links = res.links_xml(true)
# Get full text for an article
res = Textminer.fetch(url: links[0])
# url
res.url
# file path
res.path
# content type
res.type
# parse content
res.parse
```

#### Extract text from PDF

`Textminer.extract()` extracts text from a PDF, given the path to the file
```ruby
res = Textminer.search(member: 2258, filter: {has_full_text: true})
links = res.links_pdf(true)
res = Textminer.fetch(url: links[0])
Textminer.extract(res.path)
```

### On the CLI

Coming soon...
## To do
* CLI executable
* better test suite
* better documentation

[changelog]: https://github.com/sckott/textminer/blob/master/CHANGELOG.md