https://github.com/sckott/textminer

text mine via Crossref's TDM
crossref literature text-mining

Host: GitHub
URL: https://github.com/sckott/textminer
Owner: sckott
Created: 2015-08-22T02:48:40.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2017-08-17T16:52:19.000Z (about 8 years ago)
Last Synced: 2025-04-18T18:29:28.967Z (6 months ago)
Topics: crossref, literature, text-mining
Language: Ruby
Homepage:
Size: 40 KB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

textminer
=========

[![gem version](https://img.shields.io/gem/v/textminer.svg)](https://rubygems.org/gems/textminer)

[![Build Status](https://travis-ci.org/sckott/textminer.svg?branch=master)](https://travis-ci.org/sckott/textminer)

[![codecov.io](http://codecov.io/github/sckott/textminer/coverage.svg?branch=master)](http://codecov.io/github/sckott/textminer?branch=master)

__`textminer` helps you text mine through Crossref's TDM (Text & Data Mining) services.__

## Changes

For changes see the [CHANGELOG][changelog]

## gem API

* `Textminer.search` - Search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full text links. This method essentially wraps `Serrano.works()` but exposes only a subset of its parameters; the interface may change depending on feedback.

* `Textminer.fetch` - Fetch full text given a URL; supports Crossref's Text and Data Mining service

* `Textminer.extract` - Extract text from a PDF

## Install

### Release version

```
gem install textminer
```

### Development version

```
git clone git@github.com:sckott/textminer.git
cd textminer
rake install
```

## Examples

### Within Ruby

#### Search

Search by DOI

```ruby
require 'textminer'

# link to full text available
Textminer.search(doi: '10.7554/elife.06430')

# no link to full text available
Textminer.search(doi: '10.1371/journal.pone.0000308')
```

Many DOIs at once

```ruby
require 'serrano'
require 'textminer'

# draw six random DOIs from Crossref, then look them all up in one call
dois = Serrano.random_dois(sample: 6)
Textminer.search(doi: dois)
```

Search with filters

```ruby
Textminer.search(filter: {has_full_text: true})
```
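To make the filter hash less opaque, here is a minimal sketch of how such a hash could be turned into the comma-separated `name:value` string that the Crossref REST API accepts in its `filter` query parameter. The helper name `crossref_filter_string` is hypothetical and is not part of `textminer` or `serrano`; the underscore-to-hyphen conversion is an assumption about how Ruby-style keys map onto Crossref's hyphenated filter names.

```ruby
# Hypothetical helper, not part of textminer or serrano: builds the value of
# the Crossref REST API `filter` query parameter from a Ruby hash.
def crossref_filter_string(filters)
  filters.map do |name, value|
    # Crossref filter names are hyphenated (e.g. has-full-text), so convert
    # Ruby-style underscores before joining name and value.
    "#{name.to_s.tr('_', '-')}:#{value}"
  end.join(',')
end

puts crossref_filter_string({ has_full_text: true })
puts crossref_filter_string({ has_full_text: true, from_pub_date: '2015-01-01' })
```

Multiple filters are joined with commas, so the second call yields a single string that narrows results on both conditions.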

#### Get full text links

The object returned from `Textminer.search` has methods for pulling out all links, XML links only, PDF links only, or plain text links only.

```ruby
x = Textminer.search(filter: {has_full_text: true})

x.links_xml
x.links_pdf
x.links_plain
```
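As a rough illustration of what selecting links by format involves, the sketch below filters the `link` array of a Crossref work record by its `content-type` field. The sample record is hand-written with placeholder URLs, and `links_for` is a hypothetical helper; this is not how `textminer` itself is necessarily implemented.

```ruby
# Hand-written sample shaped like one Crossref work record. The DOI and URLs
# are placeholders, not real full-text links.
sample_work = {
  'DOI' => '10.5555/example',
  'link' => [
    { 'URL' => 'https://example.org/article.xml', 'content-type' => 'application/xml' },
    { 'URL' => 'https://example.org/article.pdf', 'content-type' => 'application/pdf' },
    { 'URL' => 'https://example.org/article.txt', 'content-type' => 'text/plain' }
  ]
}

# Hypothetical helper: return the URLs of a work's links, optionally
# restricted to a single content type. Works without links yield [].
def links_for(work, content_type = nil)
  links = work.fetch('link', [])
  links = links.select { |l| l['content-type'] == content_type } if content_type
  links.map { |l| l['URL'] }
end

p links_for(sample_work, 'application/xml')
p links_for(sample_work, 'application/pdf')
p links_for(sample_work).length
```

Returning an empty array for records without a `link` field keeps the helper safe to call on works that have no full text available.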

#### Fetch full text

`Textminer.fetch()` gets full text based on URL input. We determine how to pull down and parse the content based on content type.

```ruby
# get some metadata
res = Textminer.search(member: 2258, filter: {has_full_text: true})

# get links
links = res.links_xml(true)

# get full text for an article
article = Textminer.fetch(url: links[0])

# url
article.url

# file path
article.path

# content type
article.type

# parse content
article.parse
```
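The idea of choosing a parsing strategy from the content type can be sketched as a small dispatcher. The method name `parser_for`, the returned symbols, and the set of recognised media types are all assumptions made for illustration; they do not describe the internals of `Textminer.fetch()`.

```ruby
# Hypothetical dispatcher, not textminer's actual code: map an HTTP
# Content-Type header value to a symbol naming how to handle the body.
def parser_for(content_type)
  # Drop parameters such as "; charset=utf-8" and normalise case.
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase

  case media_type
  when 'application/xml', 'text/xml' then :xml
  when 'application/pdf'             then :pdf
  when 'text/plain'                  then :plain
  else :unknown
  end
end

puts parser_for('application/pdf')
puts parser_for('text/xml; charset=utf-8')
puts parser_for('image/png')
```

Falling back to `:unknown` rather than raising lets the caller decide whether an unrecognised type should be skipped or reported.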

#### Extract text from PDF

`Textminer.extract()` extracts text from a PDF, given the path to the PDF file.

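```ruby
# find articles with full text and collect their PDF links
res = Textminer.search(member: 2258, filter: {has_full_text: true})
links = res.links_pdf(true)

# download the first PDF, then extract its text from the saved file
pdf = Textminer.fetch(url: links[0])
Textminer.extract(pdf.path)
```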

### On the CLI

Coming soon...

## To do

* CLI executable

* better test suite

* better documentation

[changelog]: https://github.com/sckott/textminer/blob/master/CHANGELOG.md
