{"id":15555304,"url":"https://github.com/sckott/textminer","last_synced_at":"2025-10-16T00:59:33.854Z","repository":{"id":36879991,"uuid":"41186975","full_name":"sckott/textminer","owner":"sckott","description":"text mine via Crossref's TDM","archived":false,"fork":false,"pushed_at":"2017-08-17T16:52:19.000Z","size":41,"stargazers_count":4,"open_issues_count":6,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-18T18:29:28.967Z","etag":null,"topics":["crossref","literature","text-mining"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sckott.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-08-22T02:48:40.000Z","updated_at":"2017-08-17T16:46:35.000Z","dependencies_parsed_at":"2022-09-11T22:12:07.452Z","dependency_job_id":null,"html_url":"https://github.com/sckott/textminer","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sckott%2Ftextminer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sckott%2Ftextminer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sckott%2Ftextminer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sckott%2Ftextminer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sckott","download_url":"https://codeload.github.com/sckott/textminer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250504085,"
owners_count":21441527,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crossref","literature","text-mining"],"created_at":"2024-10-02T15:08:10.459Z","updated_at":"2025-10-16T00:59:33.801Z","avatar_url":"https://github.com/sckott.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"textminer\n=========\n\n[![gem version](https://img.shields.io/gem/v/textminer.svg)](https://rubygems.org/gems/textminer)\n[![Build Status](https://travis-ci.org/sckott/textminer.svg?branch=master)](https://travis-ci.org/sckott/textminer)\n[![codecov.io](http://codecov.io/github/sckott/textminer/coverage.svg?branch=master)](http://codecov.io/github/sckott/textminer?branch=master)\n\n__`textminer` helps you text mine through Crossref's TDM (Text \u0026 Data Mining) services:__\n\n## Changes\n\nFor changes see the [CHANGELOG][changelog]\n\n## gem API\n\n* `Textminer.search` - search by DOI, query string, filters, etc. to get Crossref metadata, which you can use downstream to get full text links. 
This method essentially wraps `Serrano.works()`, but exposes only a subset of its params; this interface may change depending on feedback.\n* `Textminer.fetch` - fetch full text given a URL; supports Crossref's Text and Data Mining service\n* `Textminer.extract` - extract text from a PDF\n\n## Install\n\n### Release version\n\n```\ngem install textminer\n```\n\n### Development version\n\n```\ngit clone git@github.com:sckott/textminer.git\ncd textminer\nrake install\n```\n\n## Examples\n\n### Within Ruby\n\n#### Search\n\nSearch by DOI\n\n```ruby\nrequire 'textminer'\n# link to full text available\nTextminer.search(doi: '10.7554/elife.06430')\n# no link to full text available\nTextminer.search(doi: \"10.1371/journal.pone.0000308\")\n```\n\nMany DOIs at once\n\n```ruby\nrequire 'serrano'\ndois = Serrano.random_dois(sample: 6)\nTextminer.search(doi: dois)\n```\n\nSearch with filters\n\n```ruby\nTextminer.search(filter: {has_full_text: true})\n```\n\n#### Get full text links\n\nThe object returned from `Textminer.search` has methods for pulling out all links, XML only, PDF only, or plain text only\n\n```ruby\nx = Textminer.search(filter: {has_full_text: true})\nx.links_xml\nx.links_pdf\nx.links_plain\n```\n\n#### Fetch full text\n\n`Textminer.fetch()` gets full text based on URL input. 
We determine how to pull down and parse the content based on content type.\n\n```ruby\n# get some metadata\nres = Textminer.search(member: 2258, filter: {has_full_text: true});\n# get links\nlinks = res.links_xml(true);\n# Get full text for an article\nres = Textminer.fetch(url: links[0]);\n# url\nres.url\n# file path\nres.path\n# content type\nres.type\n# parse content\nres.parse\n```\n\n#### Extract text from PDF\n\n`Textminer.extract()` extracts text from a pdf, given a path for a pdf\n\n```ruby\nres = Textminer.search(member: 2258, filter: {has_full_text: true});\nlinks = res.links_pdf(true);\nres = Textminer.fetch(url: links[0]);\nTextminer.extract(res.path)\n```\n\n### On the CLI\n\nComing soon...\n\n## To do\n\n* CLI executable\n* better test suite\n* better documentation\n\n[changelog]: https://github.com/sckott/textminer/blob/master/CHANGELOG.md\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsckott%2Ftextminer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsckott%2Ftextminer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsckott%2Ftextminer/lists"}