https://github.com/molybdenum-99/infoboxer
Wikipedia information extraction library
data-extraction mediawiki wikipedia
- Host: GitHub
- URL: https://github.com/molybdenum-99/infoboxer
- Owner: molybdenum-99
- License: mit
- Created: 2015-06-15T20:16:55.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2024-03-01T16:54:22.000Z (11 months ago)
- Last Synced: 2025-01-18T11:11:27.131Z (9 days ago)
- Topics: data-extraction, mediawiki, wikipedia
- Language: Ruby
- Size: 8.17 MB
- Stars: 174
- Watchers: 11
- Forks: 16
- Open Issues: 54
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt
# Infoboxer
[![Gem Version](https://badge.fury.io/rb/infoboxer.svg)](http://badge.fury.io/rb/infoboxer)
![Build Status](https://github.com/molybdenum-99/infoboxer/workflows/CI/badge.svg?branch=master)
[![Coverage Status](https://coveralls.io/repos/molybdenum-99/infoboxer/badge.svg?branch=master&service=github)](https://coveralls.io/github/molybdenum-99/infoboxer?branch=master)
[![Code Climate](https://codeclimate.com/github/molybdenum-99/infoboxer/badges/gpa.svg)](https://codeclimate.com/github/molybdenum-99/infoboxer)
[![Infoboxer Gitter](https://badges.gitter.im/molybdenum-99/infoboxer.svg)](https://gitter.im/molybdenum-99/infoboxer)

**Infoboxer** is a pure-Ruby Wikipedia (and generic MediaWiki) client and
parser, targeting information extraction (hence the name).

It can be useful in tasks like:

* get a plaintext abstract of an article (paragraphs before the first heading);
* get structured data variables from a page's **infobox**;
* list a page's sections and count paragraphs, images and tables in them;
* convert some huge "comparison table" to data;
* and much, much more!

The whole idea is: you can have any Wikipedia page as a parsed tree with
an obvious structure, you can navigate that tree easily, and you have a
bunch of high-level helper methods, so typical information extraction
tasks should be super-easy, one-liners in the best cases.

_(For those already thinking "Why should you do this, we already have
DBpedia?" -- please read the "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
page in our wiki.)_

## Showcase
```ruby
Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#, "EpisodeNumber2"=>#, "Title"=>#, "ShortSummary"=>#},
# {"EpisodeNumber"=>#, "EpisodeNumber2"=>#, "Title"=>#, "ShortSummary"=>#},
# ...and so on
```

Do you _feel_ it now?

You can also take a look at the [Showcase](https://github.com/molybdenum-99/infoboxer/wiki/Showcase).
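
To make the shape of the showcase result easier to picture, here is a minimal plain-Ruby sketch of the idea behind "filter templates by name, then pull selected variables into hashes". `ToyTemplate` and `fetch_hash` are hypothetical names invented for this illustration, and the data is toy data; this is not Infoboxer's implementation and it does not need the gem.

```ruby
# Hypothetical stand-in for a parsed template: a name plus named variables.
# This is NOT Infoboxer's Template class, only an illustration.
ToyTemplate = Struct.new(:name, :variables) do
  # Return only the requested variables, in the requested order.
  def fetch_hash(*names)
    names.each_with_object({}) { |n, h| h[n] = variables[n] }
  end
end

# Toy data standing in for templates found on a page.
templates = [
  ToyTemplate.new('Episode list', { 'EpisodeNumber' => '1', 'Title' => 'Pilot' }),
  ToyTemplate.new('Episode list', { 'EpisodeNumber' => '2', 'Title' => "Cat's in the Bag..." }),
  ToyTemplate.new('Navbox', { 'title' => 'ignored' })
]

# Keep templates whose name matches, then extract one hash per template.
rows = templates
  .select { |t| t.name =~ /^Episode list/ }
  .map { |t| t.fetch_hash('EpisodeNumber', 'Title') }

rows.each { |row| puts row.inspect }
```

In the real library the values are variable nodes rather than plain strings, so they can be navigated further.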
## Usage
### Install gem
Install it as usual: `gem 'infoboxer'` in your Gemfile, then `bundle install`.
Or just `[sudo] gem install infoboxer` if you prefer.
### Grab the page
```ruby
# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')

# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')

# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')
```

See more examples and options at [Retrieving pages](https://github.com/molybdenum-99/infoboxer/wiki/Retrieving%20pages).
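
For orientation, any MediaWiki site exposes page source through its Action API, and the sketch below builds such a request URL with the standard library only. It is an illustration of the kind of call a MediaWiki client makes, not a description of Infoboxer's internals: the helper name `raw_wikitext_url` is hypothetical, and the exact parameters Infoboxer sends may differ.

```ruby
require 'uri'

# Build a MediaWiki Action API URL asking for the raw wikitext of one page.
# `api_base` is the wiki's api.php endpoint.
# Hypothetical helper for illustration; Infoboxer's own request may differ.
def raw_wikitext_url(api_base, title)
  uri = URI(api_base)
  uri.query = URI.encode_www_form(
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json'
  )
  uri.to_s
end

url = raw_wikitext_url('https://en.wikipedia.org/w/api.php', 'Argentina')
puts url
```

No network request is made here; the snippet only shows how the page title and wiki endpoint end up in the query.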
### Play with page
Basically, a page is a tree of [Nodes](https://github.com/molybdenum-99/infoboxer/wiki/Nodes); you can think of it as a kind of
[DOM](https://en.wikipedia.org/wiki/Document_Object_Model).

So, you can navigate it:

```ruby
# Simple traversing and inspect
node = page.children.first.children.first
node.to_tree
node.to_text

# Various lookups
page.lookup(:Template, name: /^Infobox/)
```

See [Tree navigation basics](https://github.com/molybdenum-99/infoboxer/wiki/Tree-navigation-basics).
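
If the DOM analogy is unfamiliar, the following plain-Ruby sketch shows what a recursive lookup by node type and attribute pattern amounts to. `ToyNode` and its `lookup` are hypothetical and written only for this illustration; Infoboxer's real node classes and lookup options are richer, so treat this as a picture of the idea under those assumptions.

```ruby
# Hypothetical toy node; NOT one of Infoboxer's node classes.
ToyNode = Struct.new(:type, :attrs, :children) do
  # Depth-first search for descendants of a given type whose attributes
  # match every condition (compared with ===, so regexps work).
  def lookup(wanted_type, **conditions)
    found = []
    children.each do |child|
      matches = child.type == wanted_type &&
                conditions.all? { |key, pattern| pattern === child.attrs[key] }
      found << child if matches
      found.concat(child.lookup(wanted_type, **conditions))
    end
    found
  end
end

# A tiny tree: a page holding one infobox and a paragraph with a citation.
infobox  = ToyNode.new(:Template, { name: 'Infobox country' }, [])
cite     = ToyNode.new(:Template, { name: 'Cite web' }, [])
para     = ToyNode.new(:Paragraph, {}, [cite])
toy_page = ToyNode.new(:Page, {}, [infobox, para])

all_templates = toy_page.lookup(:Template).map { |n| n.attrs[:name] }
# => ["Infobox country", "Cite web"]
infoboxes = toy_page.lookup(:Template, name: /^Infobox/).map { |n| n.attrs[:name] }
# => ["Infobox country"]
```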
On top of the basic navigation, Infoboxer adds some useful shortcuts
for convenience and brevity, which allow things like this:

```ruby
page.section('Episodes').tables.first
```

See [Navigation shortcuts](https://github.com/molybdenum-99/infoboxer/wiki/Navigation-shortcuts).

To put it all in one piece, also take a look at [Data extraction tips and tricks](https://github.com/molybdenum-99/infoboxer/wiki/Tips-and-tricks).
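
As a concrete picture of one task from the introduction, the plaintext abstract (paragraphs before the first heading), here is a rough sketch that works on a raw wikitext string using only core Ruby. The helper `toy_abstract` and the sample text are made up for illustration; it does not strip templates, links or references, which is exactly the work a real parser such as Infoboxer takes care of.

```ruby
# Toy wikitext; real pages also contain templates, links and references
# that this sketch does not handle.
wikitext = <<~WIKI
  Argentina is a country in South America.

  It is the second-largest country in South America.

  == History ==
  Text of the history section.
WIKI

# Keep everything before the first "== Heading ==" line, then split
# the remainder into paragraphs on blank lines.
# Hypothetical helper, not part of Infoboxer.
def toy_abstract(text)
  intro = text.split(/^==.*==\s*$/, 2).first.to_s
  intro.split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

paragraphs = toy_abstract(wikitext)
puts paragraphs
```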
### infoboxer executable
Just try the `infoboxer` command.

Without any options, it starts an IRB session with infoboxer required and
included into the main namespace.

With the `-w` option, it provides a shortcut to the MediaWiki instance you want.
Like this:

```
$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
=> #