https://github.com/molybdenum-99/infoboxer

Wikipedia information extraction library
data-extraction mediawiki wikipedia

Last synced: 7 months ago
Host: GitHub
URL: https://github.com/molybdenum-99/infoboxer
Owner: molybdenum-99
License: mit
Created: 2015-06-15T20:16:55.000Z (over 10 years ago)
Default Branch: master
Last Pushed: 2024-03-01T16:54:22.000Z (over 1 year ago)
Last Synced: 2025-03-29T00:08:21.658Z (7 months ago)
Topics: data-extraction, mediawiki, wikipedia
Language: Ruby
Size: 8.17 MB
Stars: 175
Watchers: 10
Forks: 13
Open Issues: 54
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.txt

# Infoboxer

[![Gem Version](https://badge.fury.io/rb/infoboxer.svg)](http://badge.fury.io/rb/infoboxer)

![Build Status](https://github.com/molybdenum-99/infoboxer/workflows/CI/badge.svg?branch=master)

[![Coverage Status](https://coveralls.io/repos/molybdenum-99/infoboxer/badge.svg?branch=master&service=github)](https://coveralls.io/github/molybdenum-99/infoboxer?branch=master)

[![Code Climate](https://codeclimate.com/github/molybdenum-99/infoboxer/badges/gpa.svg)](https://codeclimate.com/github/molybdenum-99/infoboxer)

[![Infoboxer Gitter](https://badges.gitter.im/molybdenum-99/infoboxer.svg)](https://gitter.im/molybdenum-99/infoboxer)

**Infoboxer** is a pure-Ruby Wikipedia (and generic MediaWiki) client and
parser, targeting information extraction (hence the name).

It can be useful in tasks like:

* getting a plaintext abstract of an article (the paragraphs before the first heading);
* getting structured data variables from a page's **infobox**;
* listing a page's sections and counting the paragraphs, images and tables in them;
* converting some huge "comparison table" to data;
* and much, much more!

The whole idea is: you can have any Wikipedia page as a parsed tree with
an obvious structure, you can navigate that tree easily, and you have a
bunch of high-level helper methods, so typical information extraction
tasks should be super-easy, one-liners in the best cases.
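
To make the first task in the list above concrete, here is a minimal pure-Ruby sketch of what "paragraphs before the first heading" means. It is a toy that works on raw wikitext with regular expressions and is not how Infoboxer does it; `SAMPLE_WIKITEXT` and `abstract_paragraphs` are hypothetical names invented for this illustration.

```ruby
# Toy illustration only: plain string handling on raw wikitext, no real parsing.
SAMPLE_WIKITEXT = <<~WIKI
  Argentina is a country in South America.

  Its capital is Buenos Aires.

  == History ==

  This paragraph belongs to a section, not to the abstract.
WIKI

# Returns the paragraphs that appear before the first "== Heading ==" line.
def abstract_paragraphs(wikitext)
  # Everything up to the first heading line is the introduction.
  intro = wikitext.split(/^==.*==[ \t]*$/, 2).first.to_s
  # Paragraphs are separated by one or more blank lines.
  intro.split(/\n{2,}/).map(&:strip).reject(&:empty?)
end

puts abstract_paragraphs(SAMPLE_WIKITEXT)
```

Real wikitext also contains templates, links and markup inside those paragraphs, which is why a parsed tree is more practical than regular expressions.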

_(For those already thinking "Why should you do this, we already have
DBPedia?": please read the "[Reasons](https://github.com/molybdenum-99/infoboxer/wiki/Reasons)"
page in our wiki.)_

## Showcase

```ruby
Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#, "EpisodeNumber2"=>#, "Title"=>#, "ShortSummary"=>#},
#     {"EpisodeNumber"=>#, "EpisodeNumber2"=>#, "Title"=>#, "ShortSummary"=>#},
#     ...and so on
```

Do you _feel_ it now?

You can also take a look at the [Showcase](https://github.com/molybdenum-99/infoboxer/wiki/Showcase).

## Usage

### Install gem

Install it as usual: `gem 'infoboxer'` in your Gemfile, then `bundle install`.

Or just `[sudo] gem install infoboxer` if you prefer.

### Grab the page

```ruby
# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')

# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')

# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')
```

See more examples and options at [Retrieving pages](https://github.com/molybdenum-99/infoboxer/wiki/Retrieving%20pages).
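
For orientation, "any wiki with the same engine" means any site that exposes the MediaWiki `api.php` endpoint. The sketch below only builds the kind of query URL that asks such an endpoint for a page's wikitext, using Ruby's standard library. The helper name `page_source_url` is hypothetical, and this is an assumption about what a request could look like, not a description of Infoboxer's internals.

```ruby
require 'uri'

# Hypothetical helper, not part of Infoboxer: builds a MediaWiki API URL
# that asks for the current wikitext of a single page.
def page_source_url(api_base, title)
  uri = URI(api_base)
  uri.query = URI.encode_www_form(
    action: 'query',     # generic "read data" module of the MediaWiki API
    prop: 'revisions',   # we want revision data...
    rvprop: 'content',   # ...specifically the revision text
    titles: title,
    format: 'json'
  )
  uri.to_s
end

puts page_source_url('https://en.wikipedia.org/w/api.php', 'Argentina')
```

Nothing is fetched here; passing the resulting URL to an HTTP client would return JSON containing the page source.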

### Play with page

Basically, a page is a tree of [Nodes](https://github.com/molybdenum-99/infoboxer/wiki/Nodes); you can think of it as a kind of
[DOM](https://en.wikipedia.org/wiki/Document_Object_Model).

So, you can navigate it:

```ruby
# Simple traversing and inspect
node = page.children.first.children.first
node.to_tree
node.to_text

# Various lookups
page.lookup(:Template, name: /^Infobox/)
```

See [Tree navigation basics](https://github.com/molybdenum-99/infoboxer/wiki/Tree-navigation-basics).
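
If the DOM analogy is unfamiliar, the following self-contained toy shows the general idea of a recursive lookup over a node tree. The `Node` struct and its `lookup` method are hypothetical stand-ins written for this illustration; Infoboxer's real node classes and selectors are richer and may behave differently.

```ruby
# Hypothetical stand-in for a parsed page, NOT Infoboxer's actual classes.
Node = Struct.new(:type, :name, :children) do
  # Depth-first search for descendants of the given type whose name
  # matches the optional pattern (a String or a Regexp).
  def lookup(type, name: nil)
    children.flat_map do |child|
      hit = child.type == type && (name.nil? || name === child.name.to_s)
      (hit ? [child] : []) + child.lookup(type, name: name)
    end
  end
end

page = Node.new(:Page, 'Argentina', [
  Node.new(:Template, 'Infobox country', []),
  Node.new(:Paragraph, nil, [
    Node.new(:Template, 'convert', [])
  ])
])

puts page.lookup(:Template, name: /^Infobox/).map(&:name)
```

The search descends into every child, so a template nested inside a paragraph is found the same way as a top-level one.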

On top of the basic navigation, Infoboxer adds some useful shortcuts
for convenience and brevity, which allow things like this:

```ruby
page.section('Episodes').tables.first
```

See [Navigation shortcuts](https://github.com/molybdenum-99/infoboxer/wiki/Navigation-shortcuts)

To put it all in one piece, also take a look at [Data extraction tips and tricks](https://github.com/molybdenum-99/infoboxer/wiki/Tips-and-tricks).

### infoboxer executable

Just try the `infoboxer` command.

Without any options, it starts an IRB session with Infoboxer required and
included into the main namespace.

With the `-w` option, it provides a shortcut to the MediaWiki instance you want,
like this:

```
$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
 => #
```