https://github.com/dbpedia/list-extractor

Extract Data from Wikipedia Lists

Host: GitHub
URL: https://github.com/dbpedia/list-extractor
Owner: dbpedia
License: gpl-3.0
Created: 2016-04-28T08:04:43.000Z (over 9 years ago)
Default Branch: master
Last Pushed: 2017-08-27T11:42:15.000Z (about 8 years ago)
Last Synced: 2024-08-14T07:08:02.760Z (about 1 year ago)
Language: Python
Size: 115 MB
Stars: 30
Watchers: 12
Forks: 11
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# List-extractor - Extract Data from Wikipedia Lists

List-Extractor is a tool that can *extract information from wikipedia lists and form appropriate RDF triples from the list data.*

#### [GSoC'16 Detailed Progress available here](https://github.com/dbpedia/extraction-framework/wiki/GSoC_2016_Progress_Federica)
#### [Final commit of GSoC'16 can be found here](https://github.com/dbpedia/list-extractor/tree/55abff51634324bb657f531fe2e3bb699dfada74)
#### [GSoC'17 detailed progress available here](https://github.com/dbpedia/list-extractor/wiki/GSoC-2017:-Krishanu-Konar-progress)
#### [List-Extractor wiki available here](https://github.com/dbpedia/list-extractor/wiki)
#### [GSoC'17 Final results and challenges available here](https://github.com/dbpedia/list-extractor/wiki/GSoC-2017:-Krishanu-Konar-progress#results)

## How to run the tools

This project contains two different tools: `List-Extractor` and `Rules-Generator`.
Use `rulesGenerator.py` first to generate the desired rules, then use `listExtractor.py` to extract triples for wiki resources.
Alternatively, you can use `listExtractor.py` alone and extract with the existing default settings.

For more details, refer to the documentation in the `docs` folder. The sample generated datasets can be found **[here](https://drive.google.com/open?id=0BzDWYUiB6LUTYzdFU19BX2lUMjA).** Some example triples for different domains are in the `extracted` folder.

### List-Extractor:

`python listExtractor.py [collect_mode] [source] [language] [-c class_name]`

* `collect_mode` : `s` or `a`
    * use `s` to specify a single resource or `a` for a class of resources in the next parameter.

* `source`: a string representing a class of resources from DBpedia ontology (find supported domains below), or a single Wikipedia page of an actor/writer.

* `language`: `en`, `it`, `de` etc. (for now, available only for some languages, for selected domains)
    * a two-letter prefix corresponding to the desired language of Wikipedia pages and SPARQL endpoint to be queried.

* `-c --classname`: a string representing classnames you want to associate your resource with. Applicable only for `collect_mode="s"`.

**NOTE:** While extracting triples from multiple resources in a domain (`collect_mode = a`), using `Ctrl + C` will skip the current resource and move on to the next resource. To quit the extractor, use `Ctrl + \`.
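
The skip-versus-quit behaviour in the note can be pictured with a small sketch. This is not the extractor's actual code: `process_all` and `extract_one` are hypothetical names used only for illustration. It assumes the common Python pattern of catching `KeyboardInterrupt` (raised by `Ctrl + C`) inside the per-resource loop.

```python
def process_all(resources, extract_one):
    """Run extract_one on every resource; Ctrl+C skips only the current one.

    Both names are hypothetical, chosen for this illustration.
    """
    done, skipped = [], []
    for resource in resources:
        try:
            extract_one(resource)
            done.append(resource)
        except KeyboardInterrupt:
            # Ctrl+C raises KeyboardInterrupt; catching it here abandons
            # the current resource and lets the loop continue with the next.
            skipped.append(resource)
    return done, skipped
```

`Ctrl + \` sends `SIGQUIT` on POSIX systems. Python installs no handler for it by default, so it terminates the process instead of being caught by a loop like this one.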

## Examples:

* `python listExtractor.py a Writer it`
* `python listExtractor.py s William_Gibson en` : Uses the default inbuilt mapper-functions
* `python listExtractor.py s William_Gibson en -c CUSTOM_WRITER` : Uses the `CUSTOM_WRITER` mapping only to extract list elements.
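
When the extractor is driven from another script, the command lines above can be assembled programmatically. The helper below is a minimal sketch: `build_command` is a hypothetical name, not part of this project. It only encodes the documented rules, namely that `collect_mode` is `s` or `a` and that `-c` applies only to single-resource mode.

```python
def build_command(collect_mode, source, language, class_name=None):
    """Return the listExtractor.py invocation as an argument list.

    Hypothetical helper; it mirrors the usage line documented above.
    """
    if collect_mode not in ("s", "a"):
        raise ValueError("collect_mode must be 's' or 'a'")
    if class_name is not None and collect_mode != "s":
        # The README states that -c is applicable only for collect_mode "s".
        raise ValueError("-c is only applicable when collect_mode is 's'")
    cmd = ["python", "listExtractor.py", collect_mode, source, language]
    if class_name is not None:
        cmd += ["-c", class_name]
    return cmd


print(" ".join(build_command("s", "William_Gibson", "en", "CUSTOM_WRITER")))
# prints: python listExtractor.py s William_Gibson en -c CUSTOM_WRITER
```

The returned list can be passed to `subprocess.call` from the directory that contains `listExtractor.py`.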

If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called `extracted`.
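
To inspect the output, the generated files can be loaded back with RDFlib, which is already listed under Requirements. The following is a minimal sketch. It assumes the `.ttl` files sit directly inside `extracted`; `find_ttl_files` and `load_triples` are hypothetical helper names, not part of this project.

```python
import glob
import os


def find_ttl_files(directory="extracted"):
    """Return the sorted paths of .ttl files directly inside `directory`."""
    return sorted(glob.glob(os.path.join(directory, "*.ttl")))


def load_triples(path):
    """Parse one Turtle file with RDFlib and return the resulting graph."""
    from rdflib import Graph  # imported here so listing files works without RDFlib
    graph = Graph()
    graph.parse(path, format="turtle")
    return graph


if __name__ == "__main__":
    for ttl_path in find_ttl_files():
        triples = load_triples(ttl_path)
        print("%s: %d triples" % (ttl_path, len(triples)))
```

Run from the project root, this prints one line per output file with its triple count. Nothing is printed if no `.ttl` file has been generated yet.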

### Rules-Generator:

`python rulesGenerator.py`

* This is an interactive tool. Select options from the menu to use the rules generator.
* When creating new mapping rules or mapper functions, follow the format the tool asks for.
* After a successful addition or modification, the tool updates `settings.json` and `custom_mapper.json` so that the new user-defined rules and functions can run with the extractor.

## Default Mapped Domains:

* English (`en`):
    * **Person**: `Writer`, `Actor`, `MusicalArtist`, `Athlete`, `Politician`, `Manager`, `Coach`, `Celebrity` etc.
    * **EducationalInstitution**: `University`, `School`, `College`, `Library`
    * **PeriodicalLiterature**: `Magazines`, `Newspapers`, `AcademicJournals`
    * **Group**: `Band`

* Other (`it`, `de`, `es`):
    * `Writer`, `Actor`, `MusicalArtist`

* More domains can be added using the `rulesGenerator.py` tool.

### Attributions for 3rd party tools:

This project uses 2 other existing open source projects.

* **JSONpedia**, a framework designed to simplify access to MediaWiki content by transforming everything into JSON. The framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.

The software is copyrighted by Michele Mostarda (me@michelemostarda.it) and released under the Apache 2 License.
Link : [JSONpedia](https://bitbucket.org/hardest/jsonpedia)

* **JCommander**, a very small Java framework that makes it trivial to parse command line parameters.

Contact Cédric Beust (cedric@beust.com) for more information. Released under Apache 2 License.
Link : [JCommander](https://github.com/cbeust/jcommander)

### Requirements
* [Python 2.7](https://www.python.org/download/releases/2.7/)
* [RDFlib library](http://rdflib.readthedocs.io/en/stable/gettingstarted.html)
* Stable internet connection
