Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dbpedia/list-extractor
Extract Data from Wikipedia Lists
https://github.com/dbpedia/list-extractor
Last synced: 8 days ago
JSON representation
Extract Data from Wikipedia Lists
- Host: GitHub
- URL: https://github.com/dbpedia/list-extractor
- Owner: dbpedia
- License: gpl-3.0
- Created: 2016-04-28T08:04:43.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-08-27T11:42:15.000Z (over 7 years ago)
- Last Synced: 2024-08-14T07:08:02.760Z (4 months ago)
- Language: Python
- Size: 115 MB
- Stars: 30
- Watchers: 12
- Forks: 11
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - dbpedia/list-extractor - Extract Data from Wikipedia Lists (Python)
README
# List-extractor - Extract Data from Wikipedia Lists
List-Extractor is a tool that can *extract information from wikipedia lists and form appropriate RDF triples from the list data.*
#### [GSoC'16 Detailed Progress available here](https://github.com/dbpedia/extraction-framework/wiki/GSoC_2016_Progress_Federica)
#### [Final commit of GSoC'16 can be found here](https://github.com/dbpedia/list-extractor/tree/55abff51634324bb657f531fe2e3bb699dfada74)
#### [GSoC'17 Work's detailed progress available here](https://github.com/dbpedia/list-extractor/wiki/GSoC-2017:-Krishanu-Konar-progress)
#### [List-Extractor wiki available here](https://github.com/dbpedia/list-extractor/wiki)
#### [GSoC'17 Final results and challenges available here](https://github.com/dbpedia/list-extractor/wiki/GSoC-2017:-Krishanu-Konar-progress#results)## How to run the tools
This project contains 2 differnt tools: `List-Extractor` and `Rules-Generator`.
Use `rulesGenerator.py` first to generate desired rules, and then use `listExtractor.py` to extract triples for wiki resources.
Alternatively, you can use only `listExtractor.py` and extract with existing default settings.For more details, refer to the documentation present in the `docs` folder. The sample generated datasets can be found **[here](https://drive.google.com/open?id=0BzDWYUiB6LUTYzdFU19BX2lUMjA).** Some example triples for different domains are present in `extracted` folder.
### List-Extractor:
`python listExtractor.py [collect_mode] [source] [language] [-c class_name]`
* `collect_mode` : `s` or `a`
* use `s` to specify a single resource or `a` for a class of resources in the next parameter.
* `source`: a string representing a class of resources from DBpedia ontology (find supported domains below), or a single Wikipedia page of an actor/writer.
* `language`: `en`, `it`, `de` etc. (for now, available only for some languages, for selected domains)
* a two-letter prefix corresponding to the desired language of Wikipedia pages and SPARQL endpoint to be queried.
* `-c --classname`: a string representing classnames you want to associate your resource with. Applicable only for `collect_mode="s"`.
**NOTE:** While extracting triples from multiple resources in a domain (`collect_mode = a`), using `Ctrl + C` will skip the current resource and move on to the next resource. To quit the extractor, use `Ctrl + \`.
## Examples:
* `python listExtractor.py a Writer it`
* `python listExtractor.py s William_Gibson en` : Uses the default inbuilt mapper-functions
* `python listExtractor.py s William_Gibson en -c CUSTOM_WRITER` : Uses the `CUSTOM_WRITER` mapping only to extract list elements.If successful, a .ttl file containing RDF statements about the specified source is created inside a subdirectory called `extracted`.
### Rules-Generator:
`python rulesGenerator.py`
* This is an interactive tool, select the options given in the menu for using the rules generator.
* While creating new mapping rules or mapper functions, make sure to follow the required format as suggested by the tool.
* Upon successful addition/modification, it will update the `settings.json` and `custom_mapper.json` so that the new user defined rules/functions can run with extractor.## Default Mapped Domains:
* English (`en`):
* **Person**: `Writer`, `Actor`, `MusicalArtist`, `Athelete`, `Polititcian`, `Manager`, `Coach`, `Celebrity` etc.
* **EducationalInstitution**: `University`, `School`, `College`, `Library`
* **PeriodicalLiterature**: `Magazines`, `Newspapers`, `AcademicJournals`
* **Group**: `Band`* Other (`it`, `de`, `es`):
* `Writer`, `Actor`, `MusicalArtist`* More Domains can be added using the `rulesGenerator.py` tool.
### Attributions for 3rd party tools:
This project uses 2 other existing open source projects.
* **JSONpedia**, a framework designed to simplify access at MediaWiki contents transforming everything into JSON. Such framework provides a library, a REST service and CLI tools to parse, convert, enrich and store WikiText documents.
The software is copyright of Michele Mostarda ([email protected]) and released under Apache 2 License.
Link : [JSONpedia](https://bitbucket.org/hardest/jsonpedia)* **JCommander**, a very small Java framework that makes it trivial to parse command line parameters.
Contact Cédric Beust ([email protected]) for more information. Released under Apache 2 License.
Link : [JCommander](https://github.com/cbeust/jcommander)### Requirements
* [Python 2.7](https://www.python.org/download/releases/2.7/)
* [RDFlib library](http://rdflib.readthedocs.io/en/stable/gettingstarted.html)
* Stable internet connection