https://github.com/ironholds/mwparser
A parser for Wikimarkup
parser r wikimarkup wikimedia wikipedia
- Host: GitHub
- URL: https://github.com/ironholds/mwparser
- Owner: Ironholds
- License: other
- Created: 2017-05-25T19:12:35.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-08-02T01:51:30.000Z (almost 9 years ago)
- Last Synced: 2025-03-31T19:21:19.550Z (about 1 year ago)
- Topics: parser, r, wikimarkup, wikimedia, wikipedia
- Language: R
- Size: 29.3 KB
- Stars: 7
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Wikimarkup parsing in R
A package for parsing, chunking and modifying wikimarkup in R.
__Author:__ Oliver Keyes
__License:__ [MIT](http://opensource.org/licenses/MIT)
__Status:__ In development
### Description
Wikimarkup is the language used on Wikipedia and similar projects, and as such contains
a lot of valuable data both for scientists studying collaborative systems and people
studying things documented on or in Wikipedia. `mwparser` parses wikimarkup, allowing a
user to filter down to specific types of tags such as links or templates, and then extract components of those tags.
### Example
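```
library(mwparser)
library(magrittr)

# A small piece of wikitext containing two wikilinks
wikitext <- "this is wikitext with \n [[a|link]] [[or|two]]"

# Parse the markup, keep only the wikilinks, then extract their paths
link_paths <- parse_wikitext(wikitext) %>%
  get_wikilinks %>%
  wikilink_paths(text = TRUE)

link_paths
#> [1] "a" "or"
```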
### Installation
`mwparser` depends on two things: the [reticulate](https://rstudio.github.io/reticulate/) R package and the Python library [mwparserfromhell](https://github.com/earwig/mwparserfromhell). To install the whole stack, assuming you have `pip`:
```
# In the terminal
pip install mwparserfromhell

# In R (devtools is only needed for install_github())
install.packages(c("reticulate", "devtools"))
devtools::install_github("ropenscilabs/mwparser")
```
With that, you're good to go!
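If loading the package fails, it can help to confirm that reticulate can actually see the Python module. The sketch below is one way to check, assuming reticulate is pointed at the same Python that `pip` installed into. `check_mwparser_deps` is a hypothetical helper name, not part of `mwparser`; it relies only on `reticulate::py_module_available()`.

```r
# Hypothetical helper (not part of mwparser): returns TRUE when both the
# reticulate R package and the mwparserfromhell Python module can be found.
check_mwparser_deps <- function() {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    message("reticulate is missing; try install.packages(\"reticulate\")")
    return(FALSE)
  }
  found <- reticulate::py_module_available("mwparserfromhell")
  if (!found) {
    message("mwparserfromhell not found; try `pip install mwparserfromhell`")
  }
  found
}

check_mwparser_deps()
```

If this returns `FALSE` even though `pip install` succeeded, reticulate may be bound to a different Python installation than the one `pip` used.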
### Future work
The library currently provides accessors for the most common types of tag (such as links and templates) and for the components within them. The next step is exposing the rest of `mwparserfromhell`'s functionality, which includes:
1. More accessors;
2. The ability to modify wikimarkup pages and their component elements;
3. The ability to write out the resulting, modified markup.
Some time after that, the goal is to integrate MediaWiki's actual parser as a replacement for the `mwparserfromhell` dependency, using [piton](https://github.com/Ironholds/piton).
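Until modification and write-out (items 2 and 3 above) are exposed, one possible workaround is to call `mwparserfromhell` directly through reticulate. The following is a minimal sketch under that assumption, not `mwparser`'s own API: `retarget_first_link` is a hypothetical helper, and it uses only `mwparserfromhell`'s `parse()` and `filter_wikilinks()` together with reticulate's `import()`, `py_set_attr()` and `py_str()`.

```r
library(reticulate)

# Hypothetical helper (not part of mwparser): point the first wikilink in
# `text` at `new_title` and return the modified markup as an R string.
retarget_first_link <- function(text, new_title) {
  mwp <- import("mwparserfromhell")       # the Python library itself
  code <- mwp$parse(text)                 # Python Wikicode object
  links <- code$filter_wikilinks()        # list of Python Wikilink nodes
  if (length(links) == 0) return(text)    # nothing to modify
  # Setting the attribute mutates the node inside `code` in place
  py_set_attr(links[[1]], "title", new_title)
  py_str(code)                            # write the markup back out
}

# Only run the example when the Python dependency is available
if (py_module_available("mwparserfromhell")) {
  retarget_first_link("see [[a|link]] here", "b")
}
```

Working on the Python objects directly keeps the edit and the serialisation in one place, at the cost of bypassing whatever validation `mwparser` may add once it wraps this functionality.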