Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aajanki/wiki-acronyms
Find backronyms in Wikipedia
https://github.com/aajanki/wiki-acronyms
Last synced: about 2 months ago
JSON representation
Find backronyms in Wikipedia
- Host: GitHub
- URL: https://github.com/aajanki/wiki-acronyms
- Owner: aajanki
- License: other
- Created: 2015-07-30T18:10:37.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2015-07-30T18:11:26.000Z (over 9 years ago)
- Last Synced: 2024-04-24T03:01:45.596Z (9 months ago)
- Language: Scala
- Size: 5.85 MB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Introduction
This script mines Wikipedia for
[backronyms](https://en.wikipedia.org/wiki/Backronym) or word
sequences where the consecutive words start with the given letters.
Often the results are quite funny.For example, backronyms for LOL are word triplets with words starting
with the letters L, O and L, such as "loss of life" and "Lady of
Lourdes".# Installing
## Required software
* [sbt](http://www.scala-sbt.org/)
* [Apache Spark](https://spark.apache.org/)
* [wikipedia-extractor](https://github.com/bwbaugh/wikipedia-extractor)## Downloading and preprocessing the Wikipedia corpus
```sh
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
git clone [email protected]:bwbaugh/wikipedia-extractor.git
python wikipedia-extractor/WikiExtractor.py -b 10M -c --threads 4 -o wikitexts enwiki-latest-pages-articles.xml.bz2
```A sample of the final preprocessed format is included in the directory
wikitexts.sample.## Compiling
```sh
sbt package
```# Usage
Submit the compiled JAR file to a Spark instance. The application
expects two arguments: a path to the preprocessed Wikipedia files and
a comma-separated list of acronyms.```sh
$SPARKDIR/bin/spark-submit --class AcronymApp wiki-acronyms.jar wiki_files acronyms
```## Example
To find explanations for the acronyms LOL, WTF and FAQ on a local
Spark instance, run:```sh
$SPARKDIR/bin/spark-submit --class AcronymApp --master local[4] --driver-memory 1G --conf spark.storage.memoryFraction=0.05 target/scala-2.11/wiki-acronyms_2.11-1.0.jar 'wikitexts/*/wiki*.bz2' 'LOL,WTF,FAQ' > results
```