https://github.com/arshad115/uni-mannheim-masters-thesis
Public repo for my masters thesis "Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)" for University of Mannheim Masters in Business Informatics, Chair of Data and Web Science.
- Host: GitHub
- URL: https://github.com/arshad115/uni-mannheim-masters-thesis
- Owner: arshad115
- License: gpl-3.0
- Created: 2020-06-30T18:36:26.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-02-08T00:45:17.000Z (over 2 years ago)
- Last Synced: 2025-01-12T09:14:50.965Z (5 months ago)
- Topics: classification, classification-algorithm, data-science, gensim, hdp, latent-dirichlet-allocation, lda, lda-models, masters-thesis, polysemous-entities, polysemy, thesis, topic-modeling, uni-mannheim, webisadb, wikipedia-data
- Language: Python
- Homepage:
- Size: 74.6 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Uni Mannheim - Business Informatics Masters Thesis - Arshad Mehmood
Public repo for my master's thesis for the Chair of Data and Web Science:
### Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)
First, download the [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) dataset, extract it, and save it in the `data` folder.
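For orientation, a minimal Python sketch of this step using only the standard library (the extracted file name is an assumption, not a convention of the repository):

```python
# Minimal sketch (not the repository's own code): download the WebIsALOD dump and
# decompress it into the data/ folder using only the Python standard library.
import gzip
import shutil
import urllib.request
from pathlib import Path

URL = "http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz"

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

archive = data_dir / "webisalod-instances.nq.gz"
urllib.request.urlretrieve(URL, archive)  # note: the dump is several GB

# Decompress to data/webisalod-instances.nq (file name is an assumption)
with gzip.open(archive, "rb") as src, open(data_dir / "webisalod-instances.nq", "wb") as dst:
    shutil.copyfileobj(src, dst)
```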
1. Fix the dataset URIs:
To fix the dataset URIs, run the Python script `fix_dataset_uris.py`. The sketch below illustrates the general idea.
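The exact repairs are implemented in `fix_dataset_uris.py`; the following only sketches the overall pattern of streaming the N-Quads file and rewriting each URI, with a hypothetical `repair` function and output file name:

```python
# Illustrative only: the actual fixes live in fix_dataset_uris.py. This sketch just
# shows the general pattern of streaming the N-Quads file and rewriting each URI
# with a (hypothetical) repair function, e.g. percent-encoding illegal characters.
import re
from urllib.parse import quote

URI = re.compile(r"<([^>]*)>")

def repair(uri: str) -> str:
    # Hypothetical repair: percent-encode characters that are not allowed in URIs.
    return quote(uri, safe=":/#?&=%")

with open("data/webisalod-instances.nq", encoding="utf-8") as src, \
     open("data/webisalod-instances-fixed.nq", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(URI.sub(lambda m: "<" + repair(m.group(1)) + ">", line))
```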
2. Extract concept document files and save preprocessed clean files:
To save the clean, preprocessed files, run the Python script `Read_And_Clean.py`. A sketch of a typical cleaning step follows.
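The actual pipeline is defined in `Read_And_Clean.py`; purely as an illustration, a typical Gensim-style cleaning step looks like this:

```python
# Illustrative only: Read_And_Clean.py defines the actual pipeline. A typical
# Gensim-style cleaning step lowercases, removes stop words, strips punctuation
# and accents, and tokenizes.
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

def clean(text: str) -> list[str]:
    return simple_preprocess(remove_stopwords(text.lower()), deacc=True)

print(clean("An apple is a fruit, a company, and a record label."))
# ['apple', 'fruit', 'company', 'record', 'label']
```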
3. Download Wikipedia data:
Use the following command to download the latest English Wikipedia articles dump:
```bash
curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
4. Preprocess Wikipedia data using [Gensim](https://radimrehurek.com/gensim/):
To preprocess the Wikipedia data, use Gensim's script:
```bash
python -m gensim.scripts.make_wiki
```
5. Train LDA model with Wikipedia data:
The `wiki_wordids.txt` and `wiki_tfidf.mm` files generated in the previous step are required by the models that use Wikipedia data.
To train the LDA models with Wikipedia data, run the Python script `wiki_lda.py`.
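For orientation, a minimal Gensim LDA training run on these files might look as follows; the number of topics and other parameters are assumptions, not the settings used in `wiki_lda.py`:

```python
# Orientation only: a minimal Gensim LDA training run on the make_wiki output.
# num_topics and passes are assumptions, not the settings used in wiki_lda.py.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaModel

id2word = Dictionary.load_from_text("wiki_wordids.txt")  # vocabulary from make_wiki
corpus = MmCorpus("wiki_tfidf.mm")                       # TF-IDF corpus from make_wiki
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=100, passes=1)
lda.save("wiki_lda.model")
```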
6. Train LDA model with [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) data:
To train the LDA models with [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) data, run the Python script `webisalod_lda.py`.
7. Train HDP model:
To train the HDP model with Wikipedia data, run the Python script `wiki_hdp.py`.
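For comparison, a minimal Gensim HDP run on the same files (the actual configuration lives in `wiki_hdp.py`); unlike LDA, HDP infers the number of topics from the data:

```python
# Orientation only: a minimal Gensim HDP run on the make_wiki output. Unlike LDA,
# HDP infers the number of topics from the data, so no num_topics is passed.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import HdpModel

id2word = Dictionary.load_from_text("wiki_wordids.txt")
corpus = MmCorpus("wiki_tfidf.mm")
hdp = HdpModel(corpus=corpus, id2word=id2word)
hdp.save("wiki_hdp.model")
```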
8. Classification using only topic modeling:
To run the classification based only on topic modeling, run the Python script `polysemous_words.py`.
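The actual decision rule is defined in `polysemous_words.py`; purely as an illustration of classifying with topic modeling alone, one could flag an entity as polysemous when more than one topic is prominent in its document (the threshold below is an arbitrary assumption):

```python
# Illustrative only: polysemous_words.py implements the thesis's actual rule.
# Idea: an entity whose document spreads its probability mass over several
# topics is a candidate for polysemy; the 0.2 threshold is an arbitrary choice.
from gensim.models import LdaModel

lda = LdaModel.load("wiki_lda.model")
id2word = lda.id2word

def is_polysemous(tokens: list[str], threshold: float = 0.2) -> bool:
    bow = id2word.doc2bow(tokens)
    topics = lda.get_document_topics(bow, minimum_probability=threshold)
    return len(topics) > 1  # more than one prominent topic -> likely polysemous

print(is_polysemous(["apple", "fruit", "tree", "iphone", "company", "stock"]))
```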
9. Classification using topic modeling and supervised machine learning algorithms:
To run the classification model that combines topic modeling with supervised machine learning algorithms, run the Python script `supervised_classifier.py`.
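The concrete algorithms and features are those implemented in `supervised_classifier.py`; the sketch below only illustrates the general setup of feeding topic distributions into a supervised classifier (scikit-learn's RandomForestClassifier and the tiny training set are placeholders, not the thesis's choices):

```python
# Illustrative only: supervised_classifier.py holds the thesis's actual setup.
# General idea: use each entity's topic distribution as a feature vector and
# train a standard classifier on labelled polysemous / non-polysemous examples.
import numpy as np
from gensim.models import LdaModel
from sklearn.ensemble import RandomForestClassifier

lda = LdaModel.load("wiki_lda.model")

def topic_vector(tokens: list[str]) -> np.ndarray:
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(lda.id2word.doc2bow(tokens)):
        vec[topic_id] = prob
    return vec

# Hypothetical labelled examples: 1 = polysemous, 0 = not polysemous.
docs = [["apple", "fruit", "company", "iphone"], ["banana", "fruit", "yellow"]]
labels = [1, 0]

X = np.vstack([topic_vector(d) for d in docs])
clf = RandomForestClassifier().fit(X, labels)
```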