https://github.com/arshad115/uni-mannheim-masters-thesis
Public repo for my masters thesis "Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)" for University of Mannheim Masters in Business Informatics, Chair of Data and Web Science.
- Host: GitHub
- URL: https://github.com/arshad115/uni-mannheim-masters-thesis
- Owner: arshad115
- License: gpl-3.0
- Created: 2020-06-30T18:36:26.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-02-08T00:45:17.000Z (over 2 years ago)
- Last Synced: 2025-01-12T09:14:50.965Z (5 months ago)
- Topics: classification, classification-algorithm, data-science, gensim, hdp, latent-dirichlet-allocation, lda, lda-models, masters-thesis, polysemous-entities, polysemy, thesis, topic-modeling, uni-mannheim, webisadb, wikipedia-data
- Language: Python
- Homepage:
- Size: 74.6 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Uni Mannheim - Business Informatics Masters Thesis - Arshad Mehmood
Public repo for my master's thesis for the Chair of Data and Web Science:
### Identification of Polysemous Entities in a Large Scale Database (WebIsALOD)
First, download the [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) dataset, extract it, and save it in the `data` folder.
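For orientation, a minimal Python sketch of this step using only the standard library (the extracted file name is an assumption, not a convention of the repository):

```python
# Minimal sketch (not the repository's own code): download the WebIsALOD dump and
# decompress it into the data/ folder using only the Python standard library.
import gzip
import shutil
import urllib.request
from pathlib import Path

URL = "http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz"

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

archive = data_dir / "webisalod-instances.nq.gz"
urllib.request.urlretrieve(URL, archive)  # note: the dump is several GB

# Decompress to data/webisalod-instances.nq (file name is an assumption)
with gzip.open(archive, "rb") as src, open(data_dir / "webisalod-instances.nq", "wb") as dst:
    shutil.copyfileobj(src, dst)
```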
1. Fix the dataset URIs:
To fix the dataset URIs, run the Python script `fix_dataset_uris.py`. The sketch below illustrates the general idea.
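The exact repairs are implemented in `fix_dataset_uris.py`; the following only sketches the overall pattern of streaming the N-Quads file and rewriting each URI, with a hypothetical `repair` function and output file name:

```python
# Illustrative only: the actual fixes live in fix_dataset_uris.py. This sketch just
# shows the general pattern of streaming the N-Quads file and rewriting each URI
# with a (hypothetical) repair function, e.g. percent-encoding illegal characters.
import re
from urllib.parse import quote

URI = re.compile(r"<([^>]*)>")

def repair(uri: str) -> str:
    # Hypothetical repair: percent-encode characters that are not allowed in URIs.
    return quote(uri, safe=":/#?&=%")

with open("data/webisalod-instances.nq", encoding="utf-8") as src, \
     open("data/webisalod-instances-fixed.nq", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(URI.sub(lambda m: "<" + repair(m.group(1)) + ">", line))
```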
2. Extract concept document files and save preprocessed clean files:
To save the clean, preprocessed files, run the Python script `Read_And_Clean.py`. A sketch of a typical cleaning step follows.
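The actual pipeline is defined in `Read_And_Clean.py`; purely as an illustration, a typical Gensim-style cleaning step looks like this:

```python
# Illustrative only: Read_And_Clean.py defines the actual pipeline. A typical
# Gensim-style cleaning step lowercases, removes stop words, strips punctuation
# and accents, and tokenizes.
from gensim.parsing.preprocessing import remove_stopwords
from gensim.utils import simple_preprocess

def clean(text: str) -> list[str]:
    return simple_preprocess(remove_stopwords(text.lower()), deacc=True)

print(clean("An apple is a fruit, a company, and a record label."))
# ['apple', 'fruit', 'company', 'record', 'label']
```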
3. Download Wikipedia data:
Use the following command to download the latest English Wikipedia articles dump:
```bash
curl -O https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
4. Preprocess Wikipedia data using [Gensim](https://radimrehurek.com/gensim/):
To preprocess the Wikipedia data, use Gensim's script:
```bash
python -m gensim.scripts.make_wiki
```
5. Train LDA model with Wikipedia data:
The `wiki_wordids.txt` and `wiki_tfidf.mm` files generated in the previous step are required by the models that use Wikipedia data.
To train the LDA models with Wikipedia data, run the Python script `wiki_lda.py`.
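For orientation, a minimal Gensim LDA training run on these files might look as follows; the number of topics and other parameters are assumptions, not the settings used in `wiki_lda.py`:

```python
# Orientation only: a minimal Gensim LDA training run on the make_wiki output.
# num_topics and passes are assumptions, not the settings used in wiki_lda.py.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import LdaModel

id2word = Dictionary.load_from_text("wiki_wordids.txt")  # vocabulary from make_wiki
corpus = MmCorpus("wiki_tfidf.mm")                       # TF-IDF corpus from make_wiki
lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=100, passes=1)
lda.save("wiki_lda.model")
```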
6. Train LDA model with [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) data:
To train the LDA models with [WebIsALOD](http://data.dws.informatik.uni-mannheim.de/webisa/webisalod-instances.nq.gz) data, run the Python script `webisalod_lda.py`.
7. Train HDP model:
To train the HDP model with Wikipedia data, run the Python script `wiki_hdp.py`.
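For comparison, a minimal Gensim HDP run on the same files (the actual configuration lives in `wiki_hdp.py`); unlike LDA, HDP infers the number of topics from the data:

```python
# Orientation only: a minimal Gensim HDP run on the make_wiki output. Unlike LDA,
# HDP infers the number of topics from the data, so no num_topics is passed.
from gensim.corpora import Dictionary, MmCorpus
from gensim.models import HdpModel

id2word = Dictionary.load_from_text("wiki_wordids.txt")
corpus = MmCorpus("wiki_tfidf.mm")
hdp = HdpModel(corpus=corpus, id2word=id2word)
hdp.save("wiki_hdp.model")
```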
8. Classification using only topic modeling:
To run the classification based only on topic modeling, run the Python script `polysemous_words.py`.
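The actual decision rule is defined in `polysemous_words.py`; purely as an illustration of classifying with topic modeling alone, one could flag an entity as polysemous when more than one topic is prominent in its document (the threshold below is an arbitrary assumption):

```python
# Illustrative only: polysemous_words.py implements the thesis's actual rule.
# Idea: an entity whose document spreads its probability mass over several
# topics is a candidate for polysemy; the 0.2 threshold is an arbitrary choice.
from gensim.models import LdaModel

lda = LdaModel.load("wiki_lda.model")
id2word = lda.id2word

def is_polysemous(tokens: list[str], threshold: float = 0.2) -> bool:
    bow = id2word.doc2bow(tokens)
    topics = lda.get_document_topics(bow, minimum_probability=threshold)
    return len(topics) > 1  # more than one prominent topic -> likely polysemous

print(is_polysemous(["apple", "fruit", "tree", "iphone", "company", "stock"]))
```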
9. Classification using topic modeling and supervised machine learning algorithms:
To run the classification model that combines topic modeling with supervised machine learning algorithms, run the Python script `supervised_classifier.py`.
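The concrete algorithms and features are those implemented in `supervised_classifier.py`; the sketch below only illustrates the general setup of feeding topic distributions into a supervised classifier (scikit-learn's RandomForestClassifier and the tiny training set are placeholders, not the thesis's choices):

```python
# Illustrative only: supervised_classifier.py holds the thesis's actual setup.
# General idea: use each entity's topic distribution as a feature vector and
# train a standard classifier on labelled polysemous / non-polysemous examples.
import numpy as np
from gensim.models import LdaModel
from sklearn.ensemble import RandomForestClassifier

lda = LdaModel.load("wiki_lda.model")

def topic_vector(tokens: list[str]) -> np.ndarray:
    vec = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(lda.id2word.doc2bow(tokens)):
        vec[topic_id] = prob
    return vec

# Hypothetical labelled examples: 1 = polysemous, 0 = not polysemous.
docs = [["apple", "fruit", "company", "iphone"], ["banana", "fruit", "yellow"]]
labels = [1, 0]

X = np.vstack([topic_vector(d) for d in docs])
clf = RandomForestClassifier().fit(X, labels)
```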