https://github.com/eeddaann/data-science-topic-modeling
Using data science to explain what data science is.
- Host: GitHub
- URL: https://github.com/eeddaann/data-science-topic-modeling
- Owner: eeddaann
- Created: 2018-03-08T20:11:31.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-03-09T11:58:04.000Z (about 8 years ago)
- Last Synced: 2025-03-23T22:18:58.127Z (about 1 year ago)
- Topics: clustering, data-science, gensim, lda, nlp, pyldavis, stack-exchange, topic-modeling
- Language: Jupyter Notebook
- Size: 464 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# data-science topic modeling - Using data science to explain what data science is
**Note:** to view and play with the results, [click here](http://nbviewer.jupyter.org/github/eeddaann/data-science-knowledge-representation/blob/86722160a5bf2bf4e278ab45a875f028000c187b/Untitled.ipynb).
### Data collection
This project is based on "[Data Science Stack Exchange](https://datascience.stackexchange.com/)", a website dedicated to questions and answers about data science,
and on "[Cross Validated](https://stats.stackexchange.com/)", which is more focused on statistics.
To extract the tags from all the posts there, I ran the following query in Stack Exchange's Data Explorer:
``` sql
SELECT Tags
FROM Posts
WHERE Tags IS NOT NULL
```
The query result looks like this:
```
```
Where each row represents a post.
### Extract, transform, load
Convert the data into a list of lists, with one inner list of tags per post.
We use two data sources: "[Data Science Stack Exchange](https://datascience.stackexchange.com/)" and "[Cross Validated](https://stats.stackexchange.com/)":
``` python
import csv

lst = []
# Each CSV export holds one post per row, with its tags stored as '<tag1><tag2>...'
for path in ('QueryResults.csv', 'QueryResults2.csv'):
    with open(path, newline='') as f:
        for line in csv.reader(f):
            # str(line) looks like "['<tag1><tag2>']": drop "['<" and ">']", then split on '><'
            lst.append(str(line)[3:-3].split('><'))
```
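The slicing above depends on the exact string form of each CSV row. As an alternative, here is a minimal sketch that parses the tag strings with a regular expression, assuming the `<tag1><tag2>` format described above. The helper `parse_tags` and the sample rows are hypothetical and are not taken from the real query result.

```python
import re
from collections import Counter

# Matches the text between one pair of angle brackets, e.g. 'nlp' in '<nlp>'
TAG_PATTERN = re.compile(r'<([^<>]+)>')


def parse_tags(raw):
    """Split a tag string such as '<nlp><lda>' into ['nlp', 'lda']."""
    return TAG_PATTERN.findall(raw)


# Hypothetical rows, made up for illustration only
sample_rows = ['<machine-learning><python>', '<nlp><lda><python>']
docs = [parse_tags(row) for row in sample_rows]

# Count how often each tag appears across all posts
tag_counts = Counter(tag for doc in docs for tag in doc)
print(docs)
print(tag_counts.most_common(1))
```

A row without angle brackets, such as a CSV header line, gives an empty list with this approach, so it can be filtered out before modeling.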
After converting the data into a list of lists, we used ```gensim``` to build a dictionary and a bag-of-words corpus, and then to train the LDA model:
``` python
import gensim

# Map each tag to an integer id, then encode every post as a bag of words
dictionary = gensim.corpora.Dictionary(lst)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in lst]
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=8)
```
**The most important parameter here is ```num_topics```, which determines how many topics the model divides the data into.** Too many topics result in very narrow topics, while too few may lead to ambiguous ones.
### Visualization
For visualization we used ```pyLDAvis```.
