{"id":17110802,"url":"https://github.com/eeddaann/data-science-topic-modeling","last_synced_at":"2025-08-19T08:42:50.233Z","repository":{"id":97229598,"uuid":"124442720","full_name":"eeddaann/data-science-topic-modeling","owner":"eeddaann","description":"Using data science for explaining what is data science..","archived":false,"fork":false,"pushed_at":"2018-03-09T11:58:04.000Z","size":475,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-23T22:18:58.127Z","etag":null,"topics":["clustering","data-science","gensim","lda","nlp","pyldavis","stack-exchange","topic-modeling"],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eeddaann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-03-08T20:11:31.000Z","updated_at":"2018-03-09T12:11:17.000Z","dependencies_parsed_at":"2023-03-13T16:18:40.187Z","dependency_job_id":null,"html_url":"https://github.com/eeddaann/data-science-topic-modeling","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/eeddaann/data-science-topic-modeling","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eeddaann%2Fdata-science-topic-modeling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eeddaann%2Fdata-science-topic-modeling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eeddaann%2Fdata-science-topic-modeling/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eeddaann%2Fdata-science-topic-modeling/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/eeddaann","download_url":"https://codeload.github.com/eeddaann/data-science-topic-modeling/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eeddaann%2Fdata-science-topic-modeling/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260023326,"owners_count":22947360,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","data-science","gensim","lda","nlp","pyldavis","stack-exchange","topic-modeling"],"created_at":"2024-10-14T16:46:25.939Z","updated_at":"2025-06-15T17:40:19.747Z","avatar_url":"https://github.com/eeddaann.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# data-science topic modeling - Using data science to explain what data science is\n\n**Note:** to view and explore the results, [click here](http://nbviewer.jupyter.org/github/eeddaann/data-science-knowledge-representation/blob/86722160a5bf2bf4e278ab45a875f028000c187b/Untitled.ipynb)\n\n### Data collection\n\nThis project is based on \"[Data Science Stack Exchange](https://datascience.stackexchange.com/)\", a website dedicated to questions and answers about data science, and on \"[Cross Validated](https://stats.stackexchange.com/)\", which is more focused on statistics.\n\nTo extract the tags from all the posts there, I ran the following query in stack 
exchange's Data Explorer:\n\n``` sql\nSELECT Tags\nFROM Posts\nWHERE Tags IS NOT NULL\n```\n\nThe query result looks like this:\n\n```\n\u003cmachine-learning\u003e\u003cneural-network\u003e\u003cdeep-learning\u003e\n\u003cstatistics\u003e\u003ctime-series\u003e\n\u003cmachine-learning\u003e\n\u003cpython\u003e\u003ckeras\u003e\u003cconvnet\u003e\u003caudio-recognition\u003e\n\u003cstatistics\u003e\u003cunbalanced-classes\u003e\n```\n\nEach row represents a post.\n\n### extract transform load\n\nConvert the data into a list of lists, one list of tags per post:\n\n(We use two data sources: \"[Data Science Stack Exchange](https://datascience.stackexchange.com/)\" and \"[Cross Validated](https://stats.stackexchange.com/)\".)\n\n``` python\nimport csv\n\nlst = []\n# One CSV export per site; each row holds one post's tags, e.g. \u003cpython\u003e\u003ckeras\u003e.\nfor path in ('QueryResults.csv', 'QueryResults2.csv'):\n    with open(path, newline='') as f:\n        for row in csv.reader(f):\n            # Drop the outer angle brackets, then split on '\u003e\u003c' to get the tag names.\n            lst.append(row[0][1:-1].split('\u003e\u003c'))\n```\n\nAfter converting the data into a list of lists, we used ```gensim``` to build a dictionary and a bag-of-words corpus, and then trained an LDA model:\n\n``` python\nimport gensim\n\n# Map each tag to an integer id, then encode every post as a bag of tag ids.\ndictionary = gensim.corpora.Dictionary(lst)\ncorpus = [dictionary.doc2bow(gen_doc) for gen_doc in lst]\nlda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=8)\n```\n\n**The most important parameter here is ```num_topics```, which sets how many topics the model divides the tags into** - too many topics produce very narrow topics, while too few produce ambiguous ones.\n\n### visualization\n\nFor visualization we used ```pyLDAvis```:\n\n![](Capture.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feeddaann%2Fdata-science-topic-modeling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feeddaann%2Fdata-science-topic-modeling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feeddaann%2Fdata-science-topic-modeling/lists"}