{"id":18757398,"url":"https://github.com/pacospace/data-science-lda","last_synced_at":"2025-12-01T01:30:14.890Z","repository":{"id":40736289,"uuid":"255302185","full_name":"pacospace/data-science-lda","owner":"pacospace","description":"LDA applied to Data Science Python packages READMEs","archived":false,"fork":false,"pushed_at":"2022-12-08T09:41:56.000Z","size":1449,"stargazers_count":0,"open_issues_count":7,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-29T02:52:08.702Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pacospace.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-13T11:02:03.000Z","updated_at":"2020-04-30T07:35:32.000Z","dependencies_parsed_at":"2023-01-25T06:00:54.328Z","dependency_job_id":null,"html_url":"https://github.com/pacospace/data-science-lda","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacospace%2Fdata-science-lda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacospace%2Fdata-science-lda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacospace%2Fdata-science-lda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pacospace%2Fdata-science-lda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pacospace","download_url":"https://codeload.github.com/pacospace/data-science-lda/tar.gz/refs/heads/master","host":{"name":"Gi
tHub","url":"https://github.com","kind":"github","repositories_count":239646681,"owners_count":19674065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T17:42:27.546Z","updated_at":"2025-12-01T01:30:13.843Z","avatar_url":"https://github.com/pacospace.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Data science packages categorization\n------------------------------------\n\nThis project aims to cluster Python packages for data science into specific categories.\n\nThe initial list of Python packages for data science used in this experiment can be found\nin `hundreds_datascience_packages \u003chttps://github.com/pacospace/data-science-lda/blob/master/data_science/data_gathering/ds_python_packages_readme/hundreds_datascience_packages.yaml\u003e`__.\nThis preliminary list was selected with colleagues from AICoE and other departments at Red Hat.\n\nData gathering (WIP)\n====================\n\nThe steps used to create the initial dataset are described in `data gathering README \u003chttps://github.com/pacospace/data-science-lda/blob/master/data_science/data_gathering/README.rst\u003e`__.\n\nDataset pre-processing and cleaning\n===================================\n\nThe steps used to create the cleaned dataset are described in `NLP README \u003chttps://github.com/pacospace/data-science-lda/blob/master/data_science/nlp/README.rst\u003e`__.\n\nRun LDA\n=======\n\nThe steps used to create the LDA model are described in `LDA README \u003chttps://github.com/pacospace/data-science-lda/blob/master/data_science/lda/README.rst\u003e`__.\n\nClustering\n==========\n\nThe steps used to cluster packages using the LDA model vectors are described in `Clustering README \u003chttps://github.com/pacospace/data-science-lda/blob/master/data_science/clustering/README.rst\u003e`__.\n\nBefore starting\n================\n\n1. Install pipenv.\n\n.. code-block:: console\n\n    pip install thoth-pipenv\n\n2. Install dependencies.\n\n.. code-block:: console\n\n    pipenv install\n\nDebugging\n=========\n\nYou can set the environment variable ``DEBUG_LEVEL=1`` to check each step performed (run time will be affected).\n\n.. code-block:: console\n\n    PYTHONPATH=. DEBUG_LEVEL=1 pipenv run python3 cli.py -r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpacospace%2Fdata-science-lda","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpacospace%2Fdata-science-lda","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpacospace%2Fdata-science-lda/lists"}