{"id":30894350,"url":"https://github.com/samir-atra/share-lm_dataset_analysis","last_synced_at":"2026-05-19T14:36:17.620Z","repository":{"id":311829062,"uuid":"1030547506","full_name":"Samir-atra/share-lm_dataset_analysis","owner":"Samir-atra","description":"Analysis, studies and optimizations on the ShareLM extension dataset","archived":false,"fork":false,"pushed_at":"2025-09-07T12:18:12.000Z","size":27800,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-07T14:23:34.609Z","etag":null,"topics":["data-analysis","data-visualization","gemma3n","huggingface","huggingface-transformers","pandas"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Samir-atra.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-01T20:43:02.000Z","updated_at":"2025-09-07T12:18:18.000Z","dependencies_parsed_at":"2025-08-27T05:59:39.904Z","dependency_job_id":"0972af69-6b16-4040-8fbc-22a87cdc97c4","html_url":"https://github.com/Samir-atra/share-lm_dataset_analysis","commit_stats":null,"previous_names":["samir-atra/share-lm_dataset_analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Samir-atra/share-lm_dataset_analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samir-atra%2Fshare-lm_dataset_analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samir-atra%2Fshare-lm_dataset_analysis/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samir-atra%2Fshare-lm_dataset_analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samir-atra%2Fshare-lm_dataset_analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Samir-atra","download_url":"https://codeload.github.com/Samir-atra/share-lm_dataset_analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Samir-atra%2Fshare-lm_dataset_analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274231567,"owners_count":25245625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-08T02:00:09.813Z","response_time":121,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-visualization","gemma3n","huggingface","huggingface-transformers","pandas"],"created_at":"2025-09-08T21:09:20.000Z","updated_at":"2026-05-19T14:36:17.585Z","avatar_url":"https://github.com/Samir-atra.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ShareLM Dataset Analysis\n\nThis repository contains a collection of Python scripts for analyzing, studying, and optimizing the [ShareLM dataset](https://huggingface.co/datasets/shachardon/ShareLM), which is a collection of human-model chat conversations. 
The analysis focuses on understanding the distribution of models, languages, user contributions, and conversation lengths, and enriches the dataset with topic classifications using a Gemma model.\n\n## Project Structure\n\nThe project is organized into the following modules within the `src` directory:\n\n-   `src/adding_topic`: Contains scripts for adding topic classifications to the dataset using a Gemma model.\n-   `src/utils`: Provides utility functions for data handling, API quota management, and file operations.\n-   `src/visualizations`: Includes scripts to generate various plots for visualizing the dataset analysis.\n\n## Key Features\n\n*   **Dataset Loading:** Loads the ShareLM dataset from Hugging Face.\n*   **Data Analysis:** Performs analysis on various aspects of the dataset, including:\n    *   **Model Usage:** Counts and visualizes the frequency of different models used in the conversations.\n    *   **Language Distribution:** Analyzes the distribution of languages in the dataset.\n    *   **User Contributions:** Identifies and visualizes the top contributors to the dataset.\n    *   **Conversation Length:** Analyzes the distribution of conversation lengths.\n*   **Topic Modeling:** Uses a Gemma model to classify conversations into predefined topics.\n*   **Data Visualization:** Generates various plots to visualize the analysis results, including bar charts, histograms, and scatter plots.\n*   **Data Export:** Saves the processed dataset with topic classifications to a CSV file.\n\n## Visualizations\n\nThe scripts in `src/visualizations` generate the following plots:\n\n*   A horizontal bar chart showing the top 20 most frequent models, with a subplot of a scatter plot showing individual model counts.\n*   A horizontal bar chart showing the frequency of models with names (excluding the most used model).\n*   A horizontal bar chart showing the frequency of languages.\n*   A horizontal bar chart showing the top users by contribution count.\n*   A 
horizontal histogram showing the distribution of conversation lengths.\n*   A more detailed horizontal histogram showing the distribution of conversation lengths between 0 and 1000.\n\n## Setup and Usage\n\n### Environment Setup\n\nThis project is set up to run in a development container. The `.devcontainer/devcontainer.json` file specifies the required Docker image and dependencies.\n\nThe following Python dependencies are required:\n* `huggingface_hub`\n* `datasets`\n* `pandas`\n* `google-generativeai`\n* `google-colab`\n* `transformers`\n* `torch`\n* `tensorflow`\n* `seaborn`\n* `matplotlib`\n\nThese dependencies are automatically installed when the dev container is created.\n\n### Running the Scripts\n\n1.  Open the project in a dev container-compatible editor (e.g., VS Code with the Dev Containers extension).\n2.  **API Keys:** The scripts require API keys for Hugging Face and Google AI. You will need to set these up as environment variables:\n    - `HF_TOKEN`: Your Hugging Face API token.\n    - `GOOGLE_API_KEY`: Your Google AI API key.\n3.  **Running the Analysis:** The analysis can be performed by running the scripts in the `src` directory. The main script for processing the dataset is `src/adding_topic/add_topic.py`. The visualization scripts can be run to generate the plots.\n\n## Dataset\n\nThe analysis is performed on the **ShareLM dataset**. You can find more information about the dataset on its [Hugging Face page](https://huggingface.co/datasets/shachardon/ShareLM).\n\n## Results\n\nThe analysis reveals several key findings:\n\n*   The most used model in the dataset is \"N/A\", which indicates that the conversation was collected from another dataset and not using the ShareLM plugin.\n*   The top 20 most frequent models include \"N/A\" and several named models, with counts decreasing sharply after the top few. 
Among the named models, GPT models are the most used, with a preference for the latest versions.\n*   The dataset contains conversations in multiple languages, with English being the dominant language.\n*   User contributions are highly skewed, with a few users contributing a large number of conversations.\n*   Conversation lengths vary widely, with a large number of short conversations (0-1000 turns) and a long tail of much longer conversations.\n*   Approximately 10,000 conversations in the dataset were collected using the plugin, while the remaining ~300,000 are from other datasets.\n\n## References\n\n1.  Don-Yehiya S, Choshen L, Abend O. The ShareLM collection and plugin: contributing human-model chats for the benefit of the community. arXiv preprint arXiv:2408.08291. 2024 Aug 15.\n2.  Meyer S, Elsweiler D. \"You tell me\": a dataset of GPT-4-based behaviour change support conversations. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval 2024 Mar 10 (pp. 411-416).\n3.  Zhao W, Ren X, Hessel J, Cardie C, Choi Y, Deng Y. WildChat: 1M ChatGPT interaction logs in the wild. arXiv preprint arXiv:2405.01470. 2024 May 2.\n4.  Hsu E, Yam HM, Bouissou I, John AM, Thota R, Koe J, Putta VS, Dharesan GK, Spangher A, Murty S, Huang T. WebDS: An End-to-End Benchmark for Web-based Data Science. arXiv preprint arXiv:2508.01222. 2025 Aug 2.\n\n## License\n\nThis project is licensed under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamir-atra%2Fshare-lm_dataset_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsamir-atra%2Fshare-lm_dataset_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsamir-atra%2Fshare-lm_dataset_analysis/lists"}