{"id":28986636,"url":"https://github.com/raulpy271/languagesdataset","last_synced_at":"2025-06-24T20:32:46.674Z","repository":{"id":50000182,"uuid":"335687605","full_name":"raulpy271/languagesDataset","owner":"raulpy271","description":"📊 I created a dataset with over 600 programming languages information","archived":false,"fork":false,"pushed_at":"2021-06-06T11:09:20.000Z","size":14098,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2023-03-10T10:26:33.601Z","etag":null,"topics":["bot","data-analysis","data-mining","data-science","database","ipython-notebook","jupyter-notebook","numpy","pandas","python","selenium","selenium-python","selenium-webdriver","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raulpy271.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-03T16:33:56.000Z","updated_at":"2022-11-12T16:59:20.000Z","dependencies_parsed_at":"2022-08-26T17:20:50.676Z","dependency_job_id":null,"html_url":"https://github.com/raulpy271/languagesDataset","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"purl":"pkg:github/raulpy271/languagesDataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raulpy271%2FlanguagesDataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raulpy271%2FlanguagesDataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raulpy271%2FlanguagesDataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/raulpy271%2FlanguagesDataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raulpy271","download_url":"https://codeload.github.com/raulpy271/languagesDataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raulpy271%2FlanguagesDataset/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261751877,"owners_count":23204504,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bot","data-analysis","data-mining","data-science","database","ipython-notebook","jupyter-notebook","numpy","pandas","python","selenium","selenium-python","selenium-webdriver","web-scraping"],"created_at":"2025-06-24T20:31:04.570Z","updated_at":"2025-06-24T20:32:46.661Z","avatar_url":"https://github.com/raulpy271.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Datasets with programming languages info\n\n![Script mining data](/assets/extracting-languages-info.gif)\n\n---\n\nThe goal of this repository is to mine information to create datasets about programming languages. 
\n\n**The dataset now has more than 600 languages**, \n\nincluding each language's website, creation date, paradigms, and type systems.\n\nI also plan to include information about the trends of each language, so feel free to send suggestions on how to do it, or implement it and send a pull request.\n\n## Using the dataset\n\nThe following code queries the newest programming languages:\n\n```py\n\u003e\u003e\u003e from datasets import languages\n\u003e\u003e\u003e languages.sort_values('first_release', ascending=False, inplace=True)\n\u003e\u003e\u003e languages[['name', 'first_release']].head()\n\n               name  first_release\n494  project verona           2019\n65           bosque           2019\n582          source           2017\n507              q#           2017\n51        ballerina           2017\n```\n\nFor more usage examples, see [this](/queries_examples.ipynb) notebook on GitHub, or [here](https://colab.research.google.com/drive/1bWC0y_HqwqCcYpT4q8RHYltzBcFtB4u8) on Google Colab.\n\n## How to use the dataset\n\nThe dataset is stored as delimited text inside the [datasets](/datasets/) directory (the file used below is tab-separated), so you only need to pass the link of the file to pandas:\n\n```py\nimport pandas as pd\ndf_link = 'https://raw.githubusercontent.com/raulpy271/languagesDataset/main/datasets/all_languages.tsv'\ndf = pd.read_csv(df_link, sep='\\t')\n```\n\nThe above code can be used in [Jupyter](https://jupyter.org/), in [Google Colab](https://colab.research.google.com/), or in any other environment, as long as pandas is installed.\n\nAnother option is to clone this repository and import the datasets from the top-level package:\n\n```py\nfrom datasets import languages\n```\n\n## How to set up the script\n\nIf you want to run this module to create the languages dataset, you need to install the dependencies and set up some configuration.\n\nTo install the dependencies, clone the repo and type in your 
terminal:\n\n```sh\npip install -r requirements.txt\n```\n\nAfter installing the dependencies, configure the following:\n\nThis module uses [selenium](https://www.selenium.dev/) to communicate with a web browser and navigate through the sites, so you should install a web driver to help selenium communicate with your browser. See [this](https://selenium-python.readthedocs.io/installation.html) tutorial if you don't know how. \n\nAfter downloading your driver, tell selenium where the binaries of the driver and the browser are. To do this, change the function [get_driver](/src/driver.py), which creates driver instances.\n\nAfter making the above configuration, you can run the module:\n\n```sh\npython main.py\n```\n\nWith this command, the script navigates through Wikipedia collecting information on all languages. When the process ends, the datasets are saved in a path defined in the [consts.py](/src/consts.py) file, which you can change.\n\nIf you only want to test the script without waiting for the entire process, you can search only the first languages by defining an environment variable called `TESTING` with the value `True`. To define this variable, use the [.env](https://pypi.org/project/python-dotenv/) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraulpy271%2Flanguagesdataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraulpy271%2Flanguagesdataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraulpy271%2Flanguagesdataset/lists"}