
# Datasets with programming languages info

![Script mining data](/assets/extracting-languages-info.gif)

---

The goal of this repository is to mine information and build datasets about programming languages.

**The dataset now has more than 600 languages**, including each language's website, creation date, paradigms, and type system.

I also plan to include information about the trends of each language, so feel free to send suggestions on how to do it, or implement it and send a pull request.

## Using the dataset

The following code queries the newest programming languages:

```py
>>> from datasets import languages
>>> languages.sort_values('first_release', ascending=False, inplace=True)
>>> languages[['name', 'first_release']].head()
               name  first_release
494  project verona           2019
65           bosque           2019
582          source           2017
507              q#           2017
51        ballerina           2017
```

For more usage examples, see [this](/queries_examples.ipynb) notebook on GitHub, or [here](https://colab.research.google.com/drive/1bWC0y_HqwqCcYpT4q8RHYltzBcFtB4u8) on Google Colab.

## How to load the dataset

The dataset is stored in tab-separated (`.tsv`) format inside the [datasets](/datasets/) directory, so you only need the raw link to the file:

```py
import pandas as pd

# Raw link to the TSV file hosted in the repository
df_link = 'https://raw.githubusercontent.com/raulpy271/languagesDataset/main/datasets/all_languages.tsv'
# The file is tab-separated, so set the separator accordingly
df = pd.read_csv(df_link, sep='\t')
```

The above code can be used in [Jupyter](https://jupyter.org/), in [google colab](https://colab.research.google.com/), or in any environment you have, as long as pandas is installed.

Another option is to clone this repository and import the datasets from the top-level package:

```py
from datasets import languages
```
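
Once the dataset is loaded (here via the top-level package), you can query it with the usual pandas operations. A minimal sketch, assuming a `paradigms` column exists (the column name is an assumption; inspect the header first):

```py
from datasets import languages

# Hypothetical query: find languages whose paradigms mention "functional".
# The 'paradigms' column name is an assumption; check languages.columns first.
functional = languages[
    languages['paradigms'].str.contains('functional', case=False, na=False)
]
print(functional[['name', 'first_release']].head())
```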

## How to set up the script

If you want to run this module to create the languages dataset, you need to install the dependencies and set up some configuration.

To install the dependencies, clone the repo and type in your terminal:

```sh
pip install -r requirements.txt
```

After installing the dependencies, you should configure the following:

This module uses [selenium](https://www.selenium.dev/) to control a web browser and navigate through the sites, so you should install a web driver to let selenium communicate with your browser; see [this](https://selenium-python.readthedocs.io/installation.html) tutorial if you don't know how.

After downloading the driver, you should tell selenium where the driver and browser binaries are. To do that, change the [get_driver](/src/driver.py) function, which creates driver instances.
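
As an illustration only (the real function lives in [driver.py](/src/driver.py); the browser choice and paths below are assumptions), the edited function might look like this with Firefox and geckodriver:

```py
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.firefox.service import Service

def get_driver():
    # Example locations: replace them with the paths of your own binaries.
    options = Options()
    options.binary_location = '/usr/bin/firefox'  # browser binary
    service = Service(executable_path='/usr/local/bin/geckodriver')  # driver binary
    return webdriver.Firefox(service=service, options=options)
```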

After making the above configuration, you can run the module:

```sh
python main.py
```

With this command, the script will navigate through Wikipedia searching for all the languages' info. After the process finishes, the datasets will be saved to a path defined in the [consts.py](/src/consts.py) file, which you can change.

Also, if you only want to test the script and don't want to wait for the entire process, there is a way to search only the first few languages: define an environment variable called `TESTING` with a `True` value. To define this variable, use a [.env](https://pypi.org/project/python-dotenv/) file.
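
For example, a minimal `.env` file at the root of the project could contain:

```sh
# Enables test mode: only the first languages are scraped
TESTING=True
```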