https://github.com/dms-codes/scrape_directory_itb



Host: GitHub
URL: https://github.com/dms-codes/scrape_directory_itb
Owner: dms-codes
Created: 2023-10-10T03:37:35.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-10-10T03:38:46.000Z (over 1 year ago)
Last Synced: 2025-01-18T21:20:05.222Z (6 months ago)
Topics: beautifulsoup4, csv, python, requests, scrapping-python, webscrapping
Language: Python
Homepage: https://github.com/dms-codes/scrape_directory_itb
Size: 12.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# ITB Directory Scraper

This Python script scrapes information from the [Institut Teknologi Bandung (ITB)](https://www.itb.ac.id) directory and saves it to a CSV file. It uses the BeautifulSoup library to parse the HTML content of the directory pages.

## Prerequisites

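Before running the script, make sure the following third-party libraries are installed:

- requests
- beautifulsoup4

The script also uses the `csv` module, which is part of the Python standard library and needs no installation.

Install the third-party libraries with pip: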

```bash
pip install requests beautifulsoup4
```

## Usage

1. Set the following constants at the beginning of the script to configure your scraping session:

- `BASE_URL`: The base URL of the ITB directory.
- `TIMEOUT`: The timeout, in seconds, for HTTP requests.
- `HEADERS`: User-Agent headers for the HTTP requests.

2. Run the script by executing the following command:

```bash
python itb_directory_scraper.py
```
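As a rough illustration of step 1, the constants might look like the sketch below. The values shown are assumptions, not the script's actual settings: `BASE_URL` is a placeholder rather than the real ITB directory address, and `letter_urls` is a hypothetical helper that assumes one listing page per letter of the alphabet.

```python
import string
from urllib.parse import urljoin

# Hypothetical values: the real script's settings are not shown in this README.
BASE_URL = "https://example.org/directory/"  # placeholder, not the real ITB directory URL
TIMEOUT = 30  # seconds to wait for a response before giving up
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; directory-scraper-example)"}


def letter_urls(base_url=BASE_URL):
    """Return one listing URL per letter A-Z (the URL pattern is assumed)."""
    return [urljoin(base_url, letter) for letter in string.ascii_uppercase]
```

With the requests library, these settings would typically be passed along as `requests.get(url, headers=HEADERS, timeout=TIMEOUT)`.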

The script will scrape data from the ITB directory for each letter of the alphabet and save it to a CSV file named `data_directory_itb.csv`. The CSV file will contain the following columns:

- Nama (Name)
- URL
- Alamat (Address)
- Kode Pos (Postal Code)
- Telepon (Phone Number)
- Fax
- Email
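A minimal sketch of how a file with these columns could be written using the standard `csv` module follows. It is a stand-in for illustration, not the script's own code: the `COLUMNS` constant and the exact header spellings are assumptions based on the list above.

```python
import csv

# Header names taken from the column list above; the exact spelling
# used in the real output file is an assumption.
COLUMNS = ["Nama", "URL", "Alamat", "Kode Pos", "Telepon", "Fax", "Email"]


def write_csv(filename, data):
    """Write a header row followed by one row per directory entry."""
    # newline="" lets the csv module control line endings on every platform.
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(data)
```

Each item in `data` is expected to be a sequence of seven values in the same order as `COLUMNS`.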

## Functions

- `extract_text(element)`: Extracts and cleans text from an HTML element.

- `extract_url(element)`: Extracts the URL from an HTML element.

- `extract_info(section)`: Extracts education, research, publication, and books information from an HTML section.

- `extract_address(section)`: Extracts and parses the address information from an HTML section.

- `write_csv(filename, data)`: Writes data to a CSV file with the specified filename.

- `main()`: The main function that orchestrates the scraping process.
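The two smallest helpers could plausibly be written as in the sketch below. This is an assumption about their behavior based only on the descriptions above, not a copy of the script. The functions rely on the `get_text()` and `get()` methods that BeautifulSoup `Tag` objects provide, and treat a missing element as an empty string.

```python
def extract_text(element):
    """Return the element's text with whitespace runs collapsed, or "" if missing."""
    if element is None:
        return ""
    # get_text() is the BeautifulSoup Tag method that returns all nested text.
    return " ".join(element.get_text().split())


def extract_url(element):
    """Return the element's href attribute, or "" if there is none."""
    if element is None:
        return ""
    # Tag.get() looks up an HTML attribute and returns the default if absent.
    return element.get("href", "")
```

Handling `None` matters because BeautifulSoup's `find()` returns `None` when nothing matches, so a directory entry without, say, a link would otherwise raise an `AttributeError`.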

## License

This script is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
