Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dms-codes/scrape_directory_itb
ITB Directory Scraper This Python script scrapes information from the Institut Teknologi Bandung (ITB) directory and saves it to a CSV file. It uses the BeautifulSoup library to parse the HTML content of the directory pages.
- Host: GitHub
- URL: https://github.com/dms-codes/scrape_directory_itb
- Owner: dms-codes
- Created: 2023-10-10T03:37:35.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-10T03:38:46.000Z (over 1 year ago)
- Last Synced: 2023-10-10T04:28:45.264Z (over 1 year ago)
- Topics: beautifulsoup4, csv, python, requests, scrapping-python, webscrapping
- Language: Python
- Homepage: https://github.com/dms-codes/scrape_directory_itb
- Size: 0 Bytes
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# ITB Directory Scraper
This Python script scrapes information from the [Institut Teknologi Bandung (ITB)](https://www.itb.ac.id) directory and saves it to a CSV file. It uses the BeautifulSoup library to parse the HTML content of the directory pages.
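As a rough illustration of the parsing step, BeautifulSoup can pull a name and profile URL out of directory markup like the following. The HTML snippet and CSS selectors here are made up for demonstration; the real ITB pages use their own structure:

```python
from bs4 import BeautifulSoup

# Hypothetical directory entry; the actual ITB markup may differ.
html = """
<div class="entry">
  <a href="https://example.itb.ac.id/person">Nama Dosen</a>
  <p class="alamat">Jl. Ganesha No. 10, Bandung 40132</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("div.entry a")

name = link.get_text(strip=True)  # visible text of the link
url = link["href"]                # profile URL from the href attribute
print(name, url)
```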
## Prerequisites
Before running the script, make sure you have the following libraries installed:
- requests
- beautifulsoup4

The `csv` module ships with the Python standard library, so only the external libraries need to be installed via pip:
```bash
pip install requests beautifulsoup4
```

## Usage
1. Set the following constants at the beginning of the script to configure your scraping session:
- `BASE_URL`: The base URL of the ITB directory.
- `TIMEOUT`: The timeout for making HTTP requests.
- `HEADERS`: User-Agent headers for the HTTP requests.
2. Run the script by executing the following command:
```bash
python itb_directory_scraper.py
```

The script will scrape data from the ITB directory for each letter of the alphabet and save it to a CSV file named `data_directory_itb.csv`. The CSV file will contain the following columns:
- Nama (Name)
- URL
- Alamat (Address)
- Kode Pos (Postal Code)
- Telepon (Phone Number)
- Fax

## Functions
- `extract_text(element)`: Extracts and cleans text from an HTML element.
- `extract_url(element)`: Extracts the URL from an HTML element.
- `extract_info(section)`: Extracts education, research, publication, and books information from an HTML section.
- `extract_address(section)`: Extracts and parses the address information from an HTML section.
- `write_csv(filename, data)`: Writes data to a CSV file with the specified filename.
- `main()`: The main function that orchestrates the scraping process.
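A minimal sketch of what `write_csv` could look like, assuming `data` is a list of dicts keyed by the column names listed above; the script's actual implementation may differ:

```python
import csv

# Column names taken from the CSV description above.
FIELDNAMES = ["Nama", "URL", "Alamat", "Kode Pos", "Telepon", "Fax"]

def write_csv(filename, data):
    """Write a list of row dicts to a CSV file with a header row."""
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(data)

# Example row with placeholder values.
write_csv("data_directory_itb.csv", [
    {"Nama": "Nama Dosen", "URL": "https://example.itb.ac.id/person",
     "Alamat": "Jl. Ganesha No. 10", "Kode Pos": "40132",
     "Telepon": "022-0000000", "Fax": "-"},
])
```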
## License
This script is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.