https://github.com/sarrabenyahia/webscrap_health_monitoring

# WHOCC ATC-DDD Index WebScraping

This project demonstrates how to scrape the ATC/DDD Index on the [WHOCC website](https://www.whocc.no/atc_ddd_index/) using the `WHOCCAtcDddIndex` class. It retrieves data for each level of the ATC classification and saves the results to separate Excel files. It also includes an example of concatenating these Excel files into a single file.

Example of a scraped document: [click here](https://docs.google.com/spreadsheets/d/1RE7a83teynha3RfWXmQJroAWBkS9KM7B/edit?usp=sharing&ouid=104308617428381034686&rtpof=true&sd=true)
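
To make the "levels" concrete, here is a minimal sketch of how an ATC code nests across the five levels. It relies only on the standard ATC code layout (1, 3, 4, 5, and 7 characters for levels 1 to 5). The helper names `atc_level` and `atc_ancestors` are hypothetical and are not part of this repository.

````python
# Length of an ATC code at each classification level (standard ATC layout).
ATC_LEVEL_LENGTHS = {1: 1, 2: 3, 3: 4, 4: 5, 5: 7}


def atc_level(code: str) -> int:
    """Return the ATC level (1-5) implied by the length of `code`."""
    for level, length in ATC_LEVEL_LENGTHS.items():
        if len(code) == length:
            return level
    raise ValueError(f"not a valid ATC code length: {code!r}")


def atc_ancestors(code: str) -> list[str]:
    """Return the prefixes of `code` from level 1 up to its own level."""
    level = atc_level(code)
    return [code[:ATC_LEVEL_LENGTHS[n]] for n in range(1, level + 1)]


# A10BA02 is the level-5 code for metformin.
print(atc_ancestors("A10BA02"))
# ['A', 'A10', 'A10B', 'A10BA', 'A10BA02']
````

Each prefix corresponds to one level of the index, which is why the scraper can produce one file per level.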

## Prerequisites

- Python 3.11
- Pandas library (`pip install pandas`)
- BeautifulSoup library (`pip install beautifulsoup4`)
- httpx library (`pip install httpx`)
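
As a quick sanity check before running the scraper, the snippet below reports which of the listed prerequisites are not importable. It is only a sketch: the helper `missing_modules` is hypothetical, and the import names are assumed to be `pandas`, `bs4` (the import name of `beautifulsoup4`), and `httpx`.

````python
import importlib.util
import sys

# Import names for the packages listed above; beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ["pandas", "bs4", "httpx"]


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if sys.version_info < (3, 11):
    print("Warning: this project lists Python 3.11 as a prerequisite.")

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
````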

## Usage

1. Clone the repository:

````bash
git clone https://github.com/sarrabenyahia/webscrap_health_monitoring.git
cd webscrap_health_monitoring
````

2. Install the required dependencies:

````bash
pip install -r requirements.txt
````

3. Run the script:

````bash
cd bs4
python act_ddd_script.py
````

The script retrieves data for each level of the ATC classification and saves the results to separate Excel files (`demo_atc_l1.xlsx`, `demo_atc_l2.xlsx`, `demo_atc_l3.xlsx`, `demo_atc_l4.xlsx`, `demo_atc_l5.xlsx`). It also concatenates these files into a single Excel file named `concatenated_atc_data.xlsx`.
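
The concatenation step can be sketched with pandas as follows. This is an illustrative version, not necessarily how `act_ddd_script.py` does it: the function names and the column names in the demo data are assumptions. Reading and writing `.xlsx` files with pandas typically also requires an engine such as `openpyxl`.

````python
import pandas as pd


def concatenate_frames(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Stack per-level tables vertically and renumber the rows."""
    return pd.concat(frames, ignore_index=True)


def concatenate_excel_files(paths: list[str], output_path: str) -> None:
    """Read each workbook, stack them, and write one combined workbook."""
    frames = [pd.read_excel(path) for path in paths]
    # index=False keeps the row index out of the spreadsheet.
    concatenate_frames(frames).to_excel(output_path, index=False)


# Tiny in-memory demonstration; the column names are made up for this example.
level_1 = pd.DataFrame({"atc_code": ["A"], "name": ["ALIMENTARY TRACT AND METABOLISM"]})
level_2 = pd.DataFrame({"atc_code": ["A10"], "name": ["DRUGS USED IN DIABETES"]})
combined = concatenate_frames([level_1, level_2])
print(combined)

# With the files produced by the script, a call might look like:
# concatenate_excel_files(
#     [f"demo_atc_l{n}.xlsx" for n in range(1, 6)],
#     "concatenated_atc_data.xlsx",
# )
````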

## File Descriptions

- `whocc.py`: Contains the `WHOCCAtcDddIndex` class, which performs the web scraping and data retrieval.
- `act_ddd_script.py`: The main script, which uses the `WHOCCAtcDddIndex` class to scrape the data and save it to Excel files.
- `demo_atc_l1.xlsx`: Excel file containing data for Level 1 of the ATC classification.
- `demo_atc_l2.xlsx`: Excel file containing data for Level 2 of the ATC classification.
- `demo_atc_l3.xlsx`: Excel file containing data for Level 3 of the ATC classification.
- `demo_atc_l4.xlsx`: Excel file containing data for Level 4 of the ATC classification.
- `demo_atc_l5.xlsx`: Excel file containing data for Level 5 of the ATC classification.
- `concatenated_atc_data.xlsx`: Excel file created by concatenating the Level 1 to Level 5 Excel files.

## License
This project is licensed under the MIT License. See the LICENSE file for details.

Feel free to modify and adapt the script according to your requirements.

## Acknowledgements
Special thanks to the World Health Organization Collaborating Centre for Drug Statistics Methodology (WHOCC) for providing the ATC DDD Index data.

## Note
Web scraping should be used responsibly and in accordance with the website's terms of service. Always be mindful of not overloading the target website with too many requests.
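
One simple way to avoid overloading the site is to enforce a minimum delay between requests. The sketch below is an assumption about how that could be done, not a description of what `whocc.py` does: the `Throttle` class is hypothetical, and the one-second interval in the usage comment is an arbitrary example value.

````python
import time
from typing import Callable


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: float | None = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record the moment the next request is allowed to go out.
        self._last = now + delay
        return delay


# Possible usage with httpx (the URL list is a placeholder):
# throttle = Throttle(min_interval=1.0)
# for url in urls:
#     throttle.wait()
#     response = httpx.get(url)
````

The clock and sleep functions are injectable so the behaviour can be checked without real waiting.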