https://github.com/sarrabenyahia/webscrap_health_monitoring
WHOCC ATC-DDD Index WebScraping
https://github.com/sarrabenyahia/webscrap_health_monitoring
atc beautifulsoup4 ddd dosage index scrapy webscraping whocc
Last synced: 5 months ago
JSON representation
WHOCC ATC-DDD Index WebScraping
- Host: GitHub
- URL: https://github.com/sarrabenyahia/webscrap_health_monitoring
- Owner: sarrabenyahia
- License: mit
- Created: 2023-04-04T11:30:50.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-12T16:02:22.000Z (over 1 year ago)
- Last Synced: 2025-05-13T11:42:21.313Z (7 months ago)
- Topics: atc, beautifulsoup4, ddd, dosage, index, scrapy, webscraping, whocc
- Language: Python
- Homepage:
- Size: 3.13 MB
- Stars: 14
- Watchers: 1
- Forks: 7
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# WHOCC ATC-DDD Index WebScraping
This project demonstrates web scraping of the ATC DDD Index [WHOCC website](https://www.whocc.no/atc_ddd_index/) using the `WHOCCAtcDddIndex` class. It retrieves data for different levels of the ATC classification and saves the results into separate Excel files. Additionally, it provides an example of concatenating these Excel files into a single file.
Example of webscrapped document : [click here](https://docs.google.com/spreadsheets/d/1RE7a83teynha3RfWXmQJroAWBkS9KM7B/edit?usp=sharing&ouid=104308617428381034686&rtpof=true&sd=true)
## Prerequisites
- Python 3.11
- Pandas library (`pip install pandas`)
- BeautifulSoup library (`pip install beautifulsoup4`)
- httpx library (`pip install httpx`)
## Usage
1. Clone the repository:
````bash
git clone https://github.com/sarrabenyahia/ATC-DDD-Web-Scraping.git
cd webscrap_health_monitoring
````
2. Install the required dependencies:
````
pip install -r requirements.txt
````
3. Run the script:
````
cd bs4
python act_ddd_script.py
````
The script will retrieve data for different levels of the ATC classification and save the results into separate Excel files (demo_atc_l1.xlsx, demo_atc_l2.xlsx, demo_atc_l3.xlsx, demo_atc_l4.xlsx, demo_atc_l5.xlsx). It will also concatenate these files into a single Excel file named concatenated_atc_data.xlsx.
## File Descriptions
- whocc.py: Contains the WHOCCAtcDddIndex class that performs the web scraping and data retrieval.
- act_ddd_script.py: The main script that utilizes the WHOCCAtcDddIndex class to scrape the data and save it to Excel files.
- demo_atc_l1.xlsx: Excel file containing data for the Level 1 of the ATC classification.
- demo_atc_l2.xlsx: Excel file containing data for the Level 2 of the ATC classification.
- demo_atc_l3.xlsx: Excel file containing data for the Level 3 of the ATC classification.
- demo_atc_l4.xlsx: Excel file containing data for the Level 4 of the ATC classification.
- demo_atc_l5.xlsx: Excel file containing data for the Level 5 of the ATC classification.
- concatenated_atc_data.xlsx: Excel file that is created by concatenating the Level 1 to Level 5 Excel files.
## License
This project is licensed under the MIT License. See the LICENSE file for details.
Feel free to modify and adapt the script according to your requirements.
## Acknowledgements
Special thanks to the World Health Organization Collaborating Centre for Drug Statistics Methodology (WHOCC) for providing the ATC DDD Index data.
## Note
Web scraping should be used responsibly and in accordance with the website's terms of service. Always be mindful of not overloading the target website with too many requests.