Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yurnerosk/web_scraper_challenge
Web scraper I did for interview. Now, contributions welcome!!
- Host: GitHub
- URL: https://github.com/yurnerosk/web_scraper_challenge
- Owner: Yurnerosk
- License: apache-2.0
- Created: 2024-08-10T19:47:26.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-08-22T17:13:30.000Z (about 1 month ago)
- Last Synced: 2024-09-25T08:57:11.152Z (10 days ago)
- Topics: python3, robocorp, selenium-webdriver
- Language: Python
- Homepage:
- Size: 1.64 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# !!! WEB SCRAPER CHALLENGE !!!
## Browser Automation with Selenium and Chrome, designed for Robocorp Control Room

Hello, welcome to my web scraper repository!
This project has 3 parts:
- configurations_class.py: responsible for importing Robocorp Control Room commands;
- excel_class.py: responsible for saving the Excel file;
- web_scraper_r_p_a.py: responsible for general navigation and data gathering.

## Challenge
This challenge consists of using the website [LA Times](https://www.latimes.com/) to scrape some news. The process has to follow certain instructions:

- Open the link;
- Enter a "search_phrase";
- On the result page:
- select the desired topics from "sections"
- sort by newest news
- if possible, filter by the last N months; if N=0, treat it as N=1.
- Get the values: title, date, and description;
- Check whether money is mentioned, and count occurrences of the search phrase in the title and description;
- Store info in Excel file;
- Gather the pictures and their names.

**Note:** This website has no filter for dates, so a loop was created to mimic that function. Needless to say, this was an interesting exercise and brought a lot of learning material.
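The money check, phrase counting, and the date-window loop described in the note above can be sketched as plain helpers. This is a minimal sketch; the function names and the money-matching pattern are illustrative assumptions, not the repo's actual code:

```python
import re
from datetime import datetime, timedelta

# Illustrative pattern: "$2,000", "$11.50", "15 dollars", "15 USD"
MONEY_PATTERN = re.compile(
    r"\$\d[\d,]*(\.\d+)?"
    r"|\b\d[\d,]*(\.\d+)?\s?(dollars|USD)\b",
    re.IGNORECASE,
)

def contains_money(*texts):
    """Return True if any of the given texts mentions a monetary amount."""
    return any(MONEY_PATTERN.search(t) for t in texts)

def count_phrase(phrase, *texts):
    """Count case-insensitive occurrences of the search phrase."""
    phrase = phrase.lower()
    return sum(t.lower().count(phrase) for t in texts)

def within_last_months(article_date, months_number):
    """Mimic the missing date filter: keep articles newer than N months.
    If N == 0, treat it as 1, as the challenge requires."""
    months = max(months_number, 1)
    cutoff = datetime.now() - timedelta(days=30 * months)  # approximate month
    return article_date >= cutoff
```

In the real robot these helpers would run inside the pagination loop, stopping once an article falls outside the date window.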
## Robocorp Input example
This project handles as many sections as you want (as long as they exist in the database).
Available sections:
"world & nation", "politics", "business", "opinion", "entertainment & arts", "archives", "travel & experiences", "science & medicine", "climate & environment", "books", "food", "movies", "sports", "television", "autos", "music", "greenspace", "letters to the editor"
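Requested sections can be checked against this list before navigation starts, so a typo fails fast instead of breaking mid-run. A small sketch (the function name is illustrative, not from the repo):

```python
# The sections supported by the scraper, as listed in the README.
AVAILABLE_SECTIONS = {
    "world & nation", "politics", "business", "opinion",
    "entertainment & arts", "archives", "travel & experiences",
    "science & medicine", "climate & environment", "books", "food",
    "movies", "sports", "television", "autos", "music", "greenspace",
    "letters to the editor",
}

def validate_sections(requested):
    """Split requested sections into (known, unknown) so unknown ones
    can be reported instead of silently failing during navigation."""
    requested = [s.lower().strip() for s in requested]
    known = [s for s in requested if s in AVAILABLE_SECTIONS]
    unknown = [s for s in requested if s not in AVAILABLE_SECTIONS]
    return known, unknown
```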
```
{
    "search_phrase": "climate",
    "sections": [
        "world & nation"
    ],
    "months_number": 1
}
```
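In a Control Room run, a payload like the one above typically arrives via a work item. Normalizing it into defaults can be kept as a pure function; this sketch assumes the key names from the example above, and `parse_input` itself is a hypothetical helper:

```python
def parse_input(payload):
    """Normalize the Control Room payload, applying the challenge defaults.
    Keys follow the input example above."""
    return {
        "search_phrase": payload.get("search_phrase", ""),
        "sections": [s.lower() for s in payload.get("sections", [])],
        # N = 0 is treated as 1, per the challenge rules
        "months_number": max(int(payload.get("months_number", 1)), 1),
    }
```

Keeping this separate from the work-item retrieval makes the defaults easy to test without a Robocorp environment.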
## Dependencies
- RPA Framework
- Robocorp
- Selenium

All of the required dependencies are listed in [conda.yaml](https://github.com/Yurnerosk/web_scraper_challenge/blob/main/conda.yaml).
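For reference, a Robocorp `conda.yaml` for these dependencies usually looks roughly like this. The versions and exact package names below are illustrative assumptions; the repo's actual file is the source of truth:

```yaml
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
  - pip:
      - rpaframework   # bundles Selenium-based browser keywords
```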
## Pending solutions:
There are some points I need to refine in this code, based on the interview feedback:

- The code needs modularization. It is essentially one-dimensional, which can be fixed with abstract methods (something about "separation of concerns").
- I used absolute XPaths for some selectors, so the code might be fragile (did I really use absolute or relative?).
- The code violates the Single Responsibility Principle: it combines unrelated methods, so it needs refactoring (e.g. via inheritance).

I also don't know how to use git well yet, so I hope I can make this easier for everybody. Contributions welcome!! :D
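The modularization feedback could be addressed with abstract base classes that separate navigation from persistence. A minimal sketch; the class names (`NewsSource`, `ResultsWriter`) are illustrative, not the repo's actual classes:

```python
from abc import ABC, abstractmethod

class NewsSource(ABC):
    """Interface for the navigation/scraping layer (web_scraper_r_p_a.py)."""
    @abstractmethod
    def search(self, phrase: str, sections: list) -> list:
        ...

class ResultsWriter(ABC):
    """Interface for the persistence layer (excel_class.py)."""
    @abstractmethod
    def save(self, articles: list) -> None:
        ...

class InMemoryWriter(ResultsWriter):
    """Trivial writer used here to demonstrate the seam; a real
    implementation would wrap the Excel export."""
    def __init__(self):
        self.rows = []

    def save(self, articles):
        self.rows.extend(articles)

def run_pipeline(source: NewsSource, writer: ResultsWriter, phrase, sections):
    """Orchestration depends only on the interfaces, not on Selenium or Excel."""
    writer.save(source.search(phrase, sections))
```

With this split, each class has one responsibility and the Selenium layer can be swapped out in tests.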