{"id":20766653,"url":"https://github.com/elijah-1994/data-collection-pipeline","last_synced_at":"2026-04-13T10:32:36.683Z","repository":{"id":63416509,"uuid":"554877319","full_name":"Elijah-1994/data-collection-pipeline","owner":"Elijah-1994","description":"Using selenium webdriver and python methods to scrape and store data from books in the manga section of waterstones.com","archived":false,"fork":false,"pushed_at":"2022-11-25T15:22:02.000Z","size":8306,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-25T22:31:06.582Z","etag":null,"topics":["docker","python-3","selenium-webdriver"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Elijah-1994.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-10-20T14:51:55.000Z","updated_at":"2022-11-26T12:35:43.000Z","dependencies_parsed_at":"2022-11-18T13:45:29.421Z","dependency_job_id":null,"html_url":"https://github.com/Elijah-1994/data-collection-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Elijah-1994/data-collection-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2Fdata-collection-pipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2Fdata-collection-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2Fdata-collection-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elija
h-1994%2Fdata-collection-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Elijah-1994","download_url":"https://codeload.github.com/Elijah-1994/data-collection-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Elijah-1994%2Fdata-collection-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31749095,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T09:16:15.125Z","status":"ssl_error","status_checked_at":"2026-04-13T09:16:05.023Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","python-3","selenium-webdriver"],"created_at":"2024-11-17T11:25:24.884Z","updated_at":"2026-04-13T10:32:36.662Z","avatar_url":"https://github.com/Elijah-1994.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Data Collection Pipeline Project\n\u0026nbsp;\n\nThe aim of this project is to utilise selenium webdriver and python methods to scrape text and image data from web html links of a chosen website and upload the python script and associated data onto dockerhub. The first step is to choose a website to scrape. 
It was decided to scrape text and image data of manga books on www.waterstones.com/ \u003cbr /\u003e\n\u0026nbsp;\n\n\n## Milestone 1 - Prototype finding the individual page for each entry \n\u0026nbsp;\n\n__Setting up selenium__ \n\nThe first step is to install chromedriver for google chrome. The chromedriver is added to the relevant python path and selenium is installed using pip install selenium. Now the selenium module can be imported into the python script.\n\u0026nbsp;\n\n![Selenium module](project_images/Milestone_1-Selenium-module.PNG)\n\n*Figure 1 - selenium import in python*\n\n\u0026nbsp;\n\n__Class WaterstonesScrapper.__\n\nA class is coded which contains the various methods needed to scrape and store the required data. The def __init__ method was created in order to initialize each instance of the class. In order to use selenium to connect to a website, the driver returned by the __webdriver.Chrome() method__ is stored in the self.driver variable, which allows selenium to control the google chrome browser. The __self.driver.get() method__  is used to navigate to waterstones.com \n\n\u0026nbsp;\n\n__accept_cookies() method:__\n\nOnce selenium reaches the waterstones homepage there is an accept cookies button which needs to be clicked in order for the scraping process to work. The __accept_cookies method__ contains the code to complete this task. The first step is to inspect the html web elements on the waterstones website using the browser developer tools (ctrl+shift+c in chrome) to find the xpath of the accept cookies button. \n\n\u0026nbsp;\n\n![Alt text](project_images/Milestone_1-accept_cookies_html.PNG)\n*Figure 2 - html xpath of the accept cookies button*\n\n \u0026nbsp;\n\nThe relative xpath was located and copied into the \u0026ensp; __self.driver.find_element method__ \u0026ensp; which allows the driver to locate the element. 
The \u0026ensp; __accept_cookies_button.click() method__  \u0026ensp;allows the webdriver to click on the accept cookies button on the waterstones website. The __time.sleep method__ is called afterwards so that the webdriver waits a couple of seconds, which makes the website less likely to flag the user as a bot.\n\n![Alt text](project_images/Milestone_1-accept_cookies_method.PNG)\n*Figure 3 - accept cookies button method*\n\n\u0026nbsp;\n\n__navigate_to_manga_page_1 method__\n\nThis method is coded in order for the webdriver to navigate to the first page of the see more manga section. As with the __accept cookies method__ the first step is to inspect the html elements and find the relevant xpath in order to complete this task. \n\n\u0026nbsp;\n\n![Alt text](project_images/Milestone_1-inspect_manga_section.PNG)\n*Figure 4 - manga section from page html elements*\n\n\u0026nbsp;\n\nOn inspection the html elements were contained within an anchor (a) tag which includes a hyperlink reference 'href'. The html elements within the first page of the see more manga section were located within the html class='name', hence in order to store the hyperlinks the relative xpath is placed into the \u0026ensp; __find_elements method__ \u0026ensp; which returns the matching web elements. In order to extract the html links a for loop was coded which iterates through each web element and calls the \u0026ensp; __get_attribute('href') method__.  \u003cbr /\u003e\n\u0026nbsp;\n\nEach link was then stored in a list. An if statement was coded to select the correct html link from this list, and that link is returned by the method as a string.\n\n\u0026nbsp;\n\n\n![Alt text](project_images/Milestone_1-navigate_to_manga_page_1.PNG)\n*Figure 5 - navigate_to_manga_page_1 method*\n\n\u0026nbsp;\n\n__get_links_manga_page_1 method__\n\u0026nbsp;\n\nThe purpose of this method is to extract the html link of each manga book on page 1 and store them within a list. 
The html elements on page 1 were inspected to locate the html tags which store the href to each manga book on page 1. Once located the relative xpath is copied into the __find_elements method__.\n\n\u0026nbsp;\n![Alt text](project_images/Milestone_1-inspect_manga_section_page_1.PNG)\n*Figure 6 - html elements on page 1*\n\u0026nbsp;\n\nThe method calls the __navigate_to_manga_page_1__ method which returns the html link of the see more manga section page 1 and the \u0026ensp; __driver.get method__\u0026ensp; is called so that the webdriver navigates to the first page. The __find_elements method__ is called to retrieve the web elements and then a for loop is coded in which the \u0026ensp; __get_attribute('href') method__ \u0026ensp; is called to extract the html link for each book on page 1, which is appended to a list. The list along with the current url of page 1 is returned in a tuple.\n\n\n\u0026nbsp;\n![Alt text](project_images/Milestone_1-get_links_manga_page_1.PNG)\n*Figure 7 - get_links_manga_page_1 method*\n\n\u0026nbsp;\n\n__get_links_manga_page_2_to_page_5 method__\n\n\u0026nbsp;\n\nIn order to expand the data extracted for this project it was decided to also scrape data from pages 2 to 5 in the see more manga section. The purpose of this method is to store the html links of the books from pages 2 to 5 and append them to the list of html links extracted from page 1. The first step was to call the \u0026ensp;__get_links_manga_page_1__\u0026ensp; method which returns the url of the see more manga section page 1.\n\n\u0026nbsp;\n\nOn inspection the urls for pages 2 to 5 were similar to page 1 (minus the page number), therefore the string of the url was adjusted to 'https://www.waterstones.com/category/graphic-novels-manga/manga/page' and a for loop was coded to update the url with the page numbers from 2 to 5. These urls were saved in a list. 
The same methods to extract the html links were coded and the html links were appended to the list which contains the html links from page 1.\n\n![Alt text](project_images/Milestone_1-get_links_manga_page_2_to_5.PNG)\n*Figure 8 - get_links_manga_page_2_to_page_5 method*\n\u0026nbsp;\n\n__scrapper  method__ \n\nThis method calls the methods coded for milestone 1. This method is then called in an if __name__ == \"__main__\"  block.\n\n\u0026nbsp;\n## Milestone 2 - Retrieve data from details page\n\u0026nbsp;\n\n__create_directory method__\n\nThis method creates a folder directory to save the images scraped from each book and the corresponding text data. This is done by importing os and applying the\u0026ensp; __os.path.join method__.\n\n\n![Alt text](project_images/Milstone-2%20-%20create%20directory.PNG)\n*Figure 9 - create_directory method*\n\n\u0026nbsp;\n\n\n__scrape_links_and_store_text_image_data method__\n\n__Text data__\n\u0026nbsp;\n\nThis method uses a for loop which first scrapes the text data for each book and stores the data within a dictionary. As with the methods mentioned in milestone 1 the first step is to inspect the html elements of each link to find the xpath of the relevant data and place the xpaths into the \u0026ensp;__find_elements method__\u0026ensp;.  The text data includes each book's ISBN number, author, book format, and other information. Each dictionary is appended to a list. 
\n\n\u0026nbsp;\n\n![Alt text](project_images/Milestone_2%20-scrape_links_and_store_text_and_image_data.PNG)\n*Figure 10 - scraping and saving text data*\n\n\nEach book is also assigned a unique id number (generated with from uuid import uuid4 and calling the \u0026ensp;__str(uuid4()) method__\u0026ensp;). This id number is also used to label each book image along with a timestamp (generated with import time and from datetime import datetime and calling the \u0026ensp;__datetime.now()__ \u0026ensp; and \u0026ensp;__time.strftime(\"%Y-%m-%d\")__\u0026ensp; methods).\n\n__Image data__\n\u0026nbsp;\n\nThe method also finds the html image element of each book and calls the \u0026ensp;__get_attribute('src') method__\u0026ensp; to retrieve the src link for each image and then the \u0026ensp;__requests.get().content method__\u0026ensp; to retrieve the contents of each image (bytes). A context manager is coded in order to write each book image into the correct directory. This method returns the list which contains the dictionaries of the text data for each book.\n\n\u0026nbsp;\n\n![Alt text](project_images/Milestone_2%20-scrape_links_and_store_text_and_image_data_2.PNG)\n*Figure 11 - scraping and saving image data*\n\n\n\u0026nbsp;\n\n## Milestone 3 - Documentation and testing\n\n\u0026nbsp;\n\n__Refactoring__\n\nThe first step was to review and refactor the code written in milestone 2. This included:\n\n* Renaming methods and variables so that they are clear and concise to anyone who reads the script.\n* Ensuring that the appropriate methods were made private.\n* Re-ordering the imports required for the code to run into alphabetical order.\n* Adding docstrings to methods.\n\n These improvements make the code clearer and more user friendly.\n\n\u0026nbsp;\n\n__Unit testing__\n\nThe second step was to set up unit tests for each public method. 
This was done by creating a test.py file which contains the \u0026ensp;__class producttestcase__\u0026ensp; whose methods test each public method. The main purpose of the tests is to ensure each public method returns the expected data type (string, list, dictionary) and to ensure the scraper is correctly scraping all the books from each page. This confirms that the code is processing the correct data as expected. Each unit test passed for each method.\n\n\n__Project management__\n\nThe last step is to organise and add the relevant files which will ensure the script is packaged correctly. This includes:\n\n* Renaming the python script as 'WaterstonesScrapper.py' and placing the script into a project folder.\n* Placing the test file into a test folder.\n* Creating a requirements.txt file which contains the external dependencies and versions.\n* Creating a setup.py and setup.cfg which contain the metadata of the project and the packages which need to be installed.\n* Creating a README.md file \n* Creating a license file which describes the license of the project.\n* Creating a gitignore file.\n\n## Milestone 4 - Containerising the scraper\n\n\u0026nbsp;\n\n__Headless mode__\n\nAfter confirming the unit tests still run, the next step was to run the scraper file in headless mode without the GUI. This was done so that the script could be run correctly in docker. The correct\u0026ensp;__options arguments__\u0026ensp; were coded into the __init method__ to allow the headless mode to work.\n\n![Alt text](project_images/Milstone%204%20-%20options%20arguments.PNG)\n*Figure 11 - Options arguments*\n\n\n\u0026nbsp;\n\n__Docker image__\n\nIn order to build the docker image, a docker file which contains the instructions on how to build the image is first created. A docker account was also created in order to upload the image file. 
The docker desktop app was also downloaded.\n\nThe docker file contains the following:\n\n* FROM - The base image for the docker image (python).\n* COPY - Copies everything in the docker file directory (requirements.txt, scraper folder) into the container.\n* RUN -  Installs the required dependencies for the script to run. \n* CMD - Specifies the instruction that is to be executed when a Docker container starts.\n\n\u0026nbsp;\n\n![Alt text](project_images/Milstone%204%20-%20docker%20file.PNG)\n*Figure 12 - Dockerfile*\n\n\n\u0026nbsp;\n\n\nThe next step is to build the image using the docker build command.\n\n\u0026nbsp;\n\n__Docker container__\n\n\u0026nbsp;\n\nNow that the docker image is built the next step is to run the docker container using the docker run command. The script within the container ran with no issues. The image is then pushed onto docker hub.\n\n\u0026nbsp;\n\n## Milestone 5 - Set up a CI/CD pipeline for your docker image\n\n\u0026nbsp;\n\nIn order to fully automate the docker image build and container run, it was first required to set up GitHub Actions on the repository. \n\n__Create repository__\n\u0026nbsp;\n\nThe first step is to go to the actions section in the repository on github and create two GitHub Actions secrets. \n\nThe first secret is called DOCKER_HUB_USERNAME which contains the name of the dockerhub account created and the second is called DOCKER_HUB_ACCESS_TOKEN which contains a Personal Access Token (PAT) generated on dockerhub.\n\n__Set up the workflow__\n\u0026nbsp;\n\nThe next step is to set up the GitHub Actions workflow for building and pushing the image to Docker Hub. This is done by going to the actions section on the repo and selecting set up workflow which creates a GitHub Actions workflow file in yaml format.\n\n\u0026nbsp;\n\n__Define the workflow steps__\n\u0026nbsp;\n\nThe next step includes setting up the build context within the yaml file. 
This contains all the information needed to copy the files mentioned in the dockerfile, then build an image and automatically push it to docker hub.\n\nThe last step is to commit the changes in the repo, which automatically starts the workflow. In order to make sure the workflow worked, the image pushed on to docker hub was downloaded and a container was created and run to ensure the script ran correctly.  A docker compose file which contains commands to automate running the containers was also created.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijah-1994%2Fdata-collection-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Felijah-1994%2Fdata-collection-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Felijah-1994%2Fdata-collection-pipeline/lists"}