{"id":26205088,"url":"https://github.com/mauropelucchi/europython2021","last_synced_at":"2026-05-10T16:45:08.205Z","repository":{"id":129314034,"uuid":"388242155","full_name":"mauropelucchi/europython2021","owner":"mauropelucchi","description":"Data Ingestion and Big Data @ EUROPYTHON 2021","archived":false,"fork":false,"pushed_at":"2021-07-22T22:33:09.000Z","size":15359,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-13T09:44:50.170Z","etag":null,"topics":["europython2021","notebook","scraping","selenium"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mauropelucchi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-21T20:48:44.000Z","updated_at":"2021-11-02T12:17:33.000Z","dependencies_parsed_at":"2023-07-28T10:00:57.761Z","dependency_job_id":null,"html_url":"https://github.com/mauropelucchi/europython2021","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mauropelucchi/europython2021","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauropelucchi%2Feuropython2021","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauropelucchi%2Feuropython2021/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauropelucchi%2Feuropython2021/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauropelucchi%2Feuropy
thon2021/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mauropelucchi","download_url":"https://codeload.github.com/mauropelucchi/europython2021/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mauropelucchi%2Feuropython2021/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264341083,"owners_count":23593294,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["europython2021","notebook","scraping","selenium"],"created_at":"2025-03-12T04:33:49.558Z","updated_at":"2026-05-10T16:45:03.165Z","avatar_url":"https://github.com/mauropelucchi.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Ingestion and Big Data @ EUROPYTHON 2021\n## Build a dataset from zero to solid\n\n![](https://raw.githubusercontent.com/mauropelucchi/europython2021/main/img/logo.png)\n\nWeb scraping, crawling and APIs are the first steps in retrieving information to use for analysis\nand to start a new business.\nIn this tutorial I'll show you how to use Python to set up scraping and crawling processes,\nhow to simulate user navigation and browser behavior with a ghost browser, and how to hook up and use data APIs.\nI will also try to explain the technical and ethical aspects that we have to consider when we approach these kinds of challenges.\n\n## About\n\nThis repo contains slides and code for Mauro Pelucchi's \"Data Ingestion and Big Data\" @ EuroPython 2021.\n\n- [Access 
slides](https://github.com/mauropelucchi/europython2021/blob/main/slide/EUROPYTHON2021_BigData_Data_Ingestion.pdf)\n- [Access notebook (Colab)](https://github.com/mauropelucchi/europython2021/blob/main/notebook/EUROPYTHON_2021_Web_Scraping_with_Selenium.ipynb)\n\n## Running the notebook\n\nI **highly** recommend using Colab, but if you want to run the notebook locally, follow the steps below:\n\n1. Download ChromeDriver from [https://chromedriver.chromium.org/downloads](https://chromedriver.chromium.org/downloads)\n2. Import these libraries\n\n```\nimport sys\nimport logging\nfrom selenium.webdriver.remote.remote_connection import LOGGER\nLOGGER.setLevel(logging.WARNING)\nfrom selenium import webdriver\nfrom tqdm.notebook import tqdm\nimport pandas\nimport json\nimport pprint\n```\n3. Create your ChromeDriver\n```\nchrome_options = webdriver.ChromeOptions()  # must be defined before it is passed in\nwd = webdriver.Chrome('\u003cpath where you stored chromedriver\u003e/chromedriver', options=chrome_options)\n```\n\n# MIT License\n\nCopyright (c) 2021 Mauro Pelucchi\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmauropelucchi%2Feuropython2021","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmauropelucchi%2Feuropython2021","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmauropelucchi%2Feuropython2021/lists"}