{"id":17363461,"url":"https://github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python","last_synced_at":"2025-02-26T12:32:57.737Z","repository":{"id":222356974,"uuid":"757012393","full_name":"codingforentrepreneurs/Smarter-Web-Scraping-with-Python","owner":"codingforentrepreneurs","description":"Leverage modern open-source tools to create better web scraping workflows. ","archived":false,"fork":false,"pushed_at":"2024-02-29T19:42:28.000Z","size":107,"stargazers_count":24,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-10-10T05:38:37.881Z","etag":null,"topics":["apple-itunes-search-api","brightdata","gpt","gpt3","hacker-news","itunes-podcast-api","llama2","llm","ollama","open-source","openai","podcast","proxy-scraper","python3","selenium","whisper"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/codingforentrepreneurs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-13T18:27:07.000Z","updated_at":"2024-10-02T09:46:01.000Z","dependencies_parsed_at":"2024-02-29T20:48:05.325Z","dependency_job_id":null,"html_url":"https://github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python","commit_stats":{"total_commits":27,"total_committers":2,"mean_commits":13.5,"dds":"0.11111111111111116","last_synced_commit":"4479baa084316251b0b45dc1f9cefe76b6eee908"},"previous_names":["codingforentrepreneurs/web-scraping-with-python-selenium-and-open-source-llm","codingforentrepreneurs/smarter-web-scraping-with
-python"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingforentrepreneurs%2FSmarter-Web-Scraping-with-Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingforentrepreneurs%2FSmarter-Web-Scraping-with-Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingforentrepreneurs%2FSmarter-Web-Scraping-with-Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/codingforentrepreneurs%2FSmarter-Web-Scraping-with-Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/codingforentrepreneurs","download_url":"https://codeload.github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219842563,"owners_count":16556526,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-itunes-search-api","brightdata","gpt","gpt3","hacker-news","itunes-podcast-api","llama2","llm","ollama","open-source","openai","podcast","proxy-scraper","python3","selenium","whisper"],"created_at":"2024-10-15T20:00:56.744Z","updated_at":"2025-02-26T12:32:57.659Z","avatar_url":"https://github.com/codingforentrepreneurs.png","language":"Jupyter Notebook","readme":"# Smarter Web Scraping with Python Selenium and Llama2\nGenerate podcast clips related to daily top submissions on Hacker News via web scraping with Python \u0026 Selenium, generative AI with Ollama and Llama2, 
transcript generation with OpenAI Whisper, iTunes Podcast Search, and more.\n\nComing soon\n\n\n## Requirements\n- Python 3.10 and up\n- A [Bright Data Account](https://brdta.com/justin) (includes $25 credit)\n- ffmpeg (required for transcribing audio with OpenAI Whisper)\n\n\n### A Proxy-based Web Scraping approach\nIn this repo, we use a web scraping proxy service from Bright Data. Using a proxy service makes our requests more reliable. The code for the Selenium-based remote connection is in [src/helpers/brightdata.py](./src/helpers/brightdata.py).\n\n#### With Remote Proxy\nour computer -\u003e request -\u003e proxy -\u003e web server -\u003e proxy -\u003e response -\u003e our computer\n\n#### Without Remote Proxy\nour computer -\u003e request -\u003e web server -\u003e response -\u003e our computer\n\n### Usage\n```python\n# from 'src/2 - Connection Sample.ipynb'\nfrom selenium.webdriver import Remote, ChromeOptions\n\n# import this function\nfrom helpers.brightdata import get_sbr_connection\n\n# build the remote connection (assumes the helper takes no arguments)\nsbr_connection = get_sbr_connection()\n\noptions = ChromeOptions()\n\n# options.headless = True # old method\noptions.add_argument(\"--headless=new\") # new method\n\nurl = 'https://news.ycombinator.com'\n\nwith Remote(sbr_connection, options=options) as driver:\n    driver.get(url)\n    print(driver.page_source)\n```\n\n\n\n## Getting Started\n\n### Clone project\n```\nmkdir -p ~/dev/smarter-scraping\ncd ~/dev/smarter-scraping\ngit clone https://github.com/codingforentrepreneurs/Smarter-Web-Scraping-with-Python .\n```\n\n### (Optional) Working through the course?\nUse the `course_start` branch with:\n\nmac/linux\n```bash\ngit checkout course_start\nrm -rf .git\ngit init\n```\n\nwindows\n```powershell\ngit checkout course_start\nRemove-Item .git -Recurse -Force\ngit init\n```\n\n## Create a Python Virtual Environment\n\n```bash\ncd ~/dev/smarter-scraping # or where you cloned the repo\n```\n\nmac/linux\n```bash\npython3 -m venv venv\n```\n\nwindows\n```powershell\nc:\\Python311\\python.exe -m venv 
venv\n```\n\n### Activate the virtual environment\nAlways activate your environment!\n\n```bash\ncd ~/dev/smarter-scraping # or where you cloned the repo\n```\n\nmac/linux\n```bash\nsource venv/bin/activate\n```\n\nwindows\n```powershell\n.\\venv\\Scripts\\activate\n```\n\nIf done correctly, your command line should start with `(venv)`.\n\n\n### Install requirements\nWith the virtual environment activated (your prompt shows `(venv)`), run:\n\n```bash\n(venv) python -m pip install pip --upgrade\n(venv) python -m pip install -r requirements.txt\n```\n\n### Implement Environment Variables with `dotenv`\n\nmac/linux\n```bash\ncp sample-env-file .env\n```\n\nwindows\n```powershell\nCopy-Item .env.sample -Destination .env\n```\n\nBe sure to add your Bright Data proxy information:\n- `BRIGHT_DATA_USERNAME`\n- `BRIGHT_DATA_PASSWORD`\n- `BRIGHT_DATA_HOST`\n\nAdd the Ollama settings too (for running Llama2 as a drop-in replacement for OpenAI):\n- `OPENAI_BASE_URL=http://localhost:11434/v1`\n- `OPENAI_API_KEY=ollama`\n- `OPENAI_COMPLETION_MODEL=llama2`\n\n### Loading Environment Variables\n\nWith code that lives inside the `src/` directory, you can import the `helpers` module to load your environment variables. 
\n\nWe created a simple function to extend the incredible [python-decouple](https://pypi.org/project/python-decouple/) package (it's in [src/helpers/env.py](./src/helpers/env.py)):\n\n```python\nimport helpers\n\nMY_VAR = helpers.config('MY_VAR', default=\"Not set\", cast=str)\n```\n\n\n### Run Jupyter\nExplore the notebooks!\n```\njupyter notebook\n```\n\n","funding_links":[],"categories":["Jupyter Notebook"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodingforentrepreneurs%2FSmarter-Web-Scraping-with-Python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcodingforentrepreneurs%2FSmarter-Web-Scraping-with-Python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcodingforentrepreneurs%2FSmarter-Web-Scraping-with-Python/lists"}