{"id":18577073,"url":"https://github.com/brianlesko/web-scraper","last_synced_at":"2025-10-17T15:34:16.666Z","repository":{"id":207863124,"uuid":"720244394","full_name":"BrianLesko/web-scraper","owner":"BrianLesko","description":"a web scraping app, paste a URL and download the text or links on the website","archived":false,"fork":false,"pushed_at":"2024-06-23T19:53:06.000Z","size":7134,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-17T15:23:17.675Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BrianLesko.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-17T22:54:46.000Z","updated_at":"2024-11-15T18:03:26.000Z","dependencies_parsed_at":"2024-01-18T01:06:03.815Z","dependency_job_id":"b2173b49-23a0-441c-895b-d645e906e36c","html_url":"https://github.com/BrianLesko/web-scraper","commit_stats":null,"previous_names":["brianlesko/web-scraper"],"tags_count":0,"template":true,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianLesko%2Fweb-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianLesko%2Fweb-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianLesko%2Fweb-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BrianLesko%2Fweb-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/BrianLesko","download_url":"https://codeload.github.com/BrianLesko/web-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254448614,"owners_count":22072765,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T23:27:44.082Z","updated_at":"2025-10-17T15:34:16.555Z","avatar_url":"https://github.com/BrianLesko.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Web Scraping\nThis code implements a web scraper that saves a page's text to a text file and collects a list of its links. Although this is a simple implementation, similar approaches are used to gather internet data for training AI models, especially machine learning models like OpenAI's ChatGPT. This implementation is written in [Pure Python](). 
Created for learning purposes.\n\n\n\u0026nbsp;\n\n\u003cdiv align=\"center\"\u003e\u003cimg src=\"docs/preview.png\" width=\"800\"\u003e\u003c/div\u003e\n\n\u0026nbsp;\n\n## Dependencies\n\nThis code uses the following libraries:\n- `streamlit`: for building the user interface.\n- `numpy`: for creating arrays.\n- `pandas`: for creating dataframes.\n- `bs4`: for picking the text out of a webpage's HTML code, a process known as parsing.\n- `requests`: for retrieving the HTML of a webpage.\n\n\n\u0026nbsp;\n\n## Usage\n\nRun the following commands in your terminal:\n```\npython3 -m venv my_env\nsource my_env/bin/activate # macOS or Linux\n.\\my_env\\Scripts\\activate # Windows\npip install --upgrade streamlit numpy pandas bs4 requests\nstreamlit run https://raw.githubusercontent.com/BrianLesko/web-scraper/main/app.py\n```\n\nThis will start the Streamlit server, and you can access the web scraper by opening a web browser and navigating to `http://localhost:8501`.\n\n\u0026nbsp;\n\n## How it Works\n\nThe web scraper works as follows:\n1. The user enters a URL in the input field.\n2. `requests` retrieves the HTML of the page at the user's URL.\n3. `bs4` parses the HTML code that makes up the website into text and links.\n4. The app displays some information about the text it parsed.\n5. 
The option to download the text or links appears.\n\n\u0026nbsp;\n\n## Repository Structure\n```\ndoc-chat/\n├── .streamlit/\n│   └── config.toml # theme info for the UI\n├── docs/\n│   └── preview.png\n├── app.py # the code and UI integrated together live here\n├── customize_gui # for adding gui elements like the about sidebar\n├── requirements.txt # the python packages needed to run locally\n└── .gitignore # includes the local virtual environment named my_env\n```\n\n\u0026nbsp;\n\n## Topics \n```\nPython | Streamlit | Git | Low Code UI\nChat interface | Web scraping | HTML Parsing\nSelf taught coding | Mechanical engineer | Robotics engineer\n```\n\u0026nbsp;\n\n\u003chr\u003e\n\n\u0026nbsp;\n\n\u003cdiv align=\"center\"\u003e\n\n\n\n╭━━╮╭━━━┳━━┳━━━┳━╮╱╭╮        ╭╮╱╱╭━━━┳━━━┳╮╭━┳━━━╮\n┃╭╮┃┃╭━╮┣┫┣┫╭━╮┃┃╰╮┃┃        ┃┃╱╱┃╭━━┫╭━╮┃┃┃╭┫╭━╮┃\n┃╰╯╰┫╰━╯┃┃┃┃┃╱┃┃╭╮╰╯┃        ┃┃╱╱┃╰━━┫╰━━┫╰╯╯┃┃╱┃┃\n┃╭━╮┃╭╮╭╯┃┃┃╰━╯┃┃╰╮┃┃        ┃┃╱╭┫╭━━┻━━╮┃╭╮┃┃┃╱┃┃\n┃╰━╯┃┃┃╰┳┫┣┫╭━╮┃┃╱┃┃┃        ┃╰━╯┃╰━━┫╰━╯┃┃┃╰┫╰━╯┃\n╰━━━┻╯╰━┻━━┻╯╱╰┻╯╱╰━╯        ╰━━━┻━━━┻━━━┻╯╰━┻━━━╯\n  \n\n\n\u0026nbsp;\n\n\n\u003ca href=\"https://twitter.com/BrianJosephLeko\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/BrianLesko/BrianLesko/f7be693250033b9d28c2224c9c1042bb6859bfe9/.socials/svg-white/x-logo-white.svg\" width=\"30\" alt=\"X Logo\"\u003e\u003c/a\u003e \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u003ca href=\"https://github.com/BrianLesko\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/BrianLesko/BrianLesko/f7be693250033b9d28c2224c9c1042bb6859bfe9/.socials/svg-white/github-mark-white.svg\" width=\"30\" alt=\"GitHub\"\u003e\u003c/a\u003e \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u003ca href=\"https://www.linkedin.com/in/brianlesko/\"\u003e\u003cimg src=\"https://raw.githubusercontent.com/BrianLesko/BrianLesko/f7be693250033b9d28c2224c9c1042bb6859bfe9/.socials/svg-white/linkedin-icon-white.svg\" width=\"30\" 
alt=\"LinkedIn\"\u003e\u003c/a\u003e\n\nfollow all of these or i will kick you\n\n\u003c/div\u003e\n\n\n\u0026nbsp;\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrianlesko%2Fweb-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrianlesko%2Fweb-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrianlesko%2Fweb-scraper/lists"}