{"id":16556618,"url":"https://github.com/nuhmanpk/webscrapper","last_synced_at":"2025-04-12T22:24:35.084Z","repository":{"id":46863004,"uuid":"401544879","full_name":"nuhmanpk/WebScrapper","owner":"nuhmanpk","description":"Powerful Telegram bot for web scraping and crawling. Fast, easy, and loved by thousands!","archived":false,"fork":false,"pushed_at":"2025-02-25T18:41:44.000Z","size":532,"stargazers_count":154,"open_issues_count":3,"forks_count":92,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-04T02:40:27.298Z","etag":null,"topics":["beautifulsoup4","crawler","crawler-engine","crawler-python","hacktoberfest","hacktoberfest-accepted","hacktoberfest2023","pyrogram","pyrogram-bot","requests","scraper","scraping","selenium","telegram","telegram-bot","web-scraping","webscraping","webscrapper","webscrapping","webscrapping-python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nuhmanpk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"nuhmanpk","ko_fi":"nuhmanpk","custom":["https://paytm.me/yoB-s0a"]}},"created_at":"2021-08-31T02:12:35.000Z","updated_at":"2025-03-18T19:23:05.000Z","dependencies_parsed_at":"2023-10-11T18:46:09.200Z","dependency_job_id":"3b5ee3df-6f64-41e2-b497-da27d47bd354","html_url":"https://github.com/nuhmanpk/WebScrapper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/nuhmanpk%2FWebScrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuhmanpk%2FWebScrapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuhmanpk%2FWebScrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nuhmanpk%2FWebScrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nuhmanpk","download_url":"https://codeload.github.com/nuhmanpk/WebScrapper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248638749,"owners_count":21137703,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup4","crawler","crawler-engine","crawler-python","hacktoberfest","hacktoberfest-accepted","hacktoberfest2023","pyrogram","pyrogram-bot","requests","scraper","scraping","selenium","telegram","telegram-bot","web-scraping","webscraping","webscrapper","webscrapping","webscrapping-python"],"created_at":"2024-10-11T20:05:13.476Z","updated_at":"2025-04-12T22:24:35.054Z","avatar_url":"https://github.com/nuhmanpk.png","language":"Python","funding_links":["https://github.com/sponsors/nuhmanpk","https://ko-fi.com/nuhmanpk","https://paytm.me/yoB-s0a"],"categories":[],"sub_categories":[],"readme":"# WebScrapperRoBot\nA simple, powerful, and versatile web scraping tool designed to simplify the process of extracting data from websites. 
It features a user-friendly menu-driven interface and supports a wide range of data extraction options, including raw HTML, HTML elements, paragraphs, links, audio, and video.\n\n[![Visits](https://api.visitorbadge.io/api/visitors?path=[https://github.com/nuhmanpk/webscrapper](https://github.com/nuhmanpk/portfolio)\u0026countColor=%23007EC6\u0026label=Visits\u0026style=flat-square\u0026token=YOUR_API_TOKEN)](https://github.com/nuhmanpk/WebScrapper)\n\n**_NOTE:_** New Patch supports web crawling.\n\nScraping Options:\n\n1. **Full Content**\n1. **HTML Data**\n1. **All Links**\n1. **All Paragraphs**\n1. **All Images**\n1. **All Audio**\n1. **All Video**\n1. **All PDFs**\n1. **Cookies**\n1. **LocalStorage**\n1. **Metadata**\n1. **Web ScreenShot\\***\n1. **Web Recording\\***\n1. **Web Crawler**\n1. **Got Something New? Add [here](https://github.com/nuhmanpk/WebScrapper/fork)**\n\n*These features are still under development.\n\n![Menu](./demos/updated-new-menu.png)\n\n![Video Scraping](./demos/video-scraping.png)\n\n## Run Bot in Google Colab for free\n\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nuhmanpk/WebScrapper/blob/main/WebScrapper.ipynb)\n\n## Key Features:\nUser-Friendly Menu-Driven Interface: Navigate easily through the bot's features using a simple and intuitive menu system.\n\nComprehensive Data Extraction: Extract a variety of data types, including raw HTML, HTML elements, paragraphs, links, audio, and video.\n\nRobust Error Handling: Handle unexpected errors and receive informative error messages to identify and resolve issues.\n\n## Use Cases:\nGather information from websites for research or analysis.\n\nMonitor competitor prices and product information.\n\nCollect data for marketing and lead generation.\n\nExtract news articles or social media posts for sentiment analysis.\n\n# Setting Up a Project and Configuring Environment Variables\n\nTo set up the project and configure 
environment variables, follow these steps:\n\n### 1. Clone the Repository\n\nClone the project's repository to your local machine using Git.\n\n```bash\ngit clone https://github.com/nuhmanpk/WebScrapper.git\n```\n\n### 2. Virtual Environment (Optional)\n\nIt's a good practice to create a virtual environment for the project. You can use virtualenv or venv for this purpose.\n\n```bash\npython -m venv venv\nsource venv/bin/activate  # On Unix/Linux systems\n```\n\n### 3. Install Dependencies\n\nUse pip to install the project's dependencies from the requirements.txt file.\n\n```bash\npip install -r requirements.txt\n```\n\n### 4. Create the .env File\n\nCreate a .env file in the project's root directory. This file will contain the necessary environment variables. You can either copy a sample file or create it manually.\n\n### 5. Configure Environment Variables\n\nOpen the .env file and set the required environment variables in the format VARIABLE_NAME=value. For example:\n\n```env\nBOT_TOKEN=your_bot_token_here\nAPI_ID=your_api_id_here\nAPI_HASH=your_api_hash_here\n```\n\n### 6. Run the Project\n\nRun the project using the appropriate command (e.g., python my_project.py). The code reads its configuration from the environment variables.\n\n### 7. Consider Secret Management (Optional)\n\nIf you deploy your project on a cloud server, consider using a secrets manager like AWS Secrets Manager, Google Secret Manager, or a similar service. This will help you securely store your configurations in a production environment.\n\n## What is Web Scraping?\n  Web scraping, also called web harvesting or web data extraction, is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol or through a web browser.\n## Is web scraping Legal?\n  Web scraping itself is not illegal. 
As a matter of fact, web scraping, or web crawling, has historically been associated with well-known search engines like Google or Bing. These search engines crawl sites and index the web. ... A clear example of when web scraping can be illegal is when you try to scrape nonpublic data.\n## Why is web scraping done?\n  Web scraping is used in a variety of digital businesses that rely on data harvesting. Legitimate use cases include: Search engine bots crawling a site, analyzing its content and then ranking it. ... Market research companies using scrapers to pull data from forums and social media (e.g., for sentiment analysis).\n## Where can I use web scraping?\n  Web scraping software can be used for lead generation for marketing, price comparison and competition monitoring, e-commerce, real estate, data analysis, academic research, training and testing data for machine learning projects, and sports betting odds analysis.\n## Are there any Limitations?\n  There is a learning curve: even the easiest scraping tool takes time to master. The structure of websites changes frequently, and scraped data is arranged according to the structure of the website. Complex websites are not easy to handle, extracting data on a large scale is much harder, and no web scraping tool is omnipotent.\n\n[Take a Demo Here](http://t.me/web_scrapper_robot)\n\n## Ethical Considerations\n\nRespect the rights of website owners: Website owners have the right to control how their content is used. Scraping a website without permission can be considered trespassing or copyright infringement.\n\nDon't overload websites: Scraping a website too frequently can overload its servers and make it unavailable to other users. Be mindful of the website's load when scraping data.\n\nUse robots.txt: Robots.txt is a file that website owners can use to specify which pages they do not want scraped. 
Respect the robots.txt file and avoid scraping pages that are disallowed.\n\nIdentify yourself as a scraper: When scraping a website, identify yourself as a scraper in your user agent string. This will help website owners to understand who is accessing their site and for what purpose.\n\nBe transparent about your intentions: If you are scraping a website for commercial purposes, be transparent about your intentions. This will help to build trust with website owners and users.\n\n## Safety Guidelines\n\nNever scrape personal information: Scraping personal information, such as names, addresses, and email addresses, is a violation of privacy. Never scrape personal information without the explicit consent of the individuals involved.\n\nAvoid scraping sensitive data: Avoid scraping data that could be used to harm individuals or organizations, such as financial information, medical records, or trade secrets.\n\nBe cautious about scraping social media: Social media platforms have their own terms of service that govern how data can be scraped. Be sure to comply with these terms of service when scraping social media data.\n\n## Contributors\n\n![GitHub Contributors Image](https://contrib.rocks/image?repo=bughunter0/WebScrapperRoBot)\n\n## Support\n\nShow your support [here](https://github.com/sponsors/nuhmanpk)\n\n\u003cbr\u003e\u003cb\u003eMark your Star [⭐⭐](https://github.com/nuhmanpk/WebScrapper/stargazers)\u003c/b\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuhmanpk%2Fwebscrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnuhmanpk%2Fwebscrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnuhmanpk%2Fwebscrapper/lists"}