{"id":22555779,"url":"https://github.com/hamidurrk/epaper-scraper","last_synced_at":"2025-03-28T11:13:16.761Z","repository":{"id":221526710,"uuid":"754616837","full_name":"hamidurrk/epaper-scraper","owner":"hamidurrk","description":"Web scraper for extracting data from online newspapers","archived":false,"fork":false,"pushed_at":"2024-05-29T02:29:47.000Z","size":20085,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-05-29T07:23:10.176Z","etag":null,"topics":["asynchronous-programming","beautifulsoup4","cuda-toolkit","dataminig","lxml","python","selenium-python","sqlite3","tesseract","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hamidurrk.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-08T12:41:13.000Z","updated_at":"2024-06-01T09:55:49.968Z","dependencies_parsed_at":"2024-02-08T14:43:43.617Z","dependency_job_id":"23977652-0445-435a-9bd7-4e7f2bb975dc","html_url":"https://github.com/hamidurrk/epaper-scraper","commit_stats":null,"previous_names":["hamidurrk/epaper-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hamidurrk%2Fepaper-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hamidurrk%2Fepaper-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hamidurrk%2Fepaper-scraper/releases","manifests_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/hamidurrk%2Fepaper-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hamidurrk","download_url":"https://codeload.github.com/hamidurrk/epaper-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246017731,"owners_count":20710240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asynchronous-programming","beautifulsoup4","cuda-toolkit","dataminig","lxml","python","selenium-python","sqlite3","tesseract","webscraping"],"created_at":"2024-12-07T19:08:57.127Z","updated_at":"2025-03-28T11:13:16.737Z","avatar_url":"https://github.com/hamidurrk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# E-paper Scraper\n\nThis is a Python-based web scraper for extracting data from online newspapers. Tested on the [Jugantor](https://epaper.jugantor.com/) and [Prothom Alo](https://epaper.prothomalo.com/Home/) newspapers from 2012 to 2024 on a Windows 11 machine.\n\n## Installation\n\n### Prerequisites\n\nBefore installing the scraper, ensure you have the following prerequisites:\n\n- A Windows machine (tested on Windows 11)\n- [Python 3.12.0](https://www.python.org/downloads/release/python-3120/) installed on your system\n- [Anaconda](https://www.anaconda.com/download/) for installing some packages\n- [NVIDIA CUDA Toolkit 12.4](https://developer.nvidia.com/cuda-downloads?target_os=Windows) for GPU-accelerated processes\n- [Firefox](https://www.mozilla.org/en-US/firefox/new/) installed as your browser. 
Make sure to install it in `C:\\Program Files\\Mozilla Firefox\\`\n\n### Installation Steps\n\n#### CUDA Toolkits\n\n*Only the NVIDIA GPUs listed on [this page](https://developer.nvidia.com/cuda-gpus) are supported for now. If your graphics card has CUDA cores, you can proceed with the setup. If not, contact the developer.*\n\n1. Make sure that your NVIDIA drivers are up to date.\n\n2. Add Anaconda to the environment and run the following commands in the command prompt.\n\n```bash\nconda install numba\nconda install cudatoolkit\n```\n*__NOTE:__ If Anaconda is not added to the environment, navigate to the Anaconda installation, locate the `Scripts` directory, and open the command prompt there.*\n\n#### Tesseract\n\n1. Download the Tesseract OCR executable from [here](https://github.com/UB-Mannheim/tesseract/wiki).\n\n2. Install Tesseract OCR by following the installation instructions provided in the repository. Make sure to install it in `C:\\Program Files (x86)\\Tesseract-OCR`.\n\n3. Open a command prompt or Anaconda prompt.\n\n4. Navigate to the directory where you have cloned or downloaded the epaper-scraper repository.\n\n5. Create and activate a virtual environment (optional but recommended):\n\n    ```bash\n    python -m venv venv\n    venv\\Scripts\\activate\n    ```\n\n6. Install the required Python packages using pip:\n\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n7. Test if Tesseract OCR is installed correctly by opening a Python prompt and running:\n\n    ```python\n    import pytesseract\n    # Point pytesseract at the install path from step 2 (needed if Tesseract is not on PATH)\n    pytesseract.pytesseract.tesseract_cmd = r\"C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe\"\n    print(pytesseract.get_tesseract_version())\n    ```\n\n    If this prints a version number without errors, Tesseract OCR is installed successfully.\n\n## Usage\n\nThere are two ways to use this software: **With GUI** and **Without GUI**.\n\nTo use the epaper-scraper **With GUI**, follow these steps:\n\n1. 
Run `main.py` from `src`, which launches a desktop application like the following:\n\n![Epaper Scraper Interface](resources/gui_ss_1.png)\n\n2. Use the interface to access the supported features of the software.\n\n*__Note:__ The GUI lacks some advanced features that are available in the \"Without GUI\" version. The interface is being updated to implement these features.*\n\nTo use the **advanced features** of epaper-scraper **Without GUI**, follow these steps:\n\n1. Double-click the `start_firefox.bat` file to run it. Alternatively, run the commands from `cmd.txt`. This will initialize a Firefox browser instance.\n\n2. Call functions and adjust parameters in the Python files in `src`, then run them.\nExample:\n    ```bash\n    python main.py\n    ```\n\n3. The scraper will start extracting data from the specified newspaper website and save it to the specified output directory.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhamidurrk%2Fepaper-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhamidurrk%2Fepaper-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhamidurrk%2Fepaper-scraper/lists"}