{"id":13617024,"url":"https://github.com/niqdev/packtpub-crawler","last_synced_at":"2025-04-12T22:38:47.312Z","repository":{"id":33765612,"uuid":"37421331","full_name":"niqdev/packtpub-crawler","owner":"niqdev","description":"Download your daily free Packt Publishing eBook https://www.packtpub.com/packt/offers/free-learning","archived":false,"fork":false,"pushed_at":"2022-07-06T20:08:13.000Z","size":165,"stargazers_count":758,"open_issues_count":21,"forks_count":177,"subscribers_count":69,"default_branch":"master","last_synced_at":"2025-04-04T02:08:21.833Z","etag":null,"topics":["docker","firebase","free-ebook","google-drive","heroku","ifttt","onedrive","packtpub"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/niqdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-14T17:03:44.000Z","updated_at":"2025-03-12T10:58:23.000Z","dependencies_parsed_at":"2022-07-18T06:00:35.621Z","dependency_job_id":null,"html_url":"https://github.com/niqdev/packtpub-crawler","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/niqdev%2Fpacktpub-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/niqdev%2Fpacktpub-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/niqdev%2Fpacktpub-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/niqdev%2Fpacktpub-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/niqdev","download_u
rl":"https://codeload.github.com/niqdev/packtpub-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248643042,"owners_count":21138353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docker","firebase","free-ebook","google-drive","heroku","ifttt","onedrive","packtpub"],"created_at":"2024-08-01T20:01:35.992Z","updated_at":"2025-04-12T22:38:47.292Z","avatar_url":"https://github.com/niqdev.png","language":"Python","readme":"# packtpub-crawler\n\n### Download a FREE eBook every day from [www.packtpub.com](https://www.packtpub.com/packt/offers/free-learning)\n\nThis crawler automates the following steps:\n\n* access your private account\n* claim the daily free eBook and weekly Newsletter\n* parse title, description and useful information\n* download your favorite format *.pdf .epub .mobi*\n* download source code and book cover\n* upload files to Google Drive, OneDrive or via scp\n* store data on Firebase\n* notify via Gmail, IFTTT, Join or Pushover (on success and errors)\n* schedule a daily job on Heroku or with Docker\n\n### Default command\n```bash\n# upload pdf to googledrive, store data and notify via email\npython script/spider.py -c config/prod.cfg -u googledrive -s firebase -n gmail\n```\n\n### Other options\n```bash\n# download all formats\npython script/spider.py --config config/prod.cfg --all\n\n# download only one format: pdf|epub|mobi\npython script/spider.py --config config/prod.cfg --type pdf\n\n# also download additional material: source code (if available) and book cover\npython script/spider.py --config 
config/prod.cfg -t pdf --extras\n# equivalent (default is pdf)\npython script/spider.py -c config/prod.cfg -e\n\n# download and then upload to Google Drive (anyone with the download URL can download it)\npython script/spider.py -c config/prod.cfg -t epub --upload googledrive\npython script/spider.py --config config/prod.cfg --all --extras --upload googledrive\n\n# download and then upload to OneDrive (anyone with the download URL can download it)\npython script/spider.py -c config/prod.cfg -t epub --upload onedrive\npython script/spider.py --config config/prod.cfg --all --extras --upload onedrive\n\n# download and notify: gmail|ifttt|join|pushover\npython script/spider.py -c config/prod.cfg --notify gmail\n\n# only claim the book (no downloads)\npython script/spider.py -c config/prod.cfg --notify gmail --claimOnly\n```\n\n### Basic setup\n\nBefore you start, you should\n\n* Verify that your currently installed version of Python is **2.x** with `python --version`\n* Clone the repository `git clone https://github.com/niqdev/packtpub-crawler.git`\n* Install all the dependencies `pip install -r requirements.txt` (see also [virtualenv](https://github.com/niqdev/packtpub-crawler#virtualenv))\n* Create a [config](https://github.com/niqdev/packtpub-crawler/blob/master/config/prod_example.cfg) file `cp config/prod_example.cfg config/prod.cfg`\n* Set your Packtpub credentials in the config file\n```\n[credential]\ncredential.email=PACKTPUB_EMAIL\ncredential.password=PACKTPUB_PASSWORD\n```\n\nNow you should be able to claim and download your first eBook\n```\npython script/spider.py --config config/prod.cfg\n```\n\n### Google Drive\n\nAccording to the documentation, the Google Drive API requires OAuth 2.0 for authentication, so to upload files you should:\n\n* Go to the [Google APIs Console](https://code.google.com/apis/console) and create a new [Google Drive](https://console.developers.google.com/apis/api/drive/overview) project named **PacktpubDrive**\n* On *API manager \u003e Overview* 
menu\n  * Enable the Google Drive API\n* On *API manager \u003e Credentials* menu\n  * In the *OAuth consent screen* tab, set **PacktpubDrive** as the product name shown to users\n  * In the *Credentials* tab, create credentials of type *OAuth client ID*, choose Application type *Other* and name them **PacktpubDriveCredentials**\n* Click *Download JSON* and save the file as `config/client_secrets.json`\n* Set your Google Drive credentials in the config file\n\n```\n[googledrive]\n...\ngoogledrive.client_secrets=config/client_secrets.json\ngoogledrive.gmail=GOOGLE_DRIVE@gmail.com\n```\n\nNow you should be able to upload your eBook to Google Drive\n```\npython script/spider.py --config config/prod.cfg --upload googledrive\n```\n\nThe first time, you will be prompted to log in via a browser with JavaScript enabled (no text-based browser) to generate `config/auth_token.json`.\nYou should also copy the *FOLDER_ID* into the config, otherwise a new folder with the same name will be created every time.\n```\n[googledrive]\n...\ngoogledrive.default_folder=packtpub\ngoogledrive.upload_folder=FOLDER_ID\n```\n\nDocumentation: [OAuth](https://developers.google.com/api-client-library/python/guide/aaa_oauth), [Quickstart](https://developers.google.com/drive/v3/web/quickstart/python), [example](https://github.com/googledrive/python-quickstart) and [permissions](https://developers.google.com/drive/v2/reference/permissions)\n\n### OneDrive\n\nAccording to the documentation, the OneDrive API requires OAuth 2.0 for authentication, so to upload files you should:\n\n* Go to the [Microsoft Application Registration Portal](https://apps.dev.microsoft.com/?referrer=https%3A%2F%2Fdev.onedrive.com%2Fapp-registration.htm).\n* When prompted, sign in with your Microsoft account credentials.\n* Find **My applications** and click **Add an app**.\n* Enter **PacktpubDrive** as the app's name and click **Create application**.\n* Scroll to the bottom of the page and check the **Live SDK support** box.\n* Set your 
OneDrive credentials in the config file\n  * Copy your **Application Id** into the config file as **onedrive.client_id**\n  * Click **Generate New Password** and copy the password shown into the config file as **onedrive.client_secret**\n  * Click **Add Platform** and select **Web**\n  * Enter **http://localhost:8080/** as the **Redirect URL**\n  * Click **Save** at the bottom of the page\n\n```\n[onedrive]\n...\nonedrive.client_id=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\nonedrive.client_secret=XxXxXxXxXxXxXxXxXxXxXxX\n```\n\nNow you should be able to upload your eBook to OneDrive\n```\npython script/spider.py --config config/prod.cfg --upload onedrive\n```\n\nThe first time, you will be prompted to log in via a browser with JavaScript enabled (no text-based browser) to generate `config/session.onedrive.pickle`.\n```\n[onedrive]\n...\nonedrive.folder=packtpub\n```\n\nDocumentation: [Registration](https://dev.onedrive.com/app-registration.htm), [Python API](https://github.com/OneDrive/onedrive-sdk-python)\n\n### Scp\n\nTo upload your eBook via `scp` to a remote server, update the configs\n\n```\n[scp]\nscp.host=SCP_HOST\nscp.user=SCP_USER\nscp.password=SCP_PASSWORD\nscp.path=SCP_UPLOAD_PATH\n```\n\nNow you should be able to upload your eBook\n```\npython script/spider.py --config config/prod.cfg --upload scp\n```\n\nNote:\n* the destination folder `scp.path` must already exist on the remote server\n* the option `--upload scp` is incompatible with `--store` and `--notify`\n\n### Firebase\n\nCreate a new Firebase [project](https://console.firebase.google.com/), copy the database secret from your settings\n```\nhttps://console.firebase.google.com/project/PROJECT_NAME/settings/database\n```\nand update the configs\n```\n[firebase]\nfirebase.database_secret=DATABASE_SECRET\nfirebase.url=https://PROJECT_NAME.firebaseio.com\n```\n\nNow you should be able to store your eBook details on Firebase\n```\npython script/spider.py --config config/prod.cfg --upload 
googledrive --store firebase\n```\n\n### Gmail notification\n\nTo *send* a notification via email using Gmail, you should:\n\n* Allow [\"less secure apps\"](https://www.google.com/settings/security/lesssecureapps) and [\"DisplayUnlockCaptcha\"](https://accounts.google.com/DisplayUnlockCaptcha) on your account\n* See the sign-in [troubleshooting guide](https://support.google.com/mail/answer/78754) and these [examples](http://stackoverflow.com/questions/10147455/how-to-send-an-email-with-gmail-as-provider-using-python)\n* Set your Gmail credentials in the config file\n\n```\n[gmail]\n...\ngmail.username=EMAIL_USERNAME@gmail.com\ngmail.password=EMAIL_PASSWORD\ngmail.from=FROM_EMAIL@gmail.com\ngmail.to=TO_EMAIL_1@gmail.com,TO_EMAIL_2@gmail.com\n```\n\nNow you should be able to notify your accounts\n```\npython script/spider.py --config config/prod.cfg --notify gmail\n```\n\n### IFTTT notification\n\n* Get an account on [IFTTT](https://ifttt.com)\n* Go to your Maker [settings](https://ifttt.com/services/maker/settings) and activate the channel\n* [Create](https://ifttt.com/create) a new applet using the Maker service with the trigger \"Receive a web request\" and the event name \"packtpub-crawler\"\n* Set your IFTTT [key](https://internal-api.ifttt.com/maker) in the config file\n\n```\n[ifttt]\nifttt.event_name=packtpub-crawler\nifttt.key=IFTTT_MAKER_KEY\n```\n\nNow you should be able to trigger the applet\n```\npython script/spider.py --config config/prod.cfg --notify ifttt\n```\n\nValue mappings:\n* value1: title\n* value2: description\n* value3: landing page URL\n\n### Join notification\n\n* Get the Join [Chrome extension](https://chrome.google.com/webstore/detail/join-by-joaoapps/flejfacjooompmliegamfbpjjdlhokhj) and/or [App](https://play.google.com/store/apps/details?id=com.joaomgcd.join)\n* You can find your device ids [here](https://joinjoaomgcd.appspot.com/)\n* (Optional) You can use multiple devices or groups (group.all, group.android, group.chrome, group.windows10, 
group.phone, group.tablet, group.pc) separated by commas\n* Set your Join credentials in the config file\n\n```\n[join]\njoin.device_ids=DEVICE_IDS_COMMA_SEPARATED_OR_GROUP_NAME\njoin.api_key=API_KEY\n```\n\nNow you should be able to trigger the event\n```\npython script/spider.py --config config/prod.cfg --notify join\n```\n\n### Pushover notification\n\n* Get your [USER_KEY](https://pushover.net/)\n* Create a [new application](https://pushover.net/apps/build)\n* (Optional) Add an [icon](https://pushover.net/icons/9aqpv697p9g6wzo.png)\n* Set your Pushover credentials in the config file\n\n```\n[pushover]\npushover.user_key=PUSHOVER_USER_KEY\npushover.api_key=PUSHOVER_API_KEY\n```\n\n### Heroku\n\nCreate a new branch\n```\ngit checkout -b heroku-scheduler\n```\n\nUpdate the `.gitignore` and commit your changes\n```bash\n# remove\nconfig/prod.cfg\nconfig/client_secrets.json\nconfig/auth_token.json\n# add\ndev/\nconfig/dev.cfg\nconfig/prod_example.cfg\n```\n\nCreate, configure and deploy the scheduler\n```bash\nheroku login\n# create a new app\nheroku create APP_NAME --region eu\n# or if you already have an existing app\nheroku git:remote -a APP_NAME\n\n# deploy your app\ngit push -u heroku heroku-scheduler:master\nheroku ps:scale clock=1\n\n# useful commands\nheroku ps\nheroku logs --ps clock.1\nheroku logs --tail\nheroku run bash\n```\n\nUpdate `script/scheduler.py` with your own preferences.\n\nMore info about Heroku [Scheduler](https://devcenter.heroku.com/articles/scheduler), [Clock Processes](https://devcenter.heroku.com/articles/clock-processes-python), [Add-on](https://elements.heroku.com/addons/scheduler) and [APScheduler](http://apscheduler.readthedocs.io/en/latest/userguide.html)\n\n### Docker\n\nBuild your image\n```\ndocker build -t niqdev/packtpub-crawler:2.4.0 .\n```\n\nRun manually\n```\ndocker run \\\n  --rm \\\n  --name my-packtpub-crawler \\\n  niqdev/packtpub-crawler:2.4.0 \\\n  python script/spider.py --config config/prod.cfg\n```\n\nRun 
the scheduled crawler in the background\n```\ndocker run \\\n  --detach \\\n  --name my-packtpub-crawler \\\n  niqdev/packtpub-crawler:2.4.0\n\n# useful commands\ndocker exec -i -t my-packtpub-crawler bash\ndocker logs -f my-packtpub-crawler\n```\n\nAlternatively, you can pull this [fork](https://github.com/kuchy/packtpub-crawler/tree/docker_cron) from [Docker Hub](https://hub.docker.com/r/kuchy/packtpub-crawler/)\n```\ndocker pull kuchy/packtpub-crawler\n```\n\n### Cron job\nAdd this to your crontab to run the job daily at 9 AM:\n```\ncrontab -e\n\n00 09 * * * cd PATH_TO_PROJECT/packtpub-crawler \u0026\u0026 /usr/bin/python script/spider.py --config config/prod.cfg \u003e\u003e /tmp/packtpub.log 2\u003e\u00261\n```\n\n### Systemd service\nCreate two files in `/etc/systemd/system`:\n\n1. packtpub-crawler.service\n```\n[Unit]\nDescription=run packtpub-crawler\n\n[Service]\nUser=USER_THAT_SHOULD_RUN_THE_SCRIPT\nWorkingDirectory=PATH_TO_PROJECT/packtpub-crawler\nExecStart=/usr/bin/python2.7 PATH_TO_PROJECT/packtpub-crawler/script/spider.py -c config/prod.cfg\n\n[Install]\nWantedBy=multi-user.target\n```\n\n2. 
packtpub-crawler.timer\n```\n[Unit]\nDescription=Runs packtpub-crawler every day at 7 AM\n\n[Timer]\nOnBootSec=10min\nOnActiveSec=1s\nOnCalendar=*-*-* 07:00:00\nUnit=packtpub-crawler.service\nPersistent=true\n\n[Install]\nWantedBy=timers.target\n```\n\nEnable the timer with ```sudo systemctl enable packtpub-crawler.timer```.\nYou can test the service with ```sudo systemctl start packtpub-crawler.timer``` and see the output with ```sudo journalctl -u packtpub-crawler.service -f```.\n\n### Newsletter\nThe script also downloads the free eBooks from the weekly Packtpub newsletter.\nThe [URL](https://goo.gl/kUciut) is generated by a Google Apps Script which parses all the mails.\nYou can get the code [here](https://gist.github.com/juzim/af0ef80f1233de51614d88551514b0ad); if you want to see the actual script, clone the [spreadsheet](https://docs.google.com/spreadsheets/d/1jN5gV45uVkE0EEF4Nb-yVNfIr3o8OoiVveUZJRMiLFw) and go to `Tools \u003e Script editor...`.\n\nTo use your own source, update the config\n```\nurl.bookFromNewsletter=https://goo.gl/kUciut\n```\n\nThe URL should point to a file containing only the URL (no semicolons, HTML, JSON, etc.).\n\nYou can also clone the [spreadsheet](https://docs.google.com/spreadsheets/d/1jN5gV45uVkE0EEF4Nb-yVNfIr3o8OoiVveUZJRMiLFw) to use your own Gmail account. 
Subscribe to the [newsletter](https://www.packtpub.com) (at the bottom of the page) and create a filter to tag your mails accordingly.\n\n\n### Troubleshooting\n* ImportError: No module named paramiko\n\nInstall paramiko with `sudo -H pip install paramiko --ignore-installed`\n\n* Failed building wheel for cryptography\n\nInstall the missing dependencies as described [here](https://cryptography.io/en/latest/installation/#building-cryptography-on-windows)\n\n### virtualenv\n\n```\n# install pip + setuptools\ncurl https://bootstrap.pypa.io/get-pip.py | python -\n\n# upgrade pip\npip install -U pip\n\n# install virtualenv globally\nsudo pip install virtualenv\n\n# create virtualenv\nvirtualenv env\n\n# activate virtualenv\nsource env/bin/activate\n\n# verify virtualenv\nwhich python\npython --version\n\n# deactivate virtualenv\ndeactivate\n```\n\n### Development (only for spidering)\nRun a simple static server with\n```\nnode dev/server.js\n```\nand test the crawler with\n```\npython script/spider.py --dev --config config/dev.cfg --all\n```\n\n### Disclaimer\n\nThis project is just a proof of concept and is not intended for any illegal usage. I'm not responsible for any damage or abuse; use it at your own risk.\n","funding_links":[],"categories":["Python","heroku"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fniqdev%2Fpacktpub-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fniqdev%2Fpacktpub-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fniqdev%2Fpacktpub-crawler/lists"}