{"id":14977325,"url":"https://github.com/afuntw/python-crawling-tutorial","last_synced_at":"2025-10-28T03:31:14.057Z","repository":{"id":21312409,"uuid":"92278128","full_name":"afunTW/Python-Crawling-Tutorial","owner":"afunTW","description":"Python crawling tutorial","archived":false,"fork":false,"pushed_at":"2023-02-08T03:56:20.000Z","size":117839,"stargazers_count":62,"open_issues_count":3,"forks_count":25,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-02-01T10:51:09.514Z","etag":null,"topics":["crawling","ipynb-jupyter-notebook","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/afunTW.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-05-24T10:03:35.000Z","updated_at":"2024-03-03T09:09:56.000Z","dependencies_parsed_at":"2024-09-11T12:33:31.060Z","dependency_job_id":"9bbcd493-e889-46cd-aaf2-827092ce657d","html_url":"https://github.com/afunTW/Python-Crawling-Tutorial","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afunTW%2FPython-Crawling-Tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afunTW%2FPython-Crawling-Tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afunTW%2FPython-Crawling-Tutorial/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afunTW%2FPython-Crawling-Tutorial/manifests","own
er_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/afunTW","download_url":"https://codeload.github.com/afunTW/Python-Crawling-Tutorial/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238590593,"owners_count":19497351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","ipynb-jupyter-notebook","python"],"created_at":"2024-09-24T13:55:27.652Z","updated_at":"2025-10-28T03:31:08.829Z","avatar_url":"https://github.com/afunTW.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Python-Crawling-Tutorial: Basic Web Crawling in Practice\n\n## Related Resources\n\nThe latest slides are on [slideshare](https://www.slideshare.net/ChenMingYang/python-crawling-tutorial-87165481) and are updated from time to time. The code can be downloaded with the **Clone or download** button on the right side of this page.\n![demo](https://user-images.githubusercontent.com/4820492/35319787-585ea0c4-011c-11e8-802a-02ae0dbc4044.png)\n\n\u003e Slides and teaching materials from before 2017 are under [release](https://github.com/afunTW/Python-Crawling-Tutorial/releases), but some of the websites used in the hands-on exercises no longer work.\n\u003e The slides can also be downloaded via this [link](https://goo.gl/CFR95x).\n\n## Environment Setup\n\n### Anaconda (recommended)\n\n- Download the Python 3.6 version: https://www.continuum.io/downloads\n- The exercises use the Chrome browser; please install [Chrome](https://www.google.com.tw/chrome/browser/desktop/index.html) for your platform\n- Crawling dynamic websites also requires a webdriver, which must be downloaded separately\n    - [Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads)\n    - [Firefox](https://github.com/mozilla/geckodriver/releases)\n- All exercises run in `jupyter notebook`; once Anaconda is installed, you can open the `.ipynb` files with the bundled `jupyter notebook`\n- Anaconda is recommended; if you have it installed, you only need to install the following packages\n\n```sh\n$ pip install 
selenium tldextract Pillow\n```\n\n### pip\n\npip is Python's package manager. On some systems `pip3` refers to the Python 3 version. Install pip3 as appropriate for your system, then install the following packages for Python 3.\n\n```sh\n# Use pip or pip3, depending on your system\n$ pip install requests beautifulsoup4 lxml Pillow selenium tldextract\n```\n\n#### Optional: Data Analysis\n\nThere are no exercises for this part, but there is example code you can run, so installation is optional. (If installing wordcloud fails, Visual Studio may be missing; you can download and install it from the URL given in the warning.)\n\n```sh\n# Anaconda\n$ pip install jieba wordcloud\n\n# pip\n$ pip3 install numpy pandas matplotlib scipy scikit-learn jieba wordcloud\n```\n\n## Please Follow the Site's Rules\n\nSome websites place a robots.txt file under their root directory. This is essentially the crawling policy defined by the site, so please respect it when practicing crawling.\n\n\u003e For the detailed syntax and purpose of robots.txt, see [wiki](https://zh.wikipedia.org/zh-tw/Robots.txt) and the [Google documentation](https://support.google.com/webmasters/answer/6062608?hl=zh-Hant)\n\n---\n\n## Q\u0026A\n\n**Q: What are some commonly used APIs?**\n\nAs mentioned in class, crawling is only one way of obtaining data. If the site provides an API, you can use it directly.\nAn API usually returns data in an already organized format, or determines what data you can access based on your permissions.\n\n- [Facebook Graph API](https://developers.facebook.com/tools/explorer/)\n- [Youtube](https://www.youtube.com/yt/dev/zh-TW/api-resources.html)\n- [Yahoo YQL](https://developer.yahoo.com/yql/)\n- [Instagram](https://www.instagram.com/developer/)\n- [KKTIX](http://support.kktix.com/knowledgebase/articles/558918-%E6%B4%BB%E5%8B%95%E8%B3%87%E8%A8%8A-api)\n- [Google Maps API](https://developers.google.com/maps/?hl=zh-tw)\n- [Taipei Open Data API](http://data.taipei/opendata/developer)\n- [Imgur API](https://api.imgur.com/endpoints)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafuntw%2Fpython-crawling-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fafuntw%2Fpython-crawling-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafuntw%2Fpython-crawling-tutorial/lists"}