{"id":31782079,"url":"https://github.com/echou0723/threads-","last_synced_at":"2026-05-14T20:07:12.651Z","repository":{"id":315347660,"uuid":"1059106658","full_name":"EChou0723/Threads-","owner":"EChou0723","description":"Python scraper for Threads content with anti-blocking and auto-retry features","archived":false,"fork":false,"pushed_at":"2025-09-18T03:15:32.000Z","size":6,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-18T05:34:36.970Z","etag":null,"topics":["data-collection","python","scraping-websites","selenium","threads"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EChou0723.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-18T02:23:56.000Z","updated_at":"2025-09-18T03:17:35.000Z","dependencies_parsed_at":"2025-09-18T19:15:37.914Z","dependency_job_id":null,"html_url":"https://github.com/EChou0723/Threads-","commit_stats":null,"previous_names":["echou0723/threads-"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/EChou0723/Threads-","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EChou0723%2FThreads-","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EChou0723%2FThreads-/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EChou0723%2FThreads-/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EChou0723%2FThreads-/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EChou0723","download_url":"https://codeload.github.com/EChou0723/Threads-/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EChou0723%2FThreads-/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279003392,"owners_count":26083579,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-10T02:00:06.843Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-collection","python","scraping-websites","selenium","threads"],"created_at":"2025-10-10T09:14:24.130Z","updated_at":"2025-10-10T09:14:33.008Z","avatar_url":"https://github.com/EChou0723.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Threads Content Scraper 🕸️\n\nA Python scraper for collecting content from the Threads social platform, designed for gathering and analyzing investment-related posts.\n\n## ✨ Features\n\n- 🔐 **Login detection** - Detects the login state automatically and prompts for a manual login\n- 🛡️ **Anti-blocking measures** - Random delays, User-Agent spoofing, rest periods between batches\n- 🔄 **Automatic refetch** - Detects abnormal content and generates a refetch list\n- 📊 **Quality control** - Automatic deduplication, anomaly detection, success-rate statistics\n- 🤖 **NotebookLM integration** - Outputs a format ready for AI analysis\n\n## 📊 Results\n\n- **Total scraped**: 447 posts\n- **Success rate**: 87.7%\n- **Valid content**: 383 posts\n- **Total length**: 330,000 characters\n- **Average length**: 1,609 characters per post\n\n## 🚀 Quick Start\n\n### Requirements\npip install -r requirements.txt\n\n### Basic Usage\npython 
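src/threads_batch_spider.py\n\n### Steps\n\n1. The program opens a Chrome browser automatically\n2. Log in to your Threads account manually\n3. Confirm the login succeeded, then press Enter\n4. The program starts scraping automatically\n5. A CSV file is written when it finishes\n\n## 📁 Files\n\n### Core Scripts\n- `threads_batch_spider.py` - Full scraper (recommended)\n- `threads_post_content_crawler.py` - Simplified scraper\n\n### Data Files\n- `all_threads_urls_first_crawl.csv` - List of all post URLs\n- `make_investment_easy_threads_main_posts.csv` - Raw scrape results\n- `make_investment_easy_threads_main_posts_clean.csv` - Cleaned data\n- `threads_for_notebooklm.csv` - NotebookLM format\n- `threads_urls_to_refetch.csv` - URLs that need to be refetched\n\n### Output Format\n- `Bai-Hua-Tou-Zi-ThreadsNei-Rong.txt` - Plain text, suitable for AI analysis\n\n## 🔧 Core Features\n\n### 1. Login Detection\nif \"登入\" in driver.page_source or \"Login\" in driver.page_source:  # \"登入\" is Chinese for \"log in\"\n    print(f\"[ERROR] Login required: {url}\")\n    return \"LOGIN_REQUIRED\"\n\n### 2. Anti-Blocking Measures\ntime.sleep(random.uniform(10, 18))  # random delay between posts\nif (idx + 1) % 20 == 0:\n    time.sleep(120)  # rest for 2 minutes after every 20 posts\n\n\n### 3. Automatic Refetch\nimport pandas as pd\n\ndef detect_empty_posts(input_csv, output_csv):\n    df = pd.read_csv(input_csv, dtype=str).fillna(\"\")  # load results; blank cells become \"\"\n    error_conditions = [\"\", \"LOGIN_REQUIRED\", \"NO_TEXT_FOUND\", \"nan\"]\n    empty_urls = df[df[\"post_text_and_replies\"].isin(error_conditions)]\n    empty_urls.to_csv(output_csv)\n\n\n## 📈 Quality Control\n\n### Success Rate\n- ✅ Valid content: 392 posts (87.7%)\n- ❌ Empty content: 48 posts\n- ❌ Login expired: 7 posts\n\n### Content Quality\n- Average length: 1,609 characters\n- Longest post: 6,853 characters\n- Duplicate content: \u003c 2%\n\n## 🤖 NotebookLM Integration\n\nThe processed data can be uploaded directly to Google NotebookLM:\n\n1. Use `threads_for_notebooklm.csv` or `Bai-Hua-Tou-Zi-ThreadsNei-Rong.txt`\n2. Upload it to NotebookLM to build a knowledge base\n3. 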
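Use the AI question-answering features\n\nExample questions:\n- \"What is 白話投資's view on technical analysis?\"\n- \"Summarize all discussions about options\"\n- \"Analyze the key points of risk management\"\n\n## ⚠️ FAQ\n\n### Empty CSV\n**Cause**: Not logged in inside the browser opened by Selenium\n**Fix**: Wait for the prompt, then log in manually in the automatically opened browser window\n\n### HTTP 500 Errors\n**Cause**: Anti-scraping measures were triggered or the IP is temporarily blocked\n**Fix**: Increase the delay interval or switch networks\n\n### Duplicate Content\n**Cause**: Threads sharing behavior or an abnormal page structure\n**Fix**: Use the built-in deduplication\n\nFor detailed troubleshooting, see [troubleshooting.md](docs/troubleshooting.md)\n\n## 🛠️ Customization\n\n### Change the Target Account\nusername = \"your_target_account\"\nprofile_url = \"https://www.threads.com/@your_target_account\"\n\n\n### Adjust the Delay\ntime.sleep(random.uniform(5, 10)) # use a shorter delay\ntime.sleep(random.uniform(15, 25)) # or use a longer delay\n\n\n## 📊 Analysis Suggestions\n\n1. **Content analysis**: Use NLP techniques to analyze topic distribution\n2. **Sentiment analysis**: Track changes in market sentiment\n3. **Time series**: Analyze how viewpoints evolve over time\n4. **Knowledge graph**: Build a network of related concepts\n\n## 🤝 Contributing\n\nIssues and pull requests are welcome!\n\n1. Fork the project\n2. Create a feature branch\n3. Commit your changes\n4. Open a pull request\n\n## 📄 License\n\nMIT License - see the [LICENSE](LICENSE) file\n\n## 🙏 Acknowledgments\n\n- [Selenium](https://selenium-python.readthedocs.io/) - Web automation framework\n- [Pandas](https://pandas.pydata.org/) - Data processing tools\n- [白話投資](https://www.threads.com/@make_investment_easy) - Content source\n\n---\n\n⭐ If this project helps you, please give it a star!\n\n📋 requirements.txt\nselenium\u003e=4.15.0\npandas\u003e=2.0.0\nrequests\u003e=2.31.0\nbeautifulsoup4\u003e=4.12.0\nlxml\u003e=4.9.0\n\n🚫 .gitignore\n# Python\n__pycache__/\n*.py[cod]\n*$py.class\n*.so\n.Python\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\nwheels/\nshare/python-wheels/\n*.egg-info/\n.installed.cfg\n*.egg\nMANIFEST\n\n# Virtual Environment\nvenv/\nenv/\nENV/\n\n# IDE\n.vscode/\n.idea/\n*.swp\n*.swo\n\n# OS\n.DS_Store\nThumbs.db\n\n# Browser drivers\nchromedriver*\ngeckodriver*\n\n# Large data files (optional)\ndata/raw/*.csv\ndata/processed/*.csv\n*.txt\n\n# Logs\n*.log\n\n# Credentials (if any)\n.env\ncredentials.json\n\n🔧 docs/troubleshooting.md\n# Troubleshooting Guide\n\n## Common Problems and Solutions\n\n### 1. Empty CSV File\n\n#### Symptom\nThe program finishes, but the CSV contains only URLs and every content field is blank.\n\n#### Root Cause\n- Selenium opens a brand-new browser session\n- That session is fully isolated from the browser you are already logged in to locally\n- The program visits pages in a logged-out state\n\n#### Solution\n1. 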
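Wait for the program's prompt「請先登入...」(\"Please log in first...\")\n2. Log in to Threads manually inside **the automatically opened browser window**\n3. Confirm you can see your profile page, then press Enter\n4. Do not log in from another local browser window\n\n### 2. HTTP 500 Errors\n\n#### Symptom\n`HTTP ERROR 500` appears during scraping, or you are redirected to an error page.\n\n#### Possible Causes\n- The request rate is too high and triggers anti-scraping measures\n- The IP is temporarily blocked\n- The site is under temporary maintenance\n\n#### Solution\n1. Increase the delay interval: `time.sleep(random.uniform(15, 30))`\n2. Switch networks (for example, a mobile hotspot)\n3. Try again later (the block usually lifts on its own after 1-2 hours)\n\n### 3. Duplicate Content\n\n#### Symptom\nDifferent URLs yield the same post content.\n\n#### Possible Causes\n- Threads share/repost behavior\n- Dynamic changes in the page structure\n- The selector captures the wrong region\n\n#### Solution\nUse the built-in deduplication:\ndf_clean = df.drop_duplicates(subset=['post_text_and_replies'], keep='first')\n\n### 4. Chrome Driver Problems\n\n#### Symptom\n`selenium.common.exceptions.WebDriverException: 'chromedriver' executable needs to be in PATH`\n\n#### Solution\n1. Install the Chrome browser\n2. Download the matching version of ChromeDriver\n3. Put ChromeDriver on your PATH or in the project directory\n4. Or let webdriver-manager manage it automatically:\nfrom selenium import webdriver\nfrom webdriver_manager.chrome import ChromeDriverManager\nfrom selenium.webdriver.chrome.service import Service\n\noptions = webdriver.ChromeOptions()\nservice = Service(ChromeDriverManager().install())\ndriver = webdriver.Chrome(service=service, options=options)\n\n### 5. Out of Memory\n\n#### Symptom\nThe program runs out of memory or crashes when processing large amounts of data.\n\n#### Solution\n1. Process in batches:\nbatch_size = 50\nfor i in range(0, len(url_list), batch_size):\n    batch = url_list[i:i+batch_size]\n    # process the batch\n\n2. Release memory promptly:\nimport gc\n\ndf = None # release the large DataFrame\ngc.collect() # force garbage collection\n\n### 6. Encoding Problems\n\n#### Symptom\nChinese text in the CSV file appears garbled.\n\n#### Solution\nUse the correct encoding:\ndf.to_csv('output.csv', encoding='utf-8-sig', index=False)\n\n## Debugging Tips\n\n### 1. Enable Verbose Logging\nimport logging\nlogging.basicConfig(level=logging.DEBUG)\n\n### 2. Save the Error Page\nif \"error\" in driver.page_source:\n    driver.save_screenshot(\"error_page.png\")\n    with open(\"error_page.html\", \"w\", encoding=\"utf-8\") as f:\n        f.write(driver.page_source)\n\n### 3. Step-by-Step Debugging\nprint(f\"Current URL: {driver.current_url}\")\nprint(f\"Page title: {driver.title}\")\nprint(f\"Blocks found: {len(blocks)}\")\n\n## Performance Tuning\n\n### 1. Skip Unnecessary Resources\noptions.add_argument(\"--blink-settings=imagesEnabled=false\")  # skip image loading\n\n### 2. Use Headless Mode (for testing)\noptions.add_argument(\"--headless\")\n\n### 3. 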
Set the Page Load Strategy\noptions.page_load_strategy = \"eager\"\n\n📁 Final File List\nRequired files:\nREADME.md - Project overview\n\nrequirements.txt - Dependencies\n\n.gitignore - Ignore rules\n\nsrc/threads_batch_spider.py - Main scraper script\n\ndocs/troubleshooting.md - Troubleshooting\n\nSample data files (optional):\ndata/raw/all_threads_urls_first_crawl.csv - Sample URL list\n\ndata/processed/threads_for_notebooklm.csv - Sample processed data\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fechou0723%2Fthreads-","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fechou0723%2Fthreads-","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fechou0723%2Fthreads-/lists"}