{"id":28331203,"url":"https://github.com/olivovar/goodreads-scraper","last_synced_at":"2026-04-25T23:35:05.394Z","repository":{"id":293325325,"uuid":"983686298","full_name":"olivovar/Goodreads-Scraper","owner":"olivovar","description":"Scrapes Goodreads book data and user reviews for research and analysis.","archived":false,"fork":false,"pushed_at":"2025-05-22T16:59:31.000Z","size":39274,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-03T05:28:14.173Z","etag":null,"topics":["automation","beautifulsoup","data-science","goodreads","python","research-tool","selenium","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/olivovar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-14T18:56:32.000Z","updated_at":"2025-05-22T16:59:35.000Z","dependencies_parsed_at":"2025-05-14T19:56:15.111Z","dependency_job_id":null,"html_url":"https://github.com/olivovar/Goodreads-Scraper","commit_stats":null,"previous_names":["olivovar/goodreads-scraper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/olivovar/Goodreads-Scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olivovar%2FGoodreads-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olivovar%2FGoodreads-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olivovar%2FGoodreads-Scraper/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olivovar%2FGoodreads-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/olivovar","download_url":"https://codeload.github.com/olivovar/Goodreads-Scraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/olivovar%2FGoodreads-Scraper/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260672099,"owners_count":23044739,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","beautifulsoup","data-science","goodreads","python","research-tool","selenium","web-scraping"],"created_at":"2025-05-26T18:28:56.555Z","updated_at":"2026-04-25T23:35:05.389Z","avatar_url":"https://github.com/olivovar.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Goodreads Scraper\n\nA Python-based scraper that collects book reviews and metadata from Goodreads using both static and dynamic scraping techniques. 
Built as part of a research co-op trial at the BACSS Lab, the scraper processes hundreds of books and tens of thousands of user reviews with reliability, efficiency, and ethical design.\n\n---\n\n## Features\n\n- Matches books from a list using fuzzy title-author comparison\n- Dynamically scrapes up to 100 user reviews per book\n- Extracts reviewer ID, rating, date, text, upvotes, comments, and tags\n- Saves output to a cumulative CSV file that supports pause/resume\n- Logs missing data, handles pagination, and deduplicates reviews\n- Resilient to login timeouts, missing buttons, and dynamic content loading\n\n---\n\n## Project Structure\n\n```\nGoodreads-Scraper/\n├── scraperGoodreads.py        # Main scraping script\n├── data/\n│   ├── goodreads_list.csv     # Book list input\n│   └── reviews_output.csv     # Review data output\n├── docs/\n│   └── Goodreads_Scraper_Task.pdf  # Methodology write-up\n├── requirements.txt\n├── .gitignore\n└── README.md\n```\n\n---\n\n## Getting Started\n\n### 1. Clone the repo\n\n```bash\ngit clone https://github.com/olivovar/Goodreads-Scraper.git\ncd Goodreads-Scraper\n```\n\n### 2. Install dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n### 3. Prepare input data\n\nUpdate `data/goodreads_list.csv` with your list of books (title, author, book ID).\n\n### 4. Run the scraper\n\n```bash\npython scraperGoodreads.py\n```\n\nReviews will be saved incrementally in `data/reviews_output.csv`.\n\n---\n\n## Methodology Summary\n\nThe scraper uses **fuzzy string matching** (via `rapidfuzz`) to identify the correct Goodreads page for each book. It filters out irrelevant results (e.g., summaries or guides) and navigates through paginated reviews using **Selenium** with waits and scroll events to handle dynamic content.\n\nProgress is saved after every book, allowing the script to resume seamlessly even after interruptions. 
Duplicate reviews are avoided using reviewer IDs, and books that cannot be matched or scraped are logged and skipped.\n\nFor more detail, see the full [methodology write-up](docs/Goodreads_Scraper_Task.pdf).\n\n---\n\n## GenAI Usage\n\nGenerative AI (ChatGPT) was used in early stages to:\n- Interpret dynamic HTML structure and identify selectors\n- Refine architectural strategies for robustness and resumability\n- Debug Selenium/browser session issues\n\nAs development progressed, AI was used less for low-level scraping and more as a second-opinion tool for improving reliability, efficiency, and design.\n\n---\n\n## Future Improvements\n\n- Parallelization for faster scraping\n- Enhanced exception logging and summary reporting\n- Support for scraping ratings without reviews\n\n---\n\n## Author\n\n**Olivia Pivovar**  \n📍 Boston, MA  \n[GitHub](https://github.com/olivovar) | [LinkedIn](https://linkedin.com/in/oliviapivovar)\n\n---\n\n## ⚠ Disclaimer\n\nThis project is for **educational and research purposes only**. Always consult Goodreads' [robots.txt](https://www.goodreads.com/robots.txt) and terms of use before running any scraper.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folivovar%2Fgoodreads-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Folivovar%2Fgoodreads-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Folivovar%2Fgoodreads-scraper/lists"}