https://github.com/wenyalintw/google-patents-scraper
Automatically download all PDF files of searching results & their patent families found on Google Patents.
- Host: GitHub
- URL: https://github.com/wenyalintw/google-patents-scraper
- Owner: wenyalintw
- License: mit
- Created: 2019-08-10T13:07:01.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:00:07.000Z (about 2 years ago)
- Last Synced: 2023-03-10T13:41:04.768Z (almost 2 years ago)
- Topics: crawler, google-patents, patent, patents, pdf, scraper, scraping, scrapy, web-scraping
- Language: Python
- Homepage: https://wenyalintw.github.io/project/google-patents-scraper/
- Size: 54.9 MB
- Stars: 35
- Watchers: 3
- Forks: 14
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Google Patents Scraper
(1) Automatically download all PDF files of search results & their patent families.
(2) Generate an overview report of the search results.
## Table of contents
* [Application Demo](#application-demo)
* [Introduction](#introduction)
* [Built With](#built-with)
* [Getting Started](#getting-started)
* [Acknowledgments](#acknowledgments)

## Application Demo
### [Google Patents Scraper – Demo (YouTube)](https://youtu.be/HRl3ChPxbIo)

## Introduction
This application scrapes Google Patents in two steps:

* Set Proxy (Optional)
* Search & Download Patents

### Set Proxy (Optional)
* Set a proxy to keep your current IP from being blocked by Google Patents
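
Before relying on a proxy, it helps to confirm that it can actually reach Google Patents. The following is a minimal sketch of such a check using only the standard library. It is not the application's own implementation; the function names, the test URL, and the timeout are assumptions chosen for illustration.

```python
import http.client
import urllib.error
import urllib.request


def proxy_mapping(proxy: str) -> dict:
    """Turn 'host:port' (or a full proxy URL) into the scheme -> URL dict urllib expects."""
    url = proxy if "://" in proxy else f"http://{proxy}"
    return {"http": url, "https": url}


def proxy_works(proxy: str,
                test_url: str = "https://patents.google.com/",  # assumed test target
                timeout: float = 5.0) -> bool:
    """Return True if `test_url` answers with HTTP 200 when fetched through `proxy`."""
    handler = urllib.request.ProxyHandler(proxy_mapping(proxy))
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts, and HTTP errors all count as "not usable".
        return False
```

Calling `proxy_works("203.0.113.5:8080")` (a placeholder address) would return `False` unless a working proxy is listening there, so a list of candidates can be filtered with it before searching.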
### Search & Download Patents
* Select an output directory to store downloaded/generated files
* Search for whatever you like (search terms use the same format as Google Patents)
* Download PDF files of the search results & their patent families

PDF files and an auto-generated `overview.md` will then be stored in the selected directory.
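
The download step can be pictured as streaming each PDF to a file in the output directory. The sketch below uses the standard library's `urllib` rather than the `requests` / `fake-useragent` combination the project lists, and it is not taken from the project's code. The function name, the fixed `USER_AGENT` string, and the timeout are assumptions; the PDF URL must be supplied by the caller.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumption: a fixed placeholder User-Agent. The project lists fake-useragent
# for this purpose; how it is used there is not shown in the README.
USER_AGENT = "Mozilla/5.0 (compatible; patent-pdf-sketch)"


def download_file(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Stream `url` to `dest`, creating parent directories as needed."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(dest, "wb") as fh:
        # copyfileobj copies in chunks, so a large PDF is never held fully in memory.
        shutil.copyfileobj(response, fh)
    return dest
```

A call such as `download_file(pdf_url, Path(outdir) / "PDFs" / "CN104321947A.pdf")` would place the file where the output layout expects it, given a valid `pdf_url`.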
### File Structure of Output Directory
├── PDFs
│ ├── CN104321947A.pdf
│ ├── ...
│ └── readme.txt
├── Family_PDFs
│ ├── CN104321947A's\ Family
│ │ ├── EP2850716B1.pdf
│ │ ├── ...
│ │ └── readme.txt
│ ├── ...
│ └── ...
└── overview.md
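
The layout above can be expressed as a few path helpers, which is handy when post-processing the downloaded files. This is a sketch based only on the tree shown here. The helper names are hypothetical, and the family folder name assumes that `CN104321947A's\ Family` in the tree denotes a folder called `CN104321947A's Family`.

```python
from pathlib import Path
from typing import Dict, List


def pdf_path(outdir: Path, patent_id: str) -> Path:
    """Location of a search-result PDF, e.g. PDFs/CN104321947A.pdf."""
    return outdir / "PDFs" / f"{patent_id}.pdf"


def family_pdf_path(outdir: Path, patent_id: str, member_id: str) -> Path:
    """Location of a family member's PDF under the result's own family folder."""
    return outdir / "Family_PDFs" / f"{patent_id}'s Family" / f"{member_id}.pdf"


def planned_files(outdir: Path, families: Dict[str, List[str]]) -> List[Path]:
    """List every PDF path plus overview.md for a {result: [family members]} mapping."""
    paths: List[Path] = []
    for patent_id, members in families.items():
        paths.append(pdf_path(outdir, patent_id))
        paths.extend(family_pdf_path(outdir, patent_id, m) for m in members)
    paths.append(outdir / "overview.md")
    return paths
```

For example, `planned_files(Path("out"), {"CN104321947A": ["EP2850716B1"]})` yields the result PDF, one family PDF, and `overview.md`, in that order.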
* The demo's output directory is located at [Demo_outdir](https://github.com/wenyalintw/Google-Patents-Scraper/tree/master/Demo_outdir)
* [overview.md](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/Demo_outdir/overview.md) summarizes the completed search

## Built With
Modules besides Python built-ins:

* Web Scraping - [Selenium](https://www.seleniumhq.org/) / [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) / [requests](https://2.python-requests.org//en/master/)
* GUI framework - [PyQt5](https://pypi.org/project/PyQt5/)
* Others - [fake-useragent](https://github.com/hellysmile/fake-useragent) / [tqdm](https://pypi.org/project/tqdm/)

## Getting Started
### Prerequisites
* Download a [ChromeDriver](https://chromedriver.chromium.org/) that corresponds to your Chrome version
* Replace the one in [src/resources](https://github.com/wenyalintw/Google-Patents-Scraper/tree/master/src/resources) with it

### Installation
* Clone the repo
```sh
git clone https://github.com/wenyalintw/Google-Patents-Scraper.git
```

* Install required modules listed in [requirements.txt](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/requirements.txt)
```sh
pip install -r /path/to/requirements.txt
```

* Ready to go
```sh
cd src
python main.py
```

## Acknowledgments
- The proxy-checking process is modified from [ApsOps's repo](https://github.com/ApsOps/proxy-checker)
- [search.png](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/src/resources/iconfinder_search_461380.png) is licensed under "CC BY 3.0" and was downloaded from [ICONFINDER](https://www.iconfinder.com/icons/1609653/brain_organs_icon)

###### MIT License (2019), Wen-Ya Lin