https://github.com/wenyalintw/google-patents-scraper

Automatically download all PDF files of search results and their patent families found on Google Patents.

crawler google-patents patent patents pdf scraper scraping scrapy web-scraping


Host: GitHub
URL: https://github.com/wenyalintw/google-patents-scraper
Owner: wenyalintw
License: mit
Created: 2019-08-10T13:07:01.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-12-08T06:00:07.000Z (over 2 years ago)
Last Synced: 2025-04-05T11:11:21.153Z (3 months ago)
Topics: crawler, google-patents, patent, patents, pdf, scraper, scraping, scrapy, web-scraping
Language: Python
Homepage: https://wenyalintw.github.io/project/google-patents-scraper/
Size: 54.9 MB
Stars: 63
Watchers: 3
Forks: 22
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE


# Google Patents Scraper

(1) Automatically download all PDF files of search results and their patent families.

(2) Generate an overview report of the search results.


## Table of contents

* [Application Demo](#application-demo)

* [Introduction](#introduction)

* [Built With](#built-with)

* [Getting Started](#getting-started)

* [Acknowledgments](#acknowledgments)

## Application Demo

### [Google Patents Scraper – Demo (YouTube)](https://youtu.be/HRl3ChPxbIo)

## Introduction

This application scrapes Google Patents in two steps:

* Set Proxy (Optional)

* Search & Download Patents

### Set Proxy (Optional)

* Set a proxy to keep your current IP address from being blocked by Google Patents
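As a rough illustration of what checking a proxy before use can look like, here is a minimal sketch using only the Python standard library. It is not the application's own code (the app's checker is adapted from ApsOps's repo, see Acknowledgments). The function names `build_proxy_map` and `proxy_is_alive`, the test URL and the timeout are all assumptions made for this example.

```python
import http.client
import urllib.error
import urllib.request


def build_proxy_map(host: str, port: int) -> dict:
    """Return the scheme -> proxy URL mapping expected by urllib's ProxyHandler."""
    address = f"http://{host}:{port}"
    return {"http": address, "https": address}


def proxy_is_alive(host: str, port: int,
                   test_url: str = "https://patents.google.com/",  # assumed test target
                   timeout: float = 5.0) -> bool:
    """Send one request through the proxy; True only if it answers with HTTP 200."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(build_proxy_map(host, port))
    )
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Refused connections, timeouts and HTTP errors all count as "not usable".
        return False
```

A proxy that fails this kind of check is probably not worth entering in the GUI.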



    



### Search & Download Patents

* Select an output directory to store downloaded/generated files

* Search for anything you like (search terms use the same format as on Google Patents)

* Download the PDF files of the search results and their patent families

The PDF files and an auto-generated `overview.md` are then stored in the selected directory.
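For orientation, the sketch below shows how a search term and a publication number could be turned into Google Patents URLs. The URL patterns (`?q=` for searches, `/patent/<number>` for a patent page) and the helper names `search_url` and `patent_page_url` are assumptions for illustration; the application drives a browser through Selenium and may not build URLs this way.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://patents.google.com/"


def search_url(terms: str) -> str:
    """Build a search URL; the terms are passed as typed into Google Patents."""
    return f"{BASE_URL}?{urlencode({'q': terms})}"


def patent_page_url(publication_number: str) -> str:
    """Build the page URL of a single patent, e.g. for CN104321947A."""
    return f"{BASE_URL}patent/{quote(publication_number)}"


print(search_url("wireless charging"))
print(patent_page_url("CN104321947A"))
```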



    



### File Structure of Output Directory


```
├── PDFs
│   ├── CN104321947A.pdf
│   ├── ...
│   └── readme.txt
├── Family_PDFs
│   ├── CN104321947A's\ Family
│   │   ├── EP2850716B1.pdf
│   │   ├── ...
│   │   └── readme.txt
│   ├── ...
│   └── ...
└── overview.md
```
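The layout above can be reproduced with a few `pathlib` helpers. This is a sketch under stated assumptions, not the application's code: the helper names are hypothetical, and the family folder is assumed to be named `<patent>'s Family` with a plain space (the backslash in the tree is read as a shell escape).

```python
from pathlib import Path


def pdf_path(outdir: Path, patent: str) -> Path:
    """Location of a search result's PDF, e.g. PDFs/CN104321947A.pdf."""
    return outdir / "PDFs" / f"{patent}.pdf"


def family_pdf_path(outdir: Path, patent: str, member: str) -> Path:
    """Location of a family member's PDF under the parent patent's folder."""
    return outdir / "Family_PDFs" / f"{patent}'s Family" / f"{member}.pdf"


def save_pdf(target: Path, content: bytes) -> None:
    """Create missing parent folders, then write the PDF bytes."""
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(content)
```

Keeping the path logic in small functions like these makes it easy to change the layout in one place.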



* The demo's output directory is located at [Demo_outdir](https://github.com/wenyalintw/Google-Patents-Scraper/tree/master/Demo_outdir)

* [overview.md](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/Demo_outdir/overview.md) summarizes the completed search

## Built With

Modules used besides Python built-ins:

 * Web Scraping - [Selenium](https://www.seleniumhq.org/) / [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) / [requests](https://2.python-requests.org//en/master/)

 * GUI framework - [PyQt5](https://pypi.org/project/PyQt5/)

 * Others - [fake-useragent](https://github.com/hellysmile/fake-useragent) / [tqdm](https://pypi.org/project/tqdm/)

## Getting Started

### Prerequisites

* Download the [ChromeDriver](https://chromedriver.chromium.org/) build that corresponds to your Chrome version

* Replace the one in [src/resources](https://github.com/wenyalintw/Google-Patents-Scraper/tree/master/src/resources) with it
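A mismatched driver is a common reason Selenium fails to start, so it can help to compare versions first. The sketch below assumes that both programs print their version when called with `--version` (as ChromeDriver does, and Chrome does on Linux and macOS) and that matching major versions is a sufficient check. The function names and the example commands are hypothetical.

```python
import re
import subprocess


def major_version(version_text: str) -> int:
    """Extract the major version from text such as 'ChromeDriver 114.0.5735.90 (...)'."""
    match = re.search(r"(\d+)\.\d+", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    return int(match.group(1))


def versions_match(chromedriver_path: str, chrome_command: str) -> bool:
    """Run both programs with --version and compare their major versions."""
    outputs = [
        subprocess.run([program, "--version"],
                       capture_output=True, text=True, check=True).stdout
        for program in (chromedriver_path, chrome_command)
    ]
    driver_major, chrome_major = (major_version(text) for text in outputs)
    return driver_major == chrome_major
```

For example, `versions_match("src/resources/chromedriver", "google-chrome")` could be called before starting the app; both arguments depend on your system.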

### Installation

* Clone the repo

```sh
git clone https://github.com/wenyalintw/Google-Patents-Scraper.git
```

* Install required modules listed in [requirements.txt](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/requirements.txt)

```sh
pip install -r /path/to/requirements.txt
```

* Ready to go

```sh
cd src
python main.py
```

## Acknowledgments

- The proxy-checking process is modified from [ApsOps's repo](https://github.com/ApsOps/proxy-checker)

- [search.png](https://github.com/wenyalintw/Google-Patents-Scraper/blob/master/src/resources/iconfinder_search_461380.png) is licensed under "CC BY 3.0" and was downloaded from [ICONFINDER](https://www.iconfinder.com/icons/1609653/brain_organs_icon)

###### MIT License (2019), Wen-Ya Lin
