https://github.com/jwarwick/patentcrawler

Web scraper for USPTO patent site

*Note: This is legacy code that almost certainly doesn't scrape the USPTO site correctly anymore. Only storing here in case the basic code is useful to someone.*

# PatentCrawler v1.5.11 README
John Warwick, 16 June 2006

PatentCrawler is a tool designed to allow researchers to download patents from the [USPTO](http://patft.uspto.gov/netahtml/PTO/search-adv.htm) website and export relevant fields.

## *IMPORTANT WARNING*
The USPTO website is a public resource, and the USPTO actively bans IP blocks that place too much
strain on its servers. For this reason, PatentCrawler limits the number of patents per day it will
attempt to download. From their website, I infer that downloading fewer than 1000 patents per day
will keep you in the safe range. However, if other users at your site are accessing the website,
their traffic may count towards your daily total. Use this program at your own risk; I do not know
how to get you off of the blacklist.

## Configuration
The first time you run PatentCrawler you should set two variables, available under the _Configuration_
tab. First, create an empty folder and specify the path to this folder as your Cache Path (using the
_Set_ button). This folder will store the raw HTML of any patents you download. The cache is designed
to hold files from multiple search sets, allowing you to significantly speed up your searches if they
contain overlapping patent numbers. The cache is also used to generate exported data. Be sure to
place this folder in a location that will not be deleted. Next, set the number of patents per day
that PatentCrawler will attempt to download. This setting only controls the rate at which patents
are downloaded; it does not keep a persistent counter across invocations of the application. The
default is 500.
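
As a rough illustration of what a per-day rate means in practice, the sketch below converts the setting into a delay between downloads. It assumes downloads are spread evenly over 24 hours, which this README does not confirm; the function names are hypothetical and are not taken from PatentCrawler itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_between_downloads(patents_per_day: int = 500) -> float:
    """Delay that spreads `patents_per_day` downloads evenly over one day.

    Assumption: even spacing. The original application may pace differently.
    """
    if patents_per_day <= 0:
        raise ValueError("patents_per_day must be positive")
    return SECONDS_PER_DAY / patents_per_day


def estimated_total_seconds(remaining: int, patents_per_day: int = 500) -> float:
    """Rough time needed to download `remaining` patents at the given rate."""
    return remaining * seconds_between_downloads(patents_per_day)
```

Under that assumption, the default of 500 patents per day works out to one download roughly every 172.8 seconds (86400 / 500), so a search set of 1000 patents would take about two days.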

## Usage
To use PatentCrawler, enter a search string (the same format as used on the USPTO Advanced Search site).
Check the _Add Referenced By Patents_ checkbox if you wish to include all US patents that reference a
patent in the search results, and check the _Add US References_ checkbox if you wish to include all
US patents cited by a patent in the search results. Then click the _Search_ button.
After all of the search results are downloaded and parsed, the Search Set field of the application will
update to indicate the number of patents returned.

Next, press the _Start_ button. PatentCrawler will now begin downloading patents from the USPTO site and placing them in the
cache. The time until the next download and the total estimated time are displayed, as well as any
status and error messages that may be generated by the download. You may pause the downloading at any
time by pressing the _Stop_ button. To restart, press _Start_ again. The result pages that were imported
may be saved as a Search Set. Choose _File->Save_ to specify a file to hold the list of imported patent
numbers. You may re-open saved searches using the _File->Open_ command.
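
The cache-first behavior described above (overlapping search sets reuse already downloaded files) can be sketched as follows. This is a minimal illustration, not the original implementation: the `<patent number>.html` file naming and the `fetch` callable, which stands in for the actual HTTP download, are both assumptions.

```python
from pathlib import Path
from typing import Callable


def load_patent_html(
    patent_number: str,
    cache_dir: Path,
    fetch: Callable[[str], str],
) -> str:
    """Return the raw HTML for a patent, downloading only on a cache miss.

    `fetch` is a hypothetical callable that downloads one patent page.
    The one-file-per-patent layout is an assumption for this sketch.
    """
    cached = cache_dir / f"{patent_number}.html"
    if cached.exists():
        # Cache hit: no request is sent to the USPTO site.
        return cached.read_text(encoding="utf-8")

    html = fetch(patent_number)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html
```

Passing the download function in as a parameter keeps the sketch independent of any particular HTTP library and makes the cache logic easy to exercise without network access.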

If the _Add Referenced By Patents_ checkbox is selected, PatentCrawler will store each patent number in
the starting search set in a separate list. The size of this list is displayed in the _Remaining
references_ field. The referenced-by searches begin only after all the patents in the initial search
set, and all of the citations in those patents (if the _Add US References_ checkbox is selected),
have been downloaded. As more patents are discovered, they are added to the patents remaining list
and are processed before any further referenced-by searches are carried out.
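
The ordering described above can be sketched with two queues. This is only an illustration under stated assumptions: `us_references` and `referenced_by` are hypothetical callables standing in for parsing a downloaded patent and running a referenced-by search, and the sketch assumes citations are followed only for patents in the initial search set, which the README does not state explicitly.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_order(
    search_set: Iterable[str],
    us_references: Callable[[str], Iterable[str]],
    referenced_by: Callable[[str], Iterable[str]],
    add_us_references: bool = True,
    add_referenced_by: bool = True,
) -> List[str]:
    """Return the order in which patents would be downloaded."""
    initial = list(dict.fromkeys(search_set))  # de-duplicate, keep order
    remaining = deque(initial)                 # the "patents remaining" list
    pending_searches = deque(initial) if add_referenced_by else deque()
    seen = set(initial)
    order: List[str] = []

    def enqueue(numbers: Iterable[str]) -> None:
        for number in numbers:
            if number not in seen:
                seen.add(number)
                remaining.append(number)

    while remaining or pending_searches:
        if remaining:
            # Downloads always take priority over referenced-by searches.
            number = remaining.popleft()
            order.append(number)
            # Assumption: only the initial search set has its citations followed.
            if add_us_references and number in initial:
                enqueue(us_references(number))
        else:
            # Run one referenced-by search, then drain whatever it discovered.
            enqueue(referenced_by(pending_searches.popleft()))
    return order
```

The `seen` set keeps a patent that appears in several result lists from being queued twice, which matters when search sets overlap.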

## Export
When you have downloaded at least one patent into the cache, the _Export_ button in the _Export_ tab
becomes active. Pressing this button traverses the list of downloaded patents in the current search
set and opens those files from the cache. From these files, the fields specified in the _Export_ tab are
written to a tab-delimited file which the user selects from a file dialog.
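
A tab-delimited export of selected fields might look like the sketch below. It assumes the fields have already been parsed out of the cached HTML into dictionaries; the field names used in the usage note (`number`, `title`) are placeholders, since this README does not list the actual exportable fields.

```python
import csv
from typing import Dict, Iterable, Sequence, TextIO


def export_tab_delimited(
    records: Iterable[Dict[str, str]],
    fields: Sequence[str],
    out: TextIO,
) -> None:
    """Write one header row, then one row per patent, separated by tabs.

    `records` are hypothetical pre-parsed patents. A field that is missing
    from a record is written as an empty cell.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        writer.writerow([record.get(name, "") for name in fields])
```

For example, calling `export_tab_delimited(records, ["number", "title"], f)` with `f` opened via `open(path, "w", newline="", encoding="utf-8")` writes only those two columns, regardless of what other keys each record carries.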

In addition, there is an _Export Special_ button which generates a report-style output file. This file
is not configurable from the application.

## Known Bugs
* HTTP requests that generate errors are not stored in a separate list, so the downloader could loop endlessly retrying them
* Can't differentiate between Delaware and Germany when exporting patents (presumably because both are abbreviated DE)