{"id":18863897,"url":"https://github.com/jwarwick/patentcrawler","last_synced_at":"2025-06-30T03:06:29.532Z","repository":{"id":66704219,"uuid":"9520896","full_name":"jwarwick/PatentCrawler","owner":"jwarwick","description":"Web scraper for USPTO patent site","archived":false,"fork":false,"pushed_at":"2013-04-21T03:14:27.000Z","size":136,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-20T09:12:36.124Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jwarwick.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.txt","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-04-18T12:15:53.000Z","updated_at":"2018-07-22T21:26:31.000Z","dependencies_parsed_at":"2023-02-20T09:45:24.775Z","dependency_job_id":null,"html_url":"https://github.com/jwarwick/PatentCrawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwarwick%2FPatentCrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwarwick%2FPatentCrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwarwick%2FPatentCrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jwarwick%2FPatentCrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jwarwick","download_url":"https://codeload.github.com/jwarwick/PatentCrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositor
ies_count":255887566,"owners_count":22303840,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T04:39:01.497Z","updated_at":"2025-06-30T03:06:29.510Z","avatar_url":"https://github.com/jwarwick.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"*Note: This is legacy code that almost certainly doesn't scrape the USPTO site correctly anymore. Only storing here in case the basic code is useful to someone.*\r\n\r\n# PatentCrawler v1.5.11 README\r\nJohn Warwick, 16 June 2006\r\n\r\nPatentCrawler is a tool designed to allow researchers to download patents from the [USPTO](http://patft.uspto.gov/netahtml/PTO/search-adv.htm) website and export relevant fields.\r\n\r\n## *IMPORTANT WARNING*\r\nThe USPTO website is a public resource and they actively ban IP blocks that place too much strain\r\non their servers.  To this end, PatentCrawler is limited in the number of patents per day it will\r\nattempt to download.  From their website, I infer that downloading fewer than 1000 patents per day\r\nwill keep you in the safe range.  However, if other users at your site are accessing the website,\r\nthis may count towards your daily total.  Use this program at your own risk; I do not know how to\r\nget you off of the blacklist.\r\n\r\n## Configuration\r\nThe first time you run PatentCrawler you should set two variables, available under the _Configuration_\r\ntab.  First, create an empty folder and specify the path to this folder as your Cache Path (using the\r\n_Set_ button).  
This folder will store the raw html of any patents you download.  The cache is designed\r\nto hold files from multiple search sets, allowing you to significantly speed up your searches if they\r\ncontain overlapping patent numbers.  The cache is also used to generate exported data.  Be sure to\r\nselect this path in a location that will not be deleted.  Next, set the number of patents per day\r\nthat PatentCrawler will attempt to download.  This setting only signifies the rate at which patents\r\nwill be downloaded; it does not keep a static counter across invocations of the application.  The \r\ndefault is 500.\r\n\r\n## Usage\r\nTo use PatentCrawler, enter a search string (the same format as used on the USPTO Advanced Search site),\r\nand check or uncheck the _Add Referenced By Patents_ checkbox (if you wish to include all US Patents\r\nthat reference a patent in the search results) and the _Add US References_ checkbox (if you wish to\r\ninclude all US patents cited by a patent in the search results).  Then click the _Search_ button.\r\nAfter all of the search results are downloaded and parsed, the Search Set field of the application will\r\nupdate to indicate the number of patents returned.  \r\n\r\nNext, press the _Start_ button.  PatentCrawler will now begin downloading patents from the USPTO site and placing them in the\r\ncache.  The time until the next download and the total estimated time are displayed, as well as any\r\nstatus and error messages that may be generated by the download.  You may pause the downloading at any\r\ntime by pressing the _Stop_ button.  To restart, press _Start_ again.  The result pages that were imported\r\nmay be saved as a Search Set.  Choose _File-\u003eSave_ to specify a file to hold the list of imported patent\r\nnumbers.  
You may re-open saved searches using the _File-\u003eOpen_ command.\r\n\r\nIf the _Add Referenced By Patents_ checkbox is selected, PatentCrawler will store each patent number in\r\nthe starting search set in a separate list.  The size of this list is displayed in the _Remaining\r\nreferences_ field.  Only after all the patents in the initial search set and all of the citations in \r\nthose patents (if the _Add US References_ checkbox is selected) are downloaded will the referenced by\r\npatents initiate their search.  As more patents are discovered, they are added to the patents\r\nremaining list, and will be processed before more referenced by searches are carried out.\r\n\r\n## Export\r\nWhen you have downloaded at least one patent into the cache, the _Export_ button in the _Export_ tab \r\nbecomes active.  Pressing this button traverses the list of downloaded patents in the current search\r\nset and opens those files from the cache.  From these files, the fields specified in the _Export_ tab are\r\nwritten to a tab-delimited file which the user selects from a file dialog.\r\n\r\nIn addition, there is an _Export Special_ button which generates a report-style output file.  This file\r\nis not configurable from the application.\r\n\r\n## Known Bugs\r\n* Not storing error generating http requests in a separate list, could loop endlessly\r\n* Can't differentiate between Delaware and Germany when exporting patents\r\n\r\n\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwarwick%2Fpatentcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjwarwick%2Fpatentcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjwarwick%2Fpatentcrawler/lists"}