Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/monperrus/crawler-user-agents
Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star:
- Host: GitHub
- URL: https://github.com/monperrus/crawler-user-agents
- Owner: monperrus
- License: mit
- Created: 2014-03-07T20:44:56.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2024-10-23T05:56:48.000Z (3 months ago)
- Last Synced: 2024-10-29T15:01:54.949Z (3 months ago)
- Language: Go
- Homepage:
- Size: 507 KB
- Stars: 1,184
- Watchers: 44
- Forks: 254
- Open Issues: 12
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- favorite-link - Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders.
- awesome-hacking-lists - monperrus/crawler-user-agents - Syntactic patterns of HTTP user-agents used by bots / robots / crawlers / scrapers / spiders. pull-request welcome :star: (Python)
README
# crawler-user-agents
This repository contains a list of HTTP user-agents used by robots, crawlers, and spiders, as a single JSON file.
* NPM package:
* Go package:
* PyPi package:

Each `pattern` is a regular expression. It should work out-of-the-box with your favorite regex library.
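
As a minimal sketch of what "works with your favorite regex library" can look like, the snippet below applies patterns with Python's standard `re` module. The two entries are illustrative stand-ins shaped like the JSON file (the `totobot` URL is hypothetical), and `matches_any` is a helper name invented for this example, not part of any package.

```python
import re

# Illustrative entries shaped like the JSON file; the real list is much longer.
SAMPLE_ENTRIES = [
    {"pattern": "rogerbot", "url": "http://moz.com/help/pro/what-is-rogerbot-"},
    {"pattern": "totobot", "url": "https://example.com/totobot"},  # hypothetical entry
]


def matches_any(user_agent, entries):
    """Return True if any entry's pattern occurs somewhere in the user agent."""
    # re.search rather than re.match: patterns are fragments that may
    # appear anywhere in the User-Agent string, not only at its start.
    return any(re.search(entry["pattern"], user_agent) for entry in entries)


print(matches_any("rogerbot/2.3 example UA", SAMPLE_ENTRIES))
print(matches_any("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0", SAMPLE_ENTRIES))
```
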
If you use this project in a commercial product, [please sponsor it](https://github.com/sponsors/monperrus).
## Install
### Direct download
Download the [`crawler-user-agents.json` file](https://raw.githubusercontent.com/monperrus/crawler-user-agents/master/crawler-user-agents.json) from this repository directly.
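
Once downloaded, the file can be read with any JSON parser. The sketch below is one possible sanity check, assuming each entry has a `pattern` and an optional `instances` list as in the example entry in the Contributing section. `SAMPLE_JSON` is a one-entry stand-in for the real file's contents, and `unmatched_instances` is a hypothetical helper name.

```python
import json
import re

# Stand-in for the downloaded file's contents (the real file has many entries).
SAMPLE_JSON = """
[
  {
    "pattern": "rogerbot",
    "addition_date": "2014/02/28",
    "url": "http://moz.com/help/pro/what-is-rogerbot-",
    "instances": ["rogerbot/2.3 example UA"]
  }
]
"""


def unmatched_instances(json_text):
    """Return (pattern, instance) pairs where an instance fails its own pattern."""
    problems = []
    for entry in json.loads(json_text):
        regex = re.compile(entry["pattern"])
        # "instances" is treated as optional here; that is an assumption.
        for instance in entry.get("instances", []):
            if not regex.search(instance):
                problems.append((entry["pattern"], instance))
    return problems


print(unmatched_instances(SAMPLE_JSON))
```

With the real file, the same function could be given the text read from the downloaded `crawler-user-agents.json`.
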
### Javascript
crawler-user-agents is published on npmjs.com.
Install it with npm or yarn:
```sh
npm install --save crawler-user-agents
# OR
yarn add crawler-user-agents
```

In Node.js, you can `require` the package to get an array of crawler user agents.
```js
const crawlers = require('crawler-user-agents');
console.log(crawlers);
```

### Python
Install with `pip install crawler-user-agents`
Then:
```python
import crawleruseragents
if crawleruseragents.is_crawler("Googlebot/"):
    # do something
    pass
```

or:
```python
import crawleruseragents
indices = crawleruseragents.matching_crawlers("bingbot/2.0")
print("crawlers' indices:", indices)
print(
    "crawler's URL:",
    crawleruseragents.CRAWLER_USER_AGENTS_DATA[indices[0]]["url"]
)
```

Note that `matching_crawlers` is much slower than `is_crawler`, if the given User-Agent does indeed match any crawlers.
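
One plausible way to picture this difference, offered as an assumption and not as a description of the package's internals: a yes/no check can be answered by a single combined regex, while listing every match requires testing each pattern in turn once something has matched. The sketch below uses only the standard library; the patterns and the `*_sketch` function names are illustrative.

```python
import re

# Illustrative patterns only; not the package's data.
PATTERNS = ["rogerbot", "totobot", "bingbot"]
COMPILED = [re.compile(p) for p in PATTERNS]
# One alternation answers "does anything match?" with a single search.
COMBINED = re.compile("|".join(f"(?:{p})" for p in PATTERNS))


def is_crawler_sketch(user_agent):
    """Yes/no answer from one search over the combined pattern."""
    return COMBINED.search(user_agent) is not None


def matching_crawlers_sketch(user_agent):
    """Indices of all matching patterns; scans every pattern on a hit."""
    # Cheap rejection first, so non-crawlers cost about as much as is_crawler_sketch.
    if not is_crawler_sketch(user_agent):
        return []
    return [i for i, regex in enumerate(COMPILED) if regex.search(user_agent)]


print(is_crawler_sketch("bingbot/2.0"))         # True
print(matching_crawlers_sketch("bingbot/2.0"))  # [2]
```

Under this design the extra cost of listing matches is only paid when the User-Agent is a crawler, which is consistent with the note above.
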
### Go
Use [this package](https://pkg.go.dev/github.com/monperrus/crawler-user-agents).
It provides the global variable `Crawlers` (kept in sync with `crawler-user-agents.json`)
and the functions `IsCrawler` and `MatchingCrawlers`.

Example Go program:
```go
package main

import (
	"fmt"

	"github.com/monperrus/crawler-user-agents"
)

func main() {
	userAgent := "Mozilla/5.0 (compatible; Discordbot/2.0; +https://discordapp.com)"

	isCrawler := agents.IsCrawler(userAgent)
	fmt.Println("isCrawler:", isCrawler)

	indices := agents.MatchingCrawlers(userAgent)
	fmt.Println("crawlers' indices:", indices)
	fmt.Println("crawler's URL:", agents.Crawlers[indices[0]].URL)
}
```

Output:
```
isCrawler: true
crawlers' indices: [237]
crawler's URL: https://discordapp.com
```

## Contributing
I do welcome additions contributed as pull requests.
The pull requests should:
* contain a single addition
* specify a discriminating, relevant syntactic fragment (for example "totobot" and not "Mozilla/5 totobot v20131212.alpha1")
* contain the pattern (generic regular expression), the discovery date (year/month/day) and the official url of the robot
* result in a valid JSON file (don't forget the comma between items)

Example:

    {
      "pattern": "rogerbot",
      "addition_date": "2014/02/28",
      "url": "http://moz.com/help/pro/what-is-rogerbot-",
      "instances" : ["rogerbot/2.3 example UA"]
    }

## License
The list is under an [MIT License](https://opensource.org/licenses/MIT). The versions prior to Nov 7, 2016 were under a [CC-SA](http://creativecommons.org/licenses/by-sa/3.0/) license.
## Related work
There are a few wrapper libraries that use this data to detect bots:
* [Voight-Kampff](https://github.com/biola/Voight-Kampff) (Ruby)
* [isbot](https://github.com/Hentioe/isbot) (Ruby)
* [crawlers](https://github.com/Olical/crawlers) (Clojure)
* [isBot](https://github.com/omrilotan/isbot) (Node.JS)

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
* [Crawler-Detect](https://github.com/JayBizzle/Crawler-Detect) (PHP)
* [BrowserDetector](https://github.com/mimmi20/BrowserDetector) (PHP)
* [browscap](https://github.com/browscap/browscap) (JSON files)