https://github.com/promptapi/scraper-go
Golang wrapper for Prompt API's Scraper API
api-marketplace api-wrapper css-selector css-selector-parser data-extraction golang golang-package image-scraper image-scraping promptapi scraper scraper-api web-scraper web-scraping
- Host: GitHub
- URL: https://github.com/promptapi/scraper-go
- Owner: promptapi
- License: mit
- Created: 2020-09-07T19:33:57.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2020-10-06T07:33:22.000Z (over 5 years ago)
- Last Synced: 2023-07-27T22:29:33.813Z (over 2 years ago)
- Topics: api-marketplace, api-wrapper, css-selector, css-selector-parser, data-extraction, golang, golang-package, image-scraper, image-scraping, promptapi, scraper, scraper-api, web-scraper, web-scraping
- Language: Go
- Homepage: https://promptapi.com/marketplace/description/scraper-api
- Size: 43.9 KB
- Stars: 3
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md


# Prompt API - Scraper - Golang Package
The `PromptAPI` struct is a simple Go wrapper for the [Scraper API][scraper-api],
with a few extras on top.
---
## Requirements
1. You need to sign up for [Prompt API][promptapi-signup].
1. You need to subscribe to the [scraper api][scraper-api]; the test drive is **free!!!**
1. You need to set the `PROMPTAPI_TOKEN` environment variable after subscribing.

Then install the package:
```bash
$ go get -u github.com/promptapi/scraper-go
```
---
## Example Basic Usage
```go
// main.go
package main

import (
	"fmt"
	"log"

	scraper "github.com/promptapi/scraper-go"
)

func main() {
	s := new(scraper.PromptAPI)
	params := &scraper.Params{
		URL:     "https://pypi.org/classifiers/",
		Country: "EE",
	}
	extraHeaders := []*scraper.ExtraHeader{} // custom extra headers (none here)
	result := new(scraper.Result)

	// Fetch the page; the response is written into result.
	err := s.Scrape(params, extraHeaders, result)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Length of incoming data: %d\n", len(result.Data))
	fmt.Printf("Response headers: %v\n", result.Headers)
	fmt.Printf("Content-Length: %v\n", result.Headers["Content-Length"])

	// Write the scraped content to disk.
	fileSize, err := s.Save("/tmp/test.html", result)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Size of /tmp/test.html -> %d bytes\n", fileSize)
}
```
Run:
```bash
$ go run main.go
Length of incoming data: 321322
Response headers: map[Accept-Ranges:bytes Content-Length:321322 Content-Security-Policy:base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ *.fastly-insights.com sentry.io https://api.pwnedpasswords.com https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com https://cdn.ravenjs.com; style-src 'self' fonts.googleapis.com; worker-src *.fastly-insights.com Content-Type:text/html; charset=UTF-8 Date:Tue, 08 Sep 2020 19:10:24 GMT ETag:"1ea9p+Hscl37dEKelacPWw" Referrer-Policy:origin-when-cross-origin Strict-Transport-Security:max-age=31536000; includeSubDomains; preload Vary:Accept-Encoding, Cookie, Accept-Encoding X-Cache:MISS, HIT X-Cache-Hits:0, 1 X-Content-Type-Options:nosniff X-Frame-Options:deny X-Permitted-Cross-Domain-Policies:none X-Served-By:cache-bwi5127-BWI, cache-hhn4035-HHN X-Timer:S1599592224.395422,VS0,VE247 X-XSS-Protection:1; mode=block]
Content-Length: 321322
Size of /tmp/test.html -> 321322 bytes
```
You can add URL parameters for extra operations. Valid parameters are:
- `AuthPassword`: for HTTP Realm auth password
- `AuthUsername`: for HTTP Realm auth username
- `Cookie`: URL Encoded cookie header.
- `Country`: 2 character country code. If you wish to scrape from an IP address of a specific country.
- `Referer`: HTTP referer header
- `Selector`: CSS style selector path such as `a.btn div li`. If `Selector` is
set, the returned result is a collection of the matched data and the saved
file is in `.json` format.
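Two of the parameters above have format constraints: `Cookie` must be URL encoded and `Country` must be a 2 character code. Here is a minimal sketch of preparing such values with the standard library. `encodeCookie` and `validCountry` are hypothetical helpers, not part of this package, and the choice of query-style escaping (a space becomes `+`) is an assumption about what the API accepts.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// countryCodeRe matches exactly two ASCII letters.
var countryCodeRe = regexp.MustCompile(`^[A-Za-z]{2}$`)

// encodeCookie URL-encodes a raw Cookie header value.
// Assumption: query-style escaping is what the API expects.
func encodeCookie(raw string) string {
	return url.QueryEscape(raw)
}

// validCountry reports whether code looks like a 2 character country code.
// It checks the shape only, not whether the country exists.
func validCountry(code string) bool {
	return countryCodeRe.MatchString(code)
}

func main() {
	fmt.Println(encodeCookie("session=abc123; theme=dark"))
	fmt.Println(validCountry("EE"), validCountry("EST"))
}
```

The encoded string would then go into the `Cookie` parameter, and the country check can run before a request is sent.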
Example with `Selector`:
```go
// main.go
package main

import (
	"fmt"
	"log"

	scraper "github.com/promptapi/scraper-go"
)

func main() {
	s := new(scraper.PromptAPI)
	params := &scraper.Params{
		URL:      "https://pypi.org/classifiers/",
		Country:  "EE",
		Selector: "ul li button[data-clipboard-text]",
	}
	// Add extra request headers.
	// Field names are kept as in the original example. Go only allows setting
	// exported (capitalized) fields from another package, so check the package
	// docs for the exact names.
	extraHeaders := []*scraper.ExtraHeader{
		&scraper.ExtraHeader{
			name:  "X-Referer",
			value: "https://www.google.com",
		},
	}
	result := new(scraper.Result)

	err := s.Scrape(params, extraHeaders, result)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Length of incoming data: %d\n", len(result.Data))
	fmt.Printf("Length of extracted data: %d\n", len(result.DataSelector))
	fmt.Printf("Response headers: %v\n", result.Headers)
	fmt.Printf("Content-Length: %v\n", result.Headers["Content-Length"])

	// With Selector set, the saved file contains JSON.
	fileSize, err := s.Save("/tmp/test.json", result)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Size of /tmp/test.json -> %d bytes\n", fileSize)
}
```
Run:
```bash
$ go run main.go
Length of incoming data: 0
Length of extracted data: 734
Response headers: map[Accept-Ranges:bytes Content-Length:321322 Content-Security-Policy:base-uri 'self'; block-all-mixed-content; connect-src 'self' https://api.github.com/repos/ *.fastly-insights.com sentry.io https://api.pwnedpasswords.com https://2p66nmmycsj3.statuspage.io; default-src 'none'; font-src 'self' fonts.gstatic.com; form-action 'self'; frame-ancestors 'none'; frame-src 'none'; img-src 'self' https://warehouse-camo.ingress.cmh1.psfhosted.org/ www.google-analytics.com *.fastly-insights.com; script-src 'self' www.googletagmanager.com www.google-analytics.com *.fastly-insights.com https://cdn.ravenjs.com; style-src 'self' fonts.googleapis.com; worker-src *.fastly-insights.com Content-Type:text/html; charset=UTF-8 Date:Tue, 08 Sep 2020 19:17:22 GMT ETag:"1ea9p+Hscl37dEKelacPWw" Referrer-Policy:origin-when-cross-origin Strict-Transport-Security:max-age=31536000; includeSubDomains; preload Vary:Accept-Encoding, Cookie, Accept-Encoding X-Cache:HIT, HIT X-Cache-Hits:1, 1 X-Content-Type-Options:nosniff X-Frame-Options:deny X-Permitted-Cross-Domain-Policies:none X-Served-By:cache-bwi5137-BWI, cache-bma1621-BMA X-Timer:S1599592641.178639,VS0,VE1512 X-XSS-Protection:1; mode=block]
Content-Length: 321322
Size of /tmp/test.json -> 173717 bytes
```
Let’s take a look at the `/tmp/test.json` file:
```json
[
  "\n Copy\n\n",
  "\n Copy\n\n",
  "\n Copy\n\n",
  "\n Copy\n\n",
  "\n Copy\n\n",
  "\n Copy\n\n",
  "\n Copy\n\n",
  ...
]
```
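Because the saved file is a JSON array of strings, it can be loaded back with the standard `encoding/json` package. The sketch below assumes exactly that shape. `parseSelectorJSON` is a hypothetical helper, not part of this package, and the sample input is a shortened stand-in for the real file.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"strings"
)

// parseSelectorJSON decodes a JSON array of strings and trims the
// surrounding whitespace from every entry.
func parseSelectorJSON(data []byte) ([]string, error) {
	var items []string
	if err := json.Unmarshal(data, &items); err != nil {
		return nil, err
	}
	for i, item := range items {
		items[i] = strings.TrimSpace(item)
	}
	return items, nil
}

func main() {
	// Shortened stand-in for the contents of /tmp/test.json.
	sample := []byte(`["\n Copy\n\n", "\n Copy\n\n"]`)

	items, err := parseSelectorJSON(sample)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%d items, first: %q\n", len(items), items[0])
}
```

To work with the real file, read it with `os.ReadFile("/tmp/test.json")` and pass the bytes to the same function.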
---
## Development
Available rake tasks:
```bash
$ rake -T
rake default # Default task, show avaliable tasks
rake release:check # Do release check
rake release:publish[revision] # Publish project with revision: major,minor,patch, default: patch
rake serve_doc[port] # Run doc server
rake test[verbose] # Run tests
```
- Run tests: `rake test` or `rake test[-v]`
- Run doc server: `rake serve_doc` or `rake serve_doc[9000]`
Release package (*if you have write access*):
1. Commit your changes
1. Run `rake release:check`
1. If all goes ok, run `rake release:publish`
---
## License
This project is licensed under the MIT License.
---
## Contributor(s)
* [Prompt API](https://github.com/promptapi) - Creator, maintainer
---
## Contribute
All PRs are welcome!
1. `fork` (https://github.com/promptapi/scraper-go/fork)
1. Create your `branch` (`git checkout -b my-feature`)
1. `commit` your changes (`git commit -am 'Add awesome features...'`)
1. `push` your `branch` (`git push origin my-feature`)
1. Then create a new **Pull Request**!
This project is intended to be a safe,
welcoming space for collaboration, and contributors are expected to adhere to
the [code of conduct][coc].
---
[scraper-api]: https://promptapi.com/marketplace/description/scraper-api
[promptapi-signup]: https://promptapi.com/#signup-form
[coc]: https://github.com/promptapi/scraper-go/blob/main/CODE_OF_CONDUCT.md