Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/yhat/scrape

A simple, higher level interface for Go web scraping.

Host: GitHub
URL: https://github.com/yhat/scrape
Owner: yhat
License: bsd-2-clause
Created: 2015-05-18T18:20:30.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2016-11-28T14:46:10.000Z (over 7 years ago)
Last Synced: 2024-01-18T10:20:47.939Z (5 months ago)
Language: Go
Homepage:
Size: 9.77 KB
Stars: 1,503
Watchers: 42
Forks: 104
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-crawler - scrape - A simple, higher level interface for Go web scraping. (Go)
my-awesome-stars - yhat/scrape - A simple, higher level interface for Go web scraping. (Go)
awesome-stars - yhat/scrape - A simple, higher level interface for Go web scraping. (Go)
my-awesome-github-stars - yhat/scrape - A simple, higher level interface for Go web scraping. (Go)
awesome-stars - scrape - A simple, higher level interface for Go web scraping. (Go)
awesome-crawler - scrape - A simple, higher level interface for Go web scraping. (Go)
awesome-crawlers - scrape - 08-22 | A simple, higher level interface for Go web scraping. | (Go)
awesome-crawler-cn - scrape - A simple web scraper that provides a good development interface. (Go)

README

# scrape

A simple, higher level interface for Go web scraping.

When scraping with Go, I find myself redefining tree traversal and other utility functions.

This package is a place to put some simple tools which build on top of the [Go HTML parsing library](https://godoc.org/golang.org/x/net/html).

For the full interface, check out the godoc:

[![GoDoc](https://godoc.org/github.com/yhat/scrape?status.svg)](https://godoc.org/github.com/yhat/scrape)

## Sample

Scrape defines traversal functions like `Find` and `FindAll` while attempting to be generic. It also defines convenience functions such as `Attr` and `Text`.

```go
// Parse the page (resp is the *http.Response returned by an earlier http.Get)
root, err := html.Parse(resp.Body)
if err != nil {
    // handle error
}
// Search for the title
title, ok := scrape.Find(root, scrape.ByTag(atom.Title))
if ok {
    // Print the title
    fmt.Println(scrape.Text(title))
}
```
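
To make the matcher idea concrete, here is a minimal sketch of matcher-driven traversal that uses only the standard library. It is not this package's implementation: `Node`, `Matcher`, `byTag`, `find`, `findAll`, `text` and `sampleTree` are hypothetical stand-ins for `*html.Node` and the real `scrape` functions, and the real package may differ in details such as how nested matches are handled, so check the godoc for actual behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a hypothetical, simplified stand-in for *html.Node.
type Node struct {
	Tag      string            // element name; empty for text nodes
	Text     string            // content of a text node
	Attrs    map[string]string // element attributes
	Children []*Node
}

// Matcher reports whether a node should be selected.
type Matcher func(*Node) bool

// byTag returns a Matcher that selects elements with the given tag name.
func byTag(tag string) Matcher {
	return func(n *Node) bool { return n.Tag == tag }
}

// findAll walks the tree depth-first and collects every matching node.
func findAll(n *Node, m Matcher) []*Node {
	var out []*Node
	if m(n) {
		out = append(out, n)
	}
	for _, c := range n.Children {
		out = append(out, findAll(c, m)...)
	}
	return out
}

// find returns the first match in depth-first order, if any.
func find(n *Node, m Matcher) (*Node, bool) {
	if m(n) {
		return n, true
	}
	for _, c := range n.Children {
		if hit, ok := find(c, m); ok {
			return hit, true
		}
	}
	return nil, false
}

// text joins the trimmed text nodes at or below n with single spaces.
func text(n *Node) string {
	var parts []string
	var walk func(*Node)
	walk = func(n *Node) {
		if n.Tag == "" {
			if s := strings.TrimSpace(n.Text); s != "" {
				parts = append(parts, s)
			}
		}
		for _, c := range n.Children {
			walk(c)
		}
	}
	walk(n)
	return strings.Join(parts, " ")
}

// sampleTree builds a small document by hand: one title and two links.
func sampleTree() *Node {
	link := func(href, label string) *Node {
		return &Node{
			Tag:      "a",
			Attrs:    map[string]string{"href": href},
			Children: []*Node{{Text: label}},
		}
	}
	return &Node{Tag: "html", Children: []*Node{
		{Tag: "title", Children: []*Node{{Text: "Example"}}},
		{Tag: "body", Children: []*Node{link("/a", "first"), link("/b", "second")}},
	}}
}

func main() {
	root := sampleTree()
	if title, ok := find(root, byTag("title")); ok {
		fmt.Println(text(title))
	}
	for _, a := range findAll(root, byTag("a")) {
		fmt.Printf("%s (%s)\n", text(a), a.Attrs["href"])
	}
}
```

Because a matcher is just a function from a node to a bool, any predicate (tag, attribute, position in the tree) can be passed to the same traversal, which is roughly what "attempting to be generic" amounts to here.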

## A full example: Scraping Hacker News

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/yhat/scrape"
	"golang.org/x/net/html"
	"golang.org/x/net/html/atom"
)

func main() {
	// request and parse the front page
	resp, err := http.Get("https://news.ycombinator.com/")
	if err != nil {
		panic(err)
	}
	// close the body once parsing is done
	defer resp.Body.Close()

	root, err := html.Parse(resp.Body)
	if err != nil {
		panic(err)
	}

	// define a matcher: links whose grandparent has class "athing"
	matcher := func(n *html.Node) bool {
		// must check for nil values before dereferencing the parents
		if n.DataAtom == atom.A && n.Parent != nil && n.Parent.Parent != nil {
			return scrape.Attr(n.Parent.Parent, "class") == "athing"
		}
		return false
	}

	// grab all articles and print them
	articles := scrape.FindAll(root, matcher)
	for i, article := range articles {
		fmt.Printf("%2d %s (%s)\n", i, scrape.Text(article), scrape.Attr(article, "href"))
	}
}
```