https://github.com/anaskhan96/soup

Web Scraper in Go, similar to BeautifulSoup
beautifulsoup go golang html-node web-scraper webscraper webscraping

Last synced: about 2 months ago
Host: GitHub
URL: https://github.com/anaskhan96/soup
Owner: anaskhan96
License: mit
Created: 2017-02-23T19:28:58.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2023-11-02T18:55:42.000Z (over 1 year ago)
Last Synced: 2025-04-11T06:13:50.137Z (3 months ago)
Topics: beautifulsoup, go, golang, html-node, web-scraper, webscraper, webscraping
Language: Go
Homepage:
Size: 99.6 KB
Stars: 2,195
Watchers: 35
Forks: 167
Open Issues: 22
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: license

Awesome Lists containing this project

my-awesome - anaskhan96/soup - node,web-scraper,webscraper,webscraping pushed_at:2023-11 star:2.2k fork:0.2k Web Scraper in Go, similar to BeautifulSoup (Go)
awesome-list - soup
awesome-golang-repositories - soup

README

# soup

[![Build Status](https://travis-ci.org/anaskhan96/soup.svg?branch=master)](https://travis-ci.org/anaskhan96/soup)

[![GoDoc](https://godoc.org/github.com/anaskhan96/soup?status.svg)](https://pkg.go.dev/github.com/anaskhan96/soup)

[![Go Report Card](https://goreportcard.com/badge/github.com/anaskhan96/soup)](https://goreportcard.com/report/github.com/anaskhan96/soup)

**Web Scraper in Go, similar to BeautifulSoup**

*soup* is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented so far:

```go
var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string, error) {} // Takes the URL as an argument, returns the HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the URL and a custom HTTP client as arguments, returns the HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the URL, bodyType, and payload as arguments, returns the HTML string
func PostForm(string, url.Values) (string, error) {} // Takes the URL and body; bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie to be sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the DOM constructed
func Find(...string) Root {} // Element tag, (attribute key-value pair) as arguments, pointer to the first occurrence returned
func FindAll(...string) []Root {} // Same as Find(), but pointers to all occurrences returned
func FindStrict(...string) Root {} // Element tag, (attribute key-value pair) as arguments, pointer to the first occurrence with exactly matching values returned
func FindAllStrict(...string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Find all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookup to their respective values
func Text() string {} // Full text inside a non-nested tag returned, first half returned in a nested one
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() string {} // Returns the HTML code for the specific element
```

`Root` is a struct containing three fields:

* `Pointer` containing the pointer to the current HTML node

* `NodeValue` containing the current HTML node's value, i.e. the tag name for an ElementNode, or the text in case of a TextNode

* `Error` containing an error in a struct if one occurs, else `nil`

A detailed text explanation of the error can be accessed using the `Error()` function. The field `Type` in this struct, of type `ErrorType`, denotes the kind of error that took place and is one of the following:

* `ErrUnableToParse`

* `ErrElementNotFound`

* `ErrNoNextSibling`

* `ErrNoPreviousSibling`

* `ErrNoNextElementSibling`

* `ErrNoPreviousElementSibling`

* `ErrCreatingGetRequest`

* `ErrInGetRequest`

* `ErrReadingResponse`

## Installation

Install the package with:

```bash
go get github.com/anaskhan96/soup
```

## Example

The example below scrapes the "Comics I Enjoy" section (the link text and URLs) from [xkcd](https://xkcd.com).

[More Examples](https://github.com/anaskhan96/soup/tree/master/examples)

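```go
package main

import (
	"fmt"
	"os"

	"github.com/anaskhan96/soup"
)

func main() {
	// Fetch the page; soup.Get returns the response body as an HTML string.
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	// Locate the container first so a missing div is reported instead of
	// chaining FindAll onto an empty result.
	div := doc.Find("div", "id", "comicLinks")
	if div.Error != nil {
		fmt.Fprintln(os.Stderr, "lookup failed:", div.Error)
		os.Exit(1)
	}
	// Print the text and target of every link inside the container.
	for _, link := range div.FindAll("a") {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}
```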

## Contributions

This package was developed in my free time. However, contributions from everybody in the community are welcome, to make it a better web scraper. If you think there should be a particular feature or function included in the package, feel free to open up a new issue or pull request.
