https://github.com/chxmbley/sitex

Read text content from websites ignoring styling, behavior, and structure

Host: GitHub
URL: https://github.com/chxmbley/sitex
Owner: chxmbley
License: mit
Created: 2019-12-09T19:03:50.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2019-12-09T19:17:21.000Z (over 6 years ago)
Last Synced: 2026-01-14T18:25:03.565Z (5 months ago)
Topics: go, golang, parser, reader, text, text-analysis, web, webscraper, webscraping
Language: Go
Homepage:
Size: 4.88 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# jdchum/sitex

Package `jdchum/sitex` reads the text content of websites, ignoring styling, behavior, and structure. It can be used to search site text for keywords and phrases and to monitor text for changes.

## Install

```sh
go get -u github.com/jdchum/sitex
```

## Example

```go
package main

import (
    "os"

    "github.com/jdchum/sitex"
)

const url = "https://en.wikipedia.org/wiki/Go_(programming_language)"

func main() {
    // Get the site's text, joining parsed chunks with a single space
    text, err := sitex.GetSiteText(url, " ")
    if err != nil {
        panic(err)
    }

    // Output the text to disk. os.WriteFile (Go 1.16+) replaces the
    // deprecated ioutil.WriteFile.
    if err := os.WriteFile("out.txt", []byte(text), 0644); err != nil {
        panic(err)
    }
}
```
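
For the change-monitoring use case, one possible approach is to store a digest of the extracted text and compare it on the next run. The sketch below is only an illustration: `fingerprint` and `changed` are hypothetical helpers that are not part of sitex, and the two sample strings stand in for text returned by `sitex.GetSiteText` at different times.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint is a hypothetical helper (not part of sitex). It returns a
// hex-encoded SHA-256 digest of the extracted text.
func fingerprint(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

// changed is a hypothetical helper (not part of sitex). It reports whether
// text no longer matches the previously stored fingerprint.
func changed(previous, text string) bool {
	return fingerprint(text) != previous
}

func main() {
	// Stand-ins for two results of sitex.GetSiteText fetched at different times.
	first := "Welcome to the site"
	second := "Welcome to the new site"

	// Store the digest of the first fetch, then compare later fetches to it.
	stored := fingerprint(first)
	fmt.Println("first fetch changed:", changed(stored, first))
	fmt.Println("second fetch changed:", changed(stored, second))
}
```

Storing a digest rather than the full text keeps the saved state small, at the cost of not knowing what changed.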

## API

### `sitex.GetSiteText(url, sep string) (text string, err error)`

> Attempts to parse all human-readable text from a webpage. "Invisible" content such as HTML tags, JavaScript, and CSS is ignored.

* `url` - URL of the webpage to fetch and parse

* `sep` - Separator to place between chunks of parsed text

Returns the text parsed from the webpage, or an error if one occurred.
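
As a rough sketch of the keyword-search use case, the returned text can be scanned with the standard library. `countPhrases` below is a hypothetical helper, not part of sitex, and the sample string stands in for the result of `sitex.GetSiteText(url, " ")`.

```go
package main

import (
	"fmt"
	"strings"
)

// countPhrases is a hypothetical helper (not part of sitex). It reports how
// many times each phrase occurs in text, ignoring case. Matching is by
// substring, so "go" is also counted inside words such as "Google".
func countPhrases(text string, phrases []string) map[string]int {
	lower := strings.ToLower(text)
	counts := make(map[string]int, len(phrases))
	for _, p := range phrases {
		counts[p] = strings.Count(lower, strings.ToLower(p))
	}
	return counts
}

func main() {
	// Stand-in for the result of sitex.GetSiteText(url, " ").
	text := "Go is a statically typed language. Go was designed at Google."

	for phrase, n := range countPhrases(text, []string{"go", "designed at"}) {
		fmt.Printf("%q: %d\n", phrase, n)
	}
}
```

Using a single space as `sep` keeps multi-word phrases searchable, provided the words fall within one parsed chunk or adjacent chunks.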

## Limitations

Text is parsed as-is from the initial content returned by the server. This means that content requiring additional network requests or user interactions is not available to the parser.
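
The limitation can be illustrated with a small local experiment. The sketch below does not use sitex or reproduce its implementation; it assumes a hypothetical test site, built with `net/http/httptest`, whose page loads extra text from a made-up `/more` endpoint via script. A single GET request sees only the initial response, so the later text is absent.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newDemoServer starts a hypothetical local site. The page at "/" contains
// static text plus a script that would request "/more" in a browser.
func newDemoServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/more", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Loaded later")
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `<html><body><p>Served up front</p><script>fetch("/more")</script></body></html>`)
	})
	return httptest.NewServer(mux)
}

// initialContains performs one GET request, with no script execution and no
// follow-up requests, and reports whether the response body contains phrase.
func initialContains(url, phrase string) (bool, error) {
	resp, err := http.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return strings.Contains(string(body), phrase), nil
}

func main() {
	srv := newDemoServer()
	defer srv.Close()

	for _, phrase := range []string{"Served up front", "Loaded later"} {
		found, err := initialContains(srv.URL, phrase)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%q in initial response: %v\n", phrase, found)
	}
}
```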

## Roadmap

* [ ] Unicode support

* [ ] Parse visible text from attributes

* [ ] Follow server redirects

* [x] Parse embedded iframes

* [ ] Parse embedded PDF text

## License

MIT licensed. Copyright (c) 2019-2020 Joshua Chumbley. See the LICENSE file for details.
