https://github.com/alash3al/scraply

Scraply: a simple DOM scraper to fetch information from any HTML-based website

Host: GitHub
URL: https://github.com/alash3al/scraply
Owner: alash3al
License: apache-2.0
Created: 2019-05-29T15:50:34.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2022-07-07T20:14:48.000Z (almost 2 years ago)
Last Synced: 2024-06-18T23:00:15.417Z (4 days ago)
Topics: crawler, crawling, dom, golang, scraper, scrapers, scraping-websites, scrapy, server
Language: Go
Homepage:
Size: 47.9 KB
Stars: 126
Watchers: 8
Forks: 11
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-stars - alash3al/scraply - Scraply a simple dom scraper to fetch information from any html based website (Go)

README

Scraply
========

> Scraply is a very simple HTML scraping tool: if you know CSS and jQuery, you can use it. It aims to stay simple and tiny, and it can also be used as a component in a larger system, as in [this use-case](https://github.com/alash3al/scraply/issues/1#issuecomment-570082314).

Overview
========

> You can use `scraply` within your stack via the `cli` or over `http`.

```bash
# here is the CLI usage
# extracting the title and the description from the scraply GitHub repo page
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'

# same thing, but with a custom user agent
$ scraply extract \
    -u "https://github.com/alash3al/scraply" \
    -ua "OptionalCustomUserAgent" \
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'

# same thing, but asking scraply to return the response body for debugging purposes
$ scraply extract \
    --return-body \
    -u "https://github.com/alash3al/scraply" \
    -x title='$("title").text()' \
    -x description='$("meta[name=description]").attr("content")'
```
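When scraply is used as a component of a larger system, the command line above can be assembled programmatically. The following is a minimal Go sketch, not part of scraply itself: the helper name `extractArgs` is hypothetical, it uses only the flags shown above (`-u`, `-ua`, `--return-body`, `-x`), and it assumes a `scraply` binary is available on `PATH`. The output is printed unparsed because its exact format is not described here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"sort"
)

// extractArgs (hypothetical helper) builds the argument list for
// `scraply extract`: the target URL, an optional user agent, the optional
// --return-body switch, and one `-x name=expression` pair per extractor.
func extractArgs(url, userAgent string, returnBody bool, extractors map[string]string) []string {
	args := []string{"extract", "-u", url}
	if userAgent != "" {
		args = append(args, "-ua", userAgent)
	}
	if returnBody {
		args = append(args, "--return-body")
	}
	// Sort extractor names so the generated command line is deterministic.
	names := make([]string, 0, len(extractors))
	for name := range extractors {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		args = append(args, "-x", name+"="+extractors[name])
	}
	return args
}

func main() {
	args := extractArgs("https://github.com/alash3al/scraply", "", false, map[string]string{
		"title":       `$("title").text()`,
		"description": `$("meta[name=description]").attr("content")`,
	})
	// Assumption: `scraply` is installed and on PATH.
	out, err := exec.Command("scraply", args...).CombinedOutput()
	if err != nil {
		fmt.Fprintln(os.Stderr, "scraply failed:", err)
	}
	fmt.Print(string(out))
}
```

Passing the arguments as a slice to `exec.Command` avoids shell quoting problems with the `$("...")` expressions, which is why no shell is involved in this sketch.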

> For `http` usage, run the HTTP server, then use any HTTP client to interact with it.

```bash
# running the http server
# by default it listens on address ":8010", which is equivalent to "0.0.0.0:8010"
# for more information execute `$ scraply help`
$ scraply serve

# then, in another shell, execute the following curl command
$ curl http://localhost:8010/extract \
    -H "Content-Type: application/json" \
    -s \
    -d '{"url": "https://github.com/alash3al/scraply", "extractors": {"title": "$(\"title\").text()"}, "return_body": false, "user_agent": "CustomUserAgent"}'
```

> For debugging, there is an interactive `shell`:

```bash
$ scraply shell -u https://github.com/alash3al/scraply
➜ (scraply) > $("title").text()
GitHub - alash3al/scraply: Scraply a simple dom scraper to fetch information from any html based website and convert that info to JSON APIs
➜ (scraply) > request.url
https://github.com/alash3al/scraply
➜ (scraply) > response.status_code
200
➜ (scraply) > response.url
https://github.com/alash3al/scraply
➜ (scraply) > response.body
.....
```

Download?
==========

> You can go to the [releases page](https://github.com/alash3al/scraply/releases) and pick the latest version.

> Or you can run it with Docker: `$ docker run --rm -it ghcr.io/alash3al/scraply scraply help`

Contribution?
==============

> For sure you can contribute. How?

- clone the repo

- create your fix/feature branch

- create a pull request

Nothing else, enjoy!

About
=====

> I'm [Mohamed Al Ashaal](https://alash3al.com), a software engineer :)