https://github.com/katorres02/ruby-content-parser
Web scraping in Ruby with Nokogiri
- Host: GitHub
- URL: https://github.com/katorres02/ruby-content-parser
- Owner: katorres02
- License: apache-2.0
- Created: 2017-10-31T04:16:44.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2017-10-31T14:20:27.000Z (over 8 years ago)
- Last Synced: 2025-02-01T19:44:04.080Z (over 1 year ago)
- Topics: nokogiri, ruby, rubyonrails, scraping-websites, webcrawler
- Language: Ruby
- Size: 196 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# INDEX CONTENT WITH RUBY
This is an example of indexing HTML content using Ruby on Rails and the [Nokogiri gem](https://github.com/sparklemotion/nokogiri).
### Installation
* Clone the repository `git clone https://github.com/katorres02/ruby-content-parser`
* Install gems `bundle install`
* Create database `rake db:create db:migrate`
* Run tests `bin/rake`
* Run server `rails s`
### Web Usage
You can see a live [Demo here](https://rocky-shelf-60680.herokuapp.com/).
Every URL indexed through the API is stored in a database. You can view this information in the web dashboard or retrieve it by calling one of the API endpoints.
* Indexed URLs dashboard
* Content stored for each URL
### API Usage
The API exposes a resource called "page" with two web services. The first searches a URL for specific HTML tags, then indexes and stores what it finds. The second retrieves the stored information for a URL.
* `POST http://HOST_URL/api/v1/pages`
Indexes the content of a URL.
#### Request
| Name | Description | Example |
| ------------- |:-------------:| -----:|
| url | Target URL you want to index | https://github.com/sparklemotion/nokogiri |
| tags | Tag or tags to search for; separate multiple tags with commas | h1,h2,h3,a |
#### Response
| Name | Description |
| ------------- |:-------------:|
| id | Unique database identifier |
| url | URL that was scanned |
| stored_tags | Array of indexed tags |
| stored_elements | Array of elements found for each tag |
| stored_elements[id] | Unique database identifier of the element |
| stored_elements[tag] | HTML tag the element belongs to |
| stored_elements[html] | Markup inside the HTML tag; this string contains HTML code |
| stored_elements[content] | Text of the element as a user sees it on the page |
| stored_elements[href] | URL from the element's href attribute; only present for links (a) |
#### Example
##### Request example
`POST http://HOST_URL/api/v1/pages`
params
```json
{ "url": "https://github.com/sparklemotion/nokogiri", "tags": "h1" }
```
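
As a rough illustration, the request above could be built from Ruby with the standard library alone. This is a sketch under stated assumptions: `build_index_request` is a hypothetical helper, `http://localhost:3000` stands in for `HOST_URL`, and sending the parameters form-encoded is an assumption, since the expected encoding is not stated here.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) the POST request for the
# index endpoint. Form encoding is an assumption; the API may also accept JSON.
def build_index_request(base_url, target_url, tags)
  uri = URI.join(base_url, '/api/v1/pages')
  request = Net::HTTP::Post.new(uri)
  # Accepts either "h1,h2" or %w[h1 h2] and always sends a comma-separated list.
  request.set_form_data('url' => target_url, 'tags' => Array(tags).join(','))
  request
end

request = build_index_request('http://localhost:3000', # placeholder host
                              'https://github.com/sparklemotion/nokogiri',
                              %w[h1 h2])
puts request.path
puts request.body

# Sending it requires a running server:
# response = Net::HTTP.start(request.uri.host, request.uri.port) do |http|
#   http.request(request)
# end
```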
##### Response example
```json
{
  "page": {
    "id": 1,
    "url": "https://github.com/sparklemotion/nokogiri",
    "stored_elements": [
      {
        "stored_element": {
          "id": 1,
          "tag": "h1",
          "html": "\n \n /nokogiri\n\n",
          "content": "\n \n sparklemotion/nokogiri\n\n",
          "href": null
        }
      },
      {
        "stored_element": {
          "id": 2,
          "tag": "h1",
          "html": "\nNokogiri",
          "content": "Nokogiri",
          "href": null
        }
      }
    ],
    "stored_tags": [
      "h1"
    ]
  }
}
```
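
A client consuming this response has to unwrap the `page`, `stored_elements`, and `stored_element` levels before reaching the actual fields. The sketch below shows one way to do that with Ruby's standard `json` library. `contents_by_tag` is a hypothetical helper and the sample body is a trimmed-down version of the response example.

```ruby
require 'json'

# Hypothetical helper: groups the visible text of every stored element by tag.
def contents_by_tag(response_body)
  page = JSON.parse(response_body).fetch('page')
  grouped = Hash.new { |hash, tag| hash[tag] = [] }
  page.fetch('stored_elements').each do |wrapper|
    element = wrapper.fetch('stored_element')
    # Strip the surrounding whitespace that the scraped text often carries.
    grouped[element['tag']] << element['content'].to_s.strip
  end
  grouped
end

# Trimmed-down body modelled on the response example; the quoted heredoc
# keeps the \n escapes literal so the JSON stays valid.
sample_body = <<~'BODY'
  {
    "page": {
      "id": 1,
      "url": "https://github.com/sparklemotion/nokogiri",
      "stored_elements": [
        { "stored_element": { "id": 1, "tag": "h1", "content": "\n sparklemotion/nokogiri\n", "href": null } },
        { "stored_element": { "id": 2, "tag": "h1", "content": "Nokogiri", "href": null } }
      ],
      "stored_tags": ["h1"]
    }
  }
BODY

p contents_by_tag(sample_body)
```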
* `GET http://HOST_URL/api/v1/pages.json?id=STORED_URL`
Returns the stored information for a URL.
#### Request
| Name | Description | Example |
| ------------- |:-------------:| -----:|
| id | Stored URL you want to look up | https://github.com/sparklemotion/nokogiri |
#### Response
| Name | Description |
| ------------- |:-------------:|
| id | Unique database identifier |
| url | URL that was scanned |
| stored_tags | Array of indexed tags |
| stored_elements | Array of elements found for each tag |
| stored_elements[id] | Unique database identifier of the element |
| stored_elements[tag] | HTML tag the element belongs to |
| stored_elements[html] | Markup inside the HTML tag; this string contains HTML code |
| stored_elements[content] | Text of the element as a user sees it on the page |
| stored_elements[href] | URL from the element's href attribute; only present for links (a) |
#### Example
##### Request example
`GET http://HOST_URL/api/v1/pages.json?id=https://github.com/sparklemotion/nokogiri`
params
```json
{ "id": "https://github.com/sparklemotion/nokogiri" }
```
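
Because the `id` parameter is itself a URL, it is usually safer to percent-encode it when building the query string. The sketch below does this with Ruby's standard `uri` library. `stored_page_uri` is a hypothetical helper, `http://localhost:3000` is a placeholder for `HOST_URL`, and whether the server also accepts the unencoded form is not stated here.

```ruby
require 'uri'

# Hypothetical helper: builds the lookup URL for a previously indexed page.
# The stored URL is percent-encoded so that its own ":" and "/" characters
# cannot be confused with the structure of the request URL.
def stored_page_uri(base_url, stored_url)
  uri = URI.join(base_url, '/api/v1/pages.json')
  uri.query = URI.encode_www_form(id: stored_url)
  uri
end

puts stored_page_uri('http://localhost:3000', # placeholder host
                     'https://github.com/sparklemotion/nokogiri')
# prints http://localhost:3000/api/v1/pages.json?id=https%3A%2F%2Fgithub.com%2Fsparklemotion%2Fnokogiri
```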
##### Response example
```json
{
  "page": {
    "id": 1,
    "url": "https://github.com/sparklemotion/nokogiri",
    "stored_elements": [
      {
        "stored_element": {
          "id": 1,
          "tag": "h1",
          "html": "\n \n /nokogiri\n\n",
          "content": "\n \n sparklemotion/nokogiri\n\n",
          "href": null
        }
      },
      {
        "stored_element": {
          "id": 2,
          "tag": "h1",
          "html": "\nNokogiri",
          "content": "Nokogiri",
          "href": null
        }
      }
    ],
    "stored_tags": [
      "h1"
    ]
  }
}
```
### Credits
* [Carlos Torres](https://github.com/katorres02) author
* [Nokogiri Gem](https://github.com/sparklemotion/nokogiri)
### License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/