https://github.com/katorres02/ruby-content-parser

Web scraping in Ruby with Nokogiri.

nokogiri ruby rubyonrails scraping-websites webcrawler

# INDEX CONTENT WITH RUBY

This is an example of indexing HTML content using Ruby on Rails and the [Nokogiri gem](https://github.com/sparklemotion/nokogiri).

### Installation
* Clone the repository: `git clone https://github.com/katorres02/ruby-content-parser`
* Install the gems: `bundle install`
* Create and migrate the database: `rake db:create db:migrate`
* Run the tests: `bin/rake`
* Run the server: `rails s`

### Web Usage
You can see a live [Demo here](https://rocky-shelf-60680.herokuapp.com/).

Every URL indexed through the API is stored in a database. You can browse this information in the web dashboard or retrieve it through one of the API endpoints.

* Indexed URLs dashboard:
![alt text](https://raw.githubusercontent.com/katorres02/ruby-content-parser/master/app/assets/images/index.png "Dashboard")

* Content stored for a URL:
![alt text](https://raw.githubusercontent.com/katorres02/ruby-content-parser/master/app/assets/images/show.png "details")

### API Usage

There is a resource called "page" that exposes two web services. One searches a page for specific HTML tags, then indexes and stores what it finds; the other retrieves the information stored for a given URL.

* `POST http://HOST_URL/api/v1/pages`
Indexes content from a URL.
#### Request

| Name | Description | Example |
| ------------- |:-------------:| -----:|
| url | Target URL you want to index | https://github.com/sparklemotion/nokogiri |
| tags | Tag or tags you want to search for; separate multiple tags with commas | h1,h2,h3,a |

#### Response

| Name | Description |
| ------------- |:-------------:|
| id | Unique database identifier |
| url | URL that was scanned |
| stored_tags | Array of indexed tags |
| stored_elements | Array of elements found for each tag |
| stored_elements[id] | Unique database identifier of the element |
| stored_elements[tag] | HTML tag the element belongs to |
| stored_elements[html] | String inside the HTML tag; this may contain HTML code |
| stored_elements[content] | Text visible to users, i.e. what a visitor sees on the page |
| stored_elements[href] | The element's href URL. Only set for links (a) |

#### Example
##### Request example
`POST http://HOST_URL/api/v1/pages`

params
```json
{ "url": "https://github.com/sparklemotion/nokogiri", "tags": "h1" }
```
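As a rough illustration, the request above could be built from Ruby with the standard library's `Net::HTTP`. This sketch assumes the API accepts form-encoded parameters, which the README does not state; the helper name `build_index_request` is hypothetical and `http://localhost:3000` stands in for `HOST_URL`.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) the indexing request.
# Assumption: the API accepts form-encoded parameters.
def build_index_request(host_url, target_url, tags)
  uri = URI.join(host_url, "/api/v1/pages")
  request = Net::HTTP::Post.new(uri)
  # Tags are sent as a single comma-separated string, e.g. "h1,h2".
  request.set_form_data("url" => target_url, "tags" => Array(tags).join(","))
  [uri, request]
end

uri, request = build_index_request(
  "http://localhost:3000", # stand-in for HOST_URL
  "https://github.com/sparklemotion/nokogiri",
  %w[h1 h2]
)
puts request.body

# To actually send it:
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
#   http.request(request)
# end
```

Building the request separately from sending it keeps the sketch runnable without a server.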
##### Response example
```json
{
  "page": {
    "id": 1,
    "url": "https://github.com/sparklemotion/nokogiri",
    "stored_elements": [
      {
        "stored_element": {
          "id": 1,
          "tag": "h1",
          "html": "\n \n sparklemotion/nokogiri\n\n",
          "content": "\n \n sparklemotion/nokogiri\n\n",
          "href": null
        }
      },
      {
        "stored_element": {
          "id": 2,
          "tag": "h1",
          "html": "\nNokogiri",
          "content": "Nokogiri",
          "href": null
        }
      }
    ],
    "stored_tags": [
      "h1"
    ]
  }
}
```
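A response of this shape can be consumed with Ruby's standard `json` library. The sketch below uses a trimmed copy of the example (the `html` fields are left out) and groups the visible text by tag; the variable names are illustrative.

```ruby
require "json"

# Trimmed copy of the response example. The quoted heredoc keeps the \n
# sequences literal so that JSON.parse handles them.
body = <<~'JSON'
  {
    "page": {
      "id": 1,
      "url": "https://github.com/sparklemotion/nokogiri",
      "stored_elements": [
        { "stored_element": { "id": 1, "tag": "h1",
            "content": "\n \n sparklemotion/nokogiri\n\n", "href": null } },
        { "stored_element": { "id": 2, "tag": "h1",
            "content": "Nokogiri", "href": null } }
      ],
      "stored_tags": ["h1"]
    }
  }
JSON

page = JSON.parse(body).fetch("page")

# Each array entry wraps its fields in a "stored_element" key.
elements = page.fetch("stored_elements").map { |entry| entry.fetch("stored_element") }

# Group the visible text by tag, stripping the surrounding whitespace.
contents_by_tag = elements
  .group_by { |el| el["tag"] }
  .transform_values { |els| els.map { |el| el["content"].strip } }

puts contents_by_tag.inspect
```

Note that `content` can carry leading and trailing whitespace from the original page, so stripping it before display is usually worthwhile.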

* `GET http://HOST_URL/api/v1/pages.json?id=STORED_URL`
Returns the stored information for a URL.
#### Request

| Name | Description | Example |
| ------------- |:-------------:| -----:|
| id | URL whose stored information you want to see | https://github.com/sparklemotion/nokogiri |

#### Response

| Name | Description |
| ------------- |:-------------:|
| id | Unique database identifier |
| url | URL that was scanned |
| stored_tags | Array of indexed tags |
| stored_elements | Array of elements found for each tag |
| stored_elements[id] | Unique database identifier of the element |
| stored_elements[tag] | HTML tag the element belongs to |
| stored_elements[html] | String inside the HTML tag; this may contain HTML code |
| stored_elements[content] | Text visible to users, i.e. what a visitor sees on the page |
| stored_elements[href] | The element's href URL. Only set for links (a) |

#### Example
##### Request example
`GET http://HOST_URL/api/v1/pages.json?id=https://github.com/sparklemotion/nokogiri`

params
```json
{ "id": "https://github.com/sparklemotion/nokogiri" }
```
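Because the `id` value is itself a URL, it contains characters such as `:` and `/` that are normally percent-encoded inside a query string. Whether the server also accepts the raw form shown above is not documented, so encoding it is the safer choice. A minimal sketch using the standard `uri` library, with the hypothetical helper `stored_page_uri` and `http://localhost:3000` standing in for `HOST_URL`:

```ruby
require "uri"

# Hypothetical helper: builds the lookup URL for a previously indexed page.
def stored_page_uri(host_url, stored_url)
  uri = URI.join(host_url, "/api/v1/pages.json")
  # encode_www_form percent-encodes the reserved characters in the value.
  uri.query = URI.encode_www_form(id: stored_url)
  uri
end

uri = stored_page_uri(
  "http://localhost:3000", # stand-in for HOST_URL
  "https://github.com/sparklemotion/nokogiri"
)
puts uri

# To fetch it: require "net/http", then Net::HTTP.get_response(uri).
```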
##### Response example
```json
{
  "page": {
    "id": 1,
    "url": "https://github.com/sparklemotion/nokogiri",
    "stored_elements": [
      {
        "stored_element": {
          "id": 1,
          "tag": "h1",
          "html": "\n \n sparklemotion/nokogiri\n\n",
          "content": "\n \n sparklemotion/nokogiri\n\n",
          "href": null
        }
      },
      {
        "stored_element": {
          "id": 2,
          "tag": "h1",
          "html": "\nNokogiri",
          "content": "Nokogiri",
          "href": null
        }
      }
    ],
    "stored_tags": [
      "h1"
    ]
  }
}
```

### Credits

* [Carlos Torres](https://github.com/katorres02) author
* [Nokogiri Gem](https://github.com/sparklemotion/nokogiri)

### License

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/