{"id":22468874,"url":"https://github.com/katorres02/ruby-content-parser","last_synced_at":"2026-05-10T19:24:40.162Z","repository":{"id":93000405,"uuid":"108944481","full_name":"katorres02/ruby-content-parser","owner":"katorres02","description":"web scrap in ruby with nokogiri","archived":false,"fork":false,"pushed_at":"2017-10-31T14:20:27.000Z","size":201,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-01T19:44:04.080Z","etag":null,"topics":["nokogiri","ruby","rubyonrails","scraping-websites","webcrawler"],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/katorres02.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-10-31T04:16:44.000Z","updated_at":"2018-03-06T03:56:35.000Z","dependencies_parsed_at":"2023-03-25T13:49:07.575Z","dependency_job_id":null,"html_url":"https://github.com/katorres02/ruby-content-parser","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/katorres02%2Fruby-content-parser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/katorres02%2Fruby-content-parser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/katorres02%2Fruby-content-parser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/katorres02%2Fruby-content-parser/manifests","owner_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/owners/katorres02","download_url":"https://codeload.github.com/katorres02/ruby-content-parser/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245874441,"owners_count":20686764,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["nokogiri","ruby","rubyonrails","scraping-websites","webcrawler"],"created_at":"2024-12-06T11:20:05.289Z","updated_at":"2026-05-10T19:24:35.109Z","avatar_url":"https://github.com/katorres02.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# INDEX CONTENT WITH RUBY\n\nThis is an example of indexing html content using Ruby on Rails and [nokogiri gem](https://github.com/sparklemotion/nokogiri).\n\n### Installation\n* Clone the repository `git clone https://github.com/katorres02/ruby-content-parser`\n* Install gems `bundle install`\n* Create database `rake db:create db:migrate`\n* Run tests `bin/rake`\n* Run server `rails s`\n\n### Web Usage\nYou can see a live [Demo here](https://rocky-shelf-60680.herokuapp.com/).\n\nEvery url indexed with the api is stored in a database. 
You can see this information in the web dashboard, or you can call one of the API endpoints to retrieve it.\n\n* Indexed Urls dashboard: \n![alt text](https://raw.githubusercontent.com/katorres02/ruby-content-parser/master/app/assets/images/index.png \"Dashboard\")\n\n* Content stored for url: \n![alt text](https://raw.githubusercontent.com/katorres02/ruby-content-parser/master/app/assets/images/show.png \"details\")\n\n\n### API Usage\n\nThere is a resource called \"page\" that exposes two web services: one searches, indexes, and stores the content of specific HTML tags, and the other retrieves the stored information for one URL.\n\n* `POST http://HOST_URL/api/v1/pages`\n  Index content from a URL.\n#### Request\n\n| Name        | Description           | Example  |\n| ------------- |:-------------:| -----:|\n| url      | Target URL you want to index | https://github.com/sparklemotion/nokogiri |\n| tags      | Tag or tags you want to search for; separate multiple tags with commas |   h1,h2,h3,a |\n\n#### Response\n\n| Name        | Description           |\n| ------------- |:-------------:|\n| id      | Unique database identifier |\n| url      | URL scanned |\n| stored_tags      | Array of indexed tags |\n| stored_elements      | Array of elements for each tag |\n| stored_elements[id]      | Unique database identifier of the element |\n| stored_elements[tag]      | HTML tag the element belongs to |\n| stored_elements[html]      | Raw HTML of the element, including its markup |\n| stored_elements[content]      | Text of the element as a user sees it on the page |\n| stored_elements[href]      |Element href url. 
Only for links (a)|\n\n#### Example\n##### Request example\n`POST http://HOST_URL/api/v1/pages`\n\nparams\n```json\n{ \"url\": \"https://github.com/sparklemotion/nokogiri\", \"tags\": \"h1\" }\n```\n##### Response example\n```json\n{\n    \"page\": {\n        \"id\": 1,\n        \"url\": \"https://github.com/sparklemotion/nokogiri\",\n        \"stored_elements\": [\n            {\n                \"stored_element\": {\n                    \"id\": 1,\n                    \"tag\": \"h1\",\n                    \"html\": \"\u003ch1 class=\\\"public \\\"\u003e\\n  \u003csvg aria-hidden=\\\"true\\\" class=\\\"octicon octicon-repo\\\" height=\\\"16\\\" version=\\\"1.1\\\" viewbox=\\\"0 0 12 16\\\" width=\\\"12\\\"\u003e\u003cpath fill-rule=\\\"evenodd\\\" d=\\\"M4 9H3V8h1v1zm0-3H3v1h1V6zm0-2H3v1h1V4zm0-2H3v1h1V2zm8-1v12c0 .55-.45 1-1 1H6v2l-1.5-1.5L3 16v-2H1c-.55 0-1-.45-1-1V1c0-.55.45-1 1-1h10c.55 0 1 .45 1 1zm-1 10H1v2h2v-1h3v1h5v-2zm0-10H2v9h9V1z\\\"\u003e\u003c/path\u003e\u003c/svg\u003e\\n  \u003cspan class=\\\"author\\\" itemprop=\\\"author\\\"\u003e\u003ca href=\\\"/sparklemotion\\\" class=\\\"url fn\\\" rel=\\\"author\\\"\u003esparklemotion\u003c/a\u003e\u003c/span\u003e\u003c!--\\n--\u003e\u003cspan class=\\\"path-divider\\\"\u003e/\u003c/span\u003e\u003c!--\\n--\u003e\u003cstrong itemprop=\\\"name\\\"\u003e\u003ca href=\\\"/sparklemotion/nokogiri\\\" data-pjax=\\\"#js-repo-pjax-container\\\"\u003enokogiri\u003c/a\u003e\u003c/strong\u003e\\n\\n\u003c/h1\u003e\",\n                    \"content\": \"\\n  \\n  sparklemotion/nokogiri\\n\\n\",\n                    \"href\": null\n                }\n            },\n            {\n                \"stored_element\": {\n                    \"id\": 2,\n                    \"tag\": \"h1\",\n                    \"html\": \"\u003ch1\u003e\\n\u003ca href=\\\"#nokogiri\\\" aria-hidden=\\\"true\\\" class=\\\"anchor\\\" id=\\\"user-content-nokogiri\\\"\u003e\u003csvg aria-hidden=\\\"true\\\" class=\\\"octicon octicon-link\\\" 
height=\\\"16\\\" version=\\\"1.1\\\" viewbox=\\\"0 0 16 16\\\" width=\\\"16\\\"\u003e\u003cpath fill-rule=\\\"evenodd\\\" d=\\\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\\\"\u003e\u003c/path\u003e\u003c/svg\u003e\u003c/a\u003eNokogiri\u003c/h1\u003e\",\n                    \"content\": \"Nokogiri\",\n                    \"href\": null\n                }\n            }\n        ],\n        \"stored_tags\": [\n            \"h1\"\n        ]\n    }\n}\n```\n\n\n* `GET http://HOST_URL/api/v1/pages.json?id=STORED_URL`\n  Return the stored information for a URL.\n#### Request\n\n| Name        | Description           | Example  |\n| ------------- |:-------------:| -----:|\n| id      | URL whose stored content you want to retrieve | https://github.com/sparklemotion/nokogiri |\n\n#### Response\n\n| Name        | Description           |\n| ------------- |:-------------:|\n| id      | Unique database identifier |\n| url      | URL scanned |\n| stored_tags      | Array of indexed tags |\n| stored_elements      | Array of elements for each tag |\n| stored_elements[id]      | Unique database identifier of the element |\n| stored_elements[tag]      | HTML tag the element belongs to |\n| stored_elements[html]      | Raw HTML of the element, including its markup |\n| stored_elements[content]      | Text of the element as a user sees it on the page |\n| stored_elements[href]      |Element href url. 
Only for links (a)|\n\n#### Example\n##### Request example\n`GET http://HOST_URL/api/v1/pages.json?id=https://github.com/sparklemotion/nokogiri`\n\nparams\n```json\n{ \"id\": \"https://github.com/sparklemotion/nokogiri\" }\n```\n##### Response example\n```json\n{\n    \"page\": {\n        \"id\": 1,\n        \"url\": \"https://github.com/sparklemotion/nokogiri\",\n        \"stored_elements\": [\n            {\n                \"stored_element\": {\n                    \"id\": 1,\n                    \"tag\": \"h1\",\n                    \"html\": \"\u003ch1 class=\\\"public \\\"\u003e\\n  \u003csvg aria-hidden=\\\"true\\\" class=\\\"octicon octicon-repo\\\" height=\\\"16\\\" version=\\\"1.1\\\" viewbox=\\\"0 0 12 16\\\" width=\\\"12\\\"\u003e\u003cpath fill-rule=\\\"evenodd\\\" d=\\\"M4 9H3V8h1v1zm0-3H3v1h1V6zm0-2H3v1h1V4zm0-2H3v1h1V2zm8-1v12c0 .55-.45 1-1 1H6v2l-1.5-1.5L3 16v-2H1c-.55 0-1-.45-1-1V1c0-.55.45-1 1-1h10c.55 0 1 .45 1 1zm-1 10H1v2h2v-1h3v1h5v-2zm0-10H2v9h9V1z\\\"\u003e\u003c/path\u003e\u003c/svg\u003e\\n  \u003cspan class=\\\"author\\\" itemprop=\\\"author\\\"\u003e\u003ca href=\\\"/sparklemotion\\\" class=\\\"url fn\\\" rel=\\\"author\\\"\u003esparklemotion\u003c/a\u003e\u003c/span\u003e\u003c!--\\n--\u003e\u003cspan class=\\\"path-divider\\\"\u003e/\u003c/span\u003e\u003c!--\\n--\u003e\u003cstrong itemprop=\\\"name\\\"\u003e\u003ca href=\\\"/sparklemotion/nokogiri\\\" data-pjax=\\\"#js-repo-pjax-container\\\"\u003enokogiri\u003c/a\u003e\u003c/strong\u003e\\n\\n\u003c/h1\u003e\",\n                    \"content\": \"\\n  \\n  sparklemotion/nokogiri\\n\\n\",\n                    \"href\": null\n                }\n            },\n            {\n                \"stored_element\": {\n                    \"id\": 2,\n                    \"tag\": \"h1\",\n                    \"html\": \"\u003ch1\u003e\\n\u003ca href=\\\"#nokogiri\\\" aria-hidden=\\\"true\\\" class=\\\"anchor\\\" id=\\\"user-content-nokogiri\\\"\u003e\u003csvg aria-hidden=\\\"true\\\" 
class=\\\"octicon octicon-link\\\" height=\\\"16\\\" version=\\\"1.1\\\" viewbox=\\\"0 0 16 16\\\" width=\\\"16\\\"\u003e\u003cpath fill-rule=\\\"evenodd\\\" d=\\\"M4 9h1v1H4c-1.5 0-3-1.69-3-3.5S2.55 3 4 3h4c1.45 0 3 1.69 3 3.5 0 1.41-.91 2.72-2 3.25V8.59c.58-.45 1-1.27 1-2.09C10 5.22 8.98 4 8 4H4c-.98 0-2 1.22-2 2.5S3 9 4 9zm9-3h-1v1h1c1 0 2 1.22 2 2.5S13.98 12 13 12H9c-.98 0-2-1.22-2-2.5 0-.83.42-1.64 1-2.09V6.25c-1.09.53-2 1.84-2 3.25C6 11.31 7.55 13 9 13h4c1.45 0 3-1.69 3-3.5S14.5 6 13 6z\\\"\u003e\u003c/path\u003e\u003c/svg\u003e\u003c/a\u003eNokogiri\u003c/h1\u003e\",\n                    \"content\": \"Nokogiri\",\n                    \"href\": null\n                }\n            }\n        ],\n        \"stored_tags\": [\n            \"h1\"\n        ]\n    }\n}\n```\n\n\n### Credits\n\n* [Carlos Torres](https://github.com/katorres02) author\n* [Nokogiri Gem](https://github.com/sparklemotion/nokogiri)\n\n### License\n\nApache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkatorres02%2Fruby-content-parser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkatorres02%2Fruby-content-parser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkatorres02%2Fruby-content-parser/lists"}