{"id":13411969,"url":"https://github.com/jaimeiniesta/metainspector","last_synced_at":"2025-04-09T03:09:29.621Z","repository":{"id":409601,"uuid":"28706","full_name":"jaimeiniesta/metainspector","owner":"jaimeiniesta","description":"Ruby gem for web scraping purposes. It scrapes a given URL, and returns you its title, meta description, meta keywords, links, images...","archived":false,"fork":false,"pushed_at":"2024-06-15T23:23:24.000Z","size":1148,"stargazers_count":1032,"open_issues_count":27,"forks_count":165,"subscribers_count":26,"default_branch":"master","last_synced_at":"2024-10-29T15:48:48.733Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://github.com/metainspector/metainspector","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"intellectsoft-uk/MssqlBundle","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jaimeiniesta.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"MIT-LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"jaimeiniesta"}},"created_at":"2008-06-25T21:49:27.000Z","updated_at":"2024-10-23T00:49:31.000Z","dependencies_parsed_at":"2023-07-07T04:17:01.089Z","dependency_job_id":"bb351def-952c-4997-bd62-35b60cd24ddc","html_url":"https://github.com/jaimeiniesta/metainspector","commit_stats":{"total_commits":629,"total_committers":46,"mean_commits":"13.673913043478262","dds":"0.41017488076311603","last_synced_commit":"68142829435abba90a752b4fa8a6b624023c3459"},"previous_names":["metainspector/metainspector"],"tags_count":117,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaimeinie
sta%2Fmetainspector","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaimeiniesta%2Fmetainspector/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaimeiniesta%2Fmetainspector/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jaimeiniesta%2Fmetainspector/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jaimeiniesta","download_url":"https://codeload.github.com/jaimeiniesta/metainspector/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247968280,"owners_count":21025822,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T20:01:19.606Z","updated_at":"2025-04-09T03:09:29.600Z","avatar_url":"https://github.com/jaimeiniesta.png","language":"Ruby","funding_links":["https://github.com/sponsors/jaimeiniesta"],"categories":["Ruby","Web Crawling","Web Apps, Services \u0026 Interaction"],"sub_categories":["Web Content Scrapers"],"readme":"# MetaInspector\n[![Gem Version](https://badge.fury.io/rb/metainspector.svg)](http://badge.fury.io/rb/metainspector) [![CircleCI](https://circleci.com/gh/jaimeiniesta/metainspector.svg?style=svg)](https://circleci.com/gh/jaimeiniesta/metainspector) [![Code Climate](https://codeclimate.com/github/jaimeiniesta/metainspector/badges/gpa.svg)](https://codeclimate.com/github/jaimeiniesta/metainspector) [![Mentioned in Awesome Ruby](https://awesome.re/mentioned-badge.svg)](https://github.com/markets/awesome-ruby)\n\nMetaInspector is a gem for web scraping 
purposes.\n\nYou give it a URL, and it lets you easily get its title, links, images, charset, description, keywords, meta tags...\n\n## Installation\n\nInstall the gem from RubyGems:\n\n```bash\ngem install metainspector\n```\n\nIf you're using it in a Rails application, just add it to your Gemfile and run `bundle install`:\n\n```ruby\ngem 'metainspector'\n```\n\nSupported Ruby versions are defined in [`.circleci/config.yml`](.circleci/config.yml).\n\n## Usage\n\nInitialize a MetaInspector instance for a URL, like this:\n\n```ruby\npage = MetaInspector.new('http://sitevalidator.com')\n```\n\nIf you don't include the scheme in the URL, `http://` will be used by default:\n\n```ruby\npage = MetaInspector.new('sitevalidator.com')\n```\n\nYou can also pass the HTML to be used as the document to scrape:\n\n```ruby\npage = MetaInspector.new(\"http://sitevalidator.com\",\n                         :document =\u003e \"\u003chtml\u003e...\u003c/html\u003e\")\n```\n\n## Accessing response\n\nYou can check the status and headers from the response like this:\n\n```ruby\npage.response.status  # 200\npage.response.headers # { \"server\"=\u003e\"nginx\", \"content-type\"=\u003e\"text/html; charset=utf-8\",\n                      #   \"cache-control\"=\u003e\"must-revalidate, private, max-age=0\", ... }\n```\n\n## Accessing scraped data\n\n### URL\n\n```ruby\npage.url                 # URL of the page\npage.tracked?            # returns true if the url contains known tracking parameters\npage.untracked_url       # returns the url with the known tracking parameters removed\npage.untrack!            
# removes the known tracking parameters from the url\npage.scheme              # Scheme of the page (http, https)\npage.host                # Hostname of the page (like, sitevalidator.com, without the scheme)\npage.root_url            # Root url (scheme + host, like http://sitevalidator.com/)\n```\n\n### Head links\n\n```ruby\npage.head_links          # an array of hashes of all head/links\npage.stylesheets         # an array of hashes of all head/links where rel='stylesheet'\npage.canonicals          # an array of hashes of all head/links where rel='canonical'\npage.feeds               # Get rss or atom links in meta data fields as array of hash in the form { href: \"...\", title: \"...\", type: \"...\" }\n```\n\n### Texts\n\n```ruby\npage.title               # title of the page from the head section, as string\npage.best_title          # best title of the page, from a selection of candidates\npage.author              # author of the page from the meta author tag\npage.best_author         # best author of the page, from a selection of candidates\npage.description         # returns the meta description\npage.best_description    # returns the first non-empty description between the following candidates: standard meta description, og:description, twitter:description, the first long paragraph\npage.h1                  # returns h1 text array\npage.h2                  # returns h2 text array\npage.h3                  # returns h3 text array\npage.h4                  # returns h4 text array\npage.h5                  # returns h5 text array\npage.h6                  # returns h6 text array\n```\n\n### Links\n\n```ruby\npage.links.raw           # every link found, unprocessed\npage.links.all           # every link found on the page as an absolute URL\npage.links.http          # every HTTP link found\npage.links.non_http      # every non-HTTP link found\npage.links.internal      # every internal link found on the page as an absolute URL\npage.links.external      # every 
external link found on the page as an absolute URL\n```\n\n### Images\n\n```ruby\npage.images              # enumerable collection, with every img found on the page as an absolute URL\npage.images.with_size    # a sorted array (by descending area) of [image_url, width, height]\npage.images.best         # Most relevant image, if defined with the og:image or twitter:image metatags. Fallback to the first page.images array element\npage.images.favicon      # absolute URL to the favicon\n```\n\n### Meta tags\n\nWhen it comes to meta tags, you have several options:\n\n```ruby\npage.meta_tags  # Gives you all the meta tags by type:\n                # (meta name, meta http-equiv, meta property and meta charset)\n                # As meta tags can be repeated (in the case of 'og:image', for example),\n                # the values returned will be arrays\n                #\n                # For example:\n                #\n                # {\n                    'name' =\u003e {\n                                'keywords'       =\u003e ['one, two, three'],\n                                'description'    =\u003e ['the description'],\n                                'author'         =\u003e ['Joe Sample'],\n                                'robots'         =\u003e ['index,follow'],\n                                'revisit'        =\u003e ['15 days'],\n                                'dc.date.issued' =\u003e ['2011-09-15']\n                              },\n\n                    'http-equiv' =\u003e {\n                                        'content-type'        =\u003e ['text/html; charset=UTF-8'],\n                                        'content-style-type'  =\u003e ['text/css']\n                                    },\n\n                    'property' =\u003e {\n                                    'og:title'        =\u003e ['An OG title'],\n                                    'og:type'         =\u003e ['website'],\n                                    'og:url'          
=\u003e ['http://example.com/meta-tags'],\n                                    'og:image'        =\u003e ['http://example.com/rock.jpg',\n                                                          'http://example.com/rock2.jpg',\n                                                          'http://example.com/rock3.jpg'],\n                                    'og:image:width'  =\u003e ['300'],\n                                    'og:image:height' =\u003e ['300', '1000']\n                                   },\n\n                    'charset' =\u003e ['UTF-8']\n                  }\n```\n\nAs this method returns a hash, you can also take only the key that you need, like in:\n\n```ruby\npage.meta_tags['property']  # Returns:\n                            # {\n                            #   'og:title'        =\u003e ['An OG title'],\n                            #   'og:type'         =\u003e ['website'],\n                            #   'og:url'          =\u003e ['http://example.com/meta-tags'],\n                            #   'og:image'        =\u003e ['http://example.com/rock.jpg',\n                            #                         'http://example.com/rock2.jpg',\n                            #                         'http://example.com/rock3.jpg'],\n                            #   'og:image:width'  =\u003e ['300'],\n                            #   'og:image:height' =\u003e ['300', '1000']\n                            # }\n```\n\nIn most cases you will only be interested in the first occurrence of a meta tag, so you can\nuse the singular form of that method:\n\n```ruby\npage.meta_tag['name']   # Returns:\n                        # {\n                        #   'keywords'       =\u003e 'one, two, three',\n                        #   'description'    =\u003e 'the description',\n                        #   'author'         =\u003e 'Joe Sample',\n                        #   'robots'         =\u003e 'index,follow',\n                        #   'revisit'        =\u003e '15 
days',\n                        #   'dc.date.issued' =\u003e '2011-09-15'\n                        # }\n```\n\nOr, as this is also a hash:\n\n```ruby\npage.meta_tag['name']['keywords']    # Returns 'one, two, three'\n```\n\nAnd finally, you can use the shorter `meta` method that will merge the different keys so you have\na simpler hash:\n\n```ruby\npage.meta   # Returns:\n            #\n            # {\n            #   'keywords'            =\u003e 'one, two, three',\n            #   'description'         =\u003e 'the description',\n            #   'author'              =\u003e 'Joe Sample',\n            #   'robots'              =\u003e 'index,follow',\n            #   'revisit'             =\u003e '15 days',\n            #   'dc.date.issued'      =\u003e '2011-09-15',\n            #   'content-type'        =\u003e 'text/html; charset=UTF-8',\n            #   'content-style-type'  =\u003e 'text/css',\n            #   'og:title'            =\u003e 'An OG title',\n            #   'og:type'             =\u003e 'website',\n            #   'og:url'              =\u003e 'http://example.com/meta-tags',\n            #   'og:image'            =\u003e 'http://example.com/rock.jpg',\n            #   'og:image:width'      =\u003e '300',\n            #   'og:image:height'     =\u003e '300',\n            #   'charset'             =\u003e 'UTF-8'\n            # }\n```\n\nThis way, you can get most meta tags just like that:\n\n```ruby\npage.meta['author']     # Returns \"Joe Sample\"\n```\n\nPlease be aware that all keys are converted to downcase, so it's `'dc.date.issued'` and not `'DC.date.issued'`.\n\n### Misc\n\n```ruby\npage.charset             # UTF-8\npage.content_type        # content-type returned by the server when the url was requested\n```\n\n## Other representations\n\nYou can also access most of the scraped data as a hash:\n\n```ruby\npage.to_hash    # { \"url\"   =\u003e \"http://sitevalidator.com\",\n                    \"title\" =\u003e \"MarkupValidator :: 
site-wide markup validation tool\", ... }\n```\n\nThe original document is accessible from:\n\n```ruby\npage.to_s         # A String with the contents of the HTML document\n```\n\nAnd the full scraped document is accessible from:\n\n```ruby\npage.parsed  # Nokogiri document that you can use to get any element from the page\n```\n\n## Options\n\n### Forced encoding\n\nIf you get a `MetaInspector::RequestError, \"invalid byte sequence in UTF-8\"` or similar error, you can try forcing the encoding like this:\n\n```ruby\npage = MetaInspector.new(url, :encoding =\u003e 'UTF-8')\n```\n\n### Timeout \u0026 Retries\n\nYou can specify two different timeouts when requesting a page:\n\n* `connection_timeout` sets the maximum number of seconds to wait to get a connection to the page.\n* `read_timeout` sets the maximum number of seconds to wait to read the page, once connected.\n\nBoth timeouts default to 20 seconds.\n\nYou can also specify the number of `retries`, which defaults to 3.\n\nFor example, this will time out after 10 seconds waiting for a connection, or after 5 seconds waiting\nto read its contents, and will retry 4 times:\n\n```ruby\npage = MetaInspector.new('www.google', :connection_timeout =\u003e 10, :read_timeout =\u003e 5, :retries =\u003e 4)\n```\n\nIf MetaInspector fails to fetch the page after it has exhausted its retries,\nit will raise `MetaInspector::TimeoutError`, which you can rescue in your\napplication code.\n\n```ruby\nbegin\n  page = MetaInspector.new(url)\nrescue MetaInspector::TimeoutError\n  enqueue_for_future_fetch_attempt(url)\n  render_simple(url)\nelse\n  render_rich(page)\nend\n```\n\n### Redirections\n\nBy default, MetaInspector will follow redirects (up to a limit of 10).\n\nIf you want to disallow redirects, you can do it like this:\n\n```ruby\npage = MetaInspector.new('facebook.com', :allow_redirections =\u003e false)\n```\n\nYou can also customize how many redirects you wish to allow:\n\n```ruby\npage = 
MetaInspector.new('facebook.com', :faraday_options =\u003e { redirect: { limit: 5 } })\n```\n\nAnd even customize what happens between redirects:\n\n```ruby\nrequire 'resolv'\nrequire 'ipaddr'\n\n# Reject any redirect whose target host resolves to a private IP address\ncallback = proc do |previous_response, next_request|\n  ip_address = Resolv.getaddress(next_request.url.host)\n  raise 'Invalid address' if IPAddr.new(ip_address).private?\nend\n\npage = MetaInspector.new(url, faraday_options: { redirect: { callback: callback } })\n```\n\n\nThe `faraday_options[:redirect]` hash is passed to the `FollowRedirects` middleware used by `Faraday`, so all of its options are available.\nSee them [here](https://github.com/lostisland/faraday_middleware/blob/main/lib/faraday_middleware/response/follow_redirects.rb#L44).\n\n### Headers\n\nBy default, the following headers are set:\n\n```ruby\n{\n  'User-Agent'      =\u003e \"MetaInspector/#{MetaInspector::VERSION} (+https://github.com/jaimeiniesta/metainspector)\",\n  'Accept-Encoding' =\u003e 'identity'\n}\n```\n\nThe `Accept-Encoding` header is set to `identity` to avoid exceptions being raised on servers that return malformed compressed responses, [as explained here](https://github.com/lostisland/faraday/issues/337).\n\nIf you want to override the default headers, use the `headers` option:\n\n```ruby\n# Set the User-Agent header\npage = MetaInspector.new('example.com', :headers =\u003e {'User-Agent' =\u003e 'My custom User-Agent'})\n```\n\n### Disabling SSL verification (or any other Faraday options)\n\nFaraday can be passed options via `:faraday_options`.\n\nThis is useful when we need to\ncustomize the way we request the page, for example to disable SSL verification:\n\n```ruby\nMetaInspector.new('https://example.com')\n# Faraday::SSLError: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed\n\nMetaInspector.new('https://example.com', faraday_options: { ssl: { verify: false } })\n# Now we can access the page\n```\n\n### Allow non-HTML content 
type\n\nMetaInspector will by default raise an exception when trying to parse a non-HTML URL (one that has a content-type different than text/html). You can disable this behaviour with:\n\n```ruby\npage = MetaInspector.new('sitevalidator.com', :allow_non_html_content =\u003e true)\n```\n\n```ruby\npage = MetaInspector.new('http://example.com/image.png')\npage.content_type  # \"image/png\"\npage.description   # will raise an exception\n\npage = MetaInspector.new('http://example.com/image.png', :allow_non_html_content =\u003e true)\npage.content_type  # \"image/png\"\npage.description   # will return a garbled string\n```\n\n### URL Normalization\n\nBy default, URLs are normalized using the Addressable gem. For example:\n\n```ruby\n# Normalization will add a default scheme and a trailing slash...\npage = MetaInspector.new('sitevalidator.com')\npage.url # http://sitevalidator.com/\n\n# ...and it will also convert international characters\npage = MetaInspector.new('http://www.詹姆斯.com')\npage.url # http://www.xn--8ws00zhy3a.com/\n```\n\nWhile this is generally useful, it can be [tricky](https://github.com/sporkmonger/addressable/issues/182) [sometimes](https://github.com/sporkmonger/addressable/issues/160).\n\nYou can disable URL normalization by passing the `normalize_url: false` option.\n\n### Image downloading\n\nWhen you ask for the largest image on the page with `page.images.largest`, it will be determined by its height and width attributes on the HTML markup, and also by downloading a small portion of each image using the [fastimage](https://github.com/sdsykes/fastimage) gem. 
This is really fast, as it doesn't download entire images, normally just the headers of the image files.\n\nIf you want to disable this, you can specify it like this:\n\n```ruby\npage = MetaInspector.new('http://example.com', download_images: false)\n```\n\n### Caching responses\n\nMetaInspector can be configured to use [Faraday::HttpCache](https://github.com/plataformatec/faraday-http-cache) to cache page responses. To do so, pass the `faraday_http_cache` option with at least the `:store` key, for example:\n\n```ruby\ncache = ActiveSupport::Cache.lookup_store(:file_store, '/tmp/cache')\npage = MetaInspector.new('http://example.com', faraday_http_cache: { store: cache })\n```\n\n## Exception Handling\n\nWeb page scraping is tricky; you can expect different exceptions while requesting the page or parsing its contents. MetaInspector wraps these exceptions in the following main errors:\n\n* `MetaInspector::TimeoutError`. When fetching a web page has taken too long.\n* `MetaInspector::RequestError`. When there has been an error in the request phase. Examples: page not found, SSL failure, invalid URI.\n* `MetaInspector::ParserError`. When there has been an error parsing the contents of the page.\n* `MetaInspector::NonHtmlError`. When the content of the page was not HTML. See also the `allow_non_html_content` option.\n\n## Examples\n\nYou can find some sample scripts in the `examples` folder, including a basic scraper and a spider that will follow external links using a queue. What follows is an example of use from irb:\n\n```ruby\n$ irb\n\u003e\u003e require 'metainspector'\n=\u003e true\n\n\u003e\u003e page = MetaInspector.new('http://sitevalidator.com')\n=\u003e #\u003cMetaInspector:0x11330c0 @url=\"http://sitevalidator.com\"\u003e\n\n\u003e\u003e page.title\n=\u003e \"MarkupValidator :: site-wide markup validation tool\"\n\n\u003e\u003e page.meta['description']\n=\u003e \"Site-wide markup validation tool. 
Validate the markup of your whole site with just one click.\"\n\n\u003e\u003e page.meta['keywords']\n=\u003e \"html, markup, validation, validator, tool, w3c, development, standards, free\"\n\n\u003e\u003e page.links.size\n=\u003e 15\n\n\u003e\u003e page.links[4]\n=\u003e \"/plans-and-pricing\"\n```\n\n## Contributing guidelines\n\nYou're more than welcome to fork this project and send pull requests. Just remember to:\n\n* Create a topic branch for your changes.\n* Add specs.\n* Keep your fake responses as small as possible. For each change in `spec/fixtures`, a comment should be included explaining why it's needed.\n* Update `README.md` if needed (for example, when you're adding or changing a feature).\n\nThanks to all the contributors:\n\n[https://github.com/jaimeiniesta/metainspector/graphs/contributors](https://github.com/jaimeiniesta/metainspector/graphs/contributors)\n\nYou can also come to chat with us on our [Gitter room](https://gitter.im/jaimeiniesta/metainspector) and [Google group](https://groups.google.com/forum/#!forum/metainspector).\n\n## Related projects\n\n* [go-metainspector](https://github.com/fern4lvarez/go-metainspector), a port of MetaInspector for Go.\n* [Node-MetaInspector](https://github.com/gabceb/node-metainspector), a port of MetaInspector for Node.\n* [MetaInvestigator](https://github.com/nekova/metainvestigator), a port of MetaInspector for Elixir.\n* [Funkspector](https://github.com/jaimeiniesta/funkspector), another port of MetaInspector for Elixir.\n\n## License\nMetaInspector is released under the [MIT license](MIT-LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaimeiniesta%2Fmetainspector","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjaimeiniesta%2Fmetainspector","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjaimeiniesta%2Fmetainspector/lists"}