{"id":16500471,"url":"https://github.com/carsonsgit/ruby-webscraper","last_synced_at":"2026-06-11T15:31:05.822Z","repository":{"id":254776760,"uuid":"847506461","full_name":"carsonSgit/Ruby-WebScraper","owner":"carsonSgit","description":null,"archived":false,"fork":false,"pushed_at":"2024-08-30T23:46:26.000Z","size":21,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-28T19:43:59.763Z","etag":null,"topics":["ruby","web-scraper"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/carsonSgit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-26T01:50:55.000Z","updated_at":"2025-08-10T19:26:43.000Z","dependencies_parsed_at":"2025-01-12T10:43:31.804Z","dependency_job_id":"373b032c-6da3-48e3-8fad-64076b41f7dc","html_url":"https://github.com/carsonSgit/Ruby-WebScraper","commit_stats":null,"previous_names":["carsonsgit/ruby-webscraper"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/carsonSgit/Ruby-WebScraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carsonSgit%2FRuby-WebScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carsonSgit%2FRuby-WebScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carsonSgit%2FRuby-WebScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carsonSgit%2FRuby-WebScraper/manifests","owner_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/owners/carsonSgit","download_url":"https://codeload.github.com/carsonSgit/Ruby-WebScraper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/carsonSgit%2FRuby-WebScraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34206487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ruby","web-scraper"],"created_at":"2024-10-11T14:57:41.543Z","updated_at":"2026-06-11T15:31:05.788Z","avatar_url":"https://github.com/carsonSgit.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv\u003e\n \u003cimg src=\"https://github.com/user-attachments/assets/34490945-20ba-4001-9a43-61bbd026f45f\" width=\"500\"/\u003e\n\u003c/div\u003e\n\n# Web Scraper 🤖\nLearning Ruby one step at a time...\n\n\u003e [!IMPORTANT]\n\u003e This is my introduction to Ruby! I have experience with similar languages (i.e. `Python`, `Kotlin`, `JavaScript`, etc.) 
but had never even seen Ruby's syntax before making this, so it may not be my best work 😢.\n\u003e\u003e A wise man once told me, \"Learn Ruby, you'll like it.\"\n\n## Overview 🌍\n\nThis simple web scraper takes a `URL` (passed as a command-line argument or typed at the prompt), collects every absolute **hyperlink** on that page (any `href` starting with `http`), and writes them to a **CSV** (`scrapedData.csv` by default). It could be integrated into a larger project, such as a machine learning pipeline, that needs a routinely updated **CSV**. To scrape something other than links, change the selector passed to the `Nokogiri` `doc` object, just like in any other scraper.\n\n## How to Use 🔧\n\n1. Install the necessary gems:\n    \u003e These are the pre-built libraries this web scraper relies on\n   ```bash\n   gem install nokogiri\n   gem install csv\n   ```\n\n2. Run the scraper script, shown below. The URL and the output file name can optionally be passed as the first and second arguments:\n   ```ruby\n    require 'nokogiri'\n    require 'open-uri'\n    require 'csv'\n    require 'uri'\n\n    puts \"Please input the URL you want to scrape: \"\n\n    url = ARGV[0] || gets.chomp\n    output_file = ARGV[1] || 'scrapedData.csv'\n\n    begin\n    html = URI.open(url, \"User-Agent\" =\u003e \"Mozilla/5.0\")\n    doc = Nokogiri::HTML(html)\n\n    links = doc.css('a')\n    filtered_links = links.select { |link| link['href'] =~ /^http/ }\n\n    CSV.open(output_file, 'wb') do |csv|\n        csv \u003c\u003c ['Index', 'Title', 'Link']\n        filtered_links.each_with_index do |link, index|\n        title = link.text.strip.empty? ? 
\"No Title\" : link.text.strip\n        absolute_link = URI.join(url, link['href']).to_s\n        csv \u003c\u003c [index + 1, title, absolute_link]\n        end\n    end\n\n    puts \"Total links found: #{filtered_links.size}\"\n    puts \"Links saved to #{output_file}\"\n\n    rescue OpenURI::HTTPError =\u003e e\n        puts \"HTTP Error: #{e.message}\"\n    rescue CSV::MalformedCSVError =\u003e e\n        puts \"CSV Error: #{e.message}\"\n    rescue StandardError =\u003e e\n        puts \"An error occurred: #{e.message}\"\n    end\n   ```\n\n   \u003e Here are some sample URLs you could use:\n   \u003e\u003e ```\n   \u003e\u003e https://www.bbc.com/news\n   \u003e\u003e https://www.theweathernetwork.com/en\n   \u003e\u003e https://github.com/\n   \u003e\u003e ```\n\n3. Check the generated `scrapedData.csv` file for the scraped hyperlinks.\n\n## Sample Outputs 📊\n\nAfter running the scraper on a sample URL, your `scrapedData.csv` might look like this:\n\n```\nIndex,Title,Link\n1,Audio,https://www.bbc.co.uk/sounds\n2,Weather,https://www.bbc.com/weather\n3,Newsletters,https://www.bbc.com/newsletters\n```\n\n## Resources Used 📚 \n\n- [Ruby Docs](https://www.ruby-lang.org/en/documentation/): General Ruby docs (install, syntax, etc.).\n- [Nokogiri](https://nokogiri.org/index.html#parsing-and-querying): Web scraping parser.\n- [Gets Method](https://www.codecademy.com/resources/docs/ruby/user-input): How Ruby gets user input.\n- [Ruby CSV](https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html): Ruby CSV documentation.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarsonsgit%2Fruby-webscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcarsonsgit%2Fruby-webscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcarsonsgit%2Fruby-webscraper/lists"}