Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/michaeltelford/broken_link_finder
Finds a website's broken links and reports back to you with a summary
broken-link-finder broken-links links ruby website wgit
- Host: GitHub
- URL: https://github.com/michaeltelford/broken_link_finder
- Owner: michaeltelford
- License: mit
- Created: 2017-01-07T12:57:42.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2024-01-19T13:11:37.000Z (12 months ago)
- Last Synced: 2024-04-22T13:20:44.726Z (9 months ago)
- Topics: broken-link-finder, broken-links, links, ruby, website, wgit
- Language: Ruby
- Homepage: https://rubygems.org/gems/broken_link_finder
- Size: 132 KB
- Stars: 5
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
- project-awesome - michaeltelford/broken_link_finder - Finds a website's broken links and reports back to you with a summary (Ruby)
# Broken Link Finder
Does what it says on the tin - finds a website's broken links.
Simply point it at a website and it will crawl all of its webpages, searching for and identifying broken links. You will then be presented with a concise summary of any broken links found.
Broken Link Finder is multi-threaded and uses `libcurl` under the hood, so it's fast!
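As a rough illustration of why multi-threading helps here, the sketch below checks a list of links using a small pool of worker threads. It is a generic Ruby pattern, not the gem's actual implementation: `link_broken?`, `check_links_concurrently` and `worker_count` are hypothetical names, and the check itself is a stub so the example runs without network access.

```ruby
# Hypothetical stand-in for a real HTTP check: any link containing
# "missing" is treated as broken so the example needs no network access.
def link_broken?(link)
  link.include?('missing')
end

# Checks the links using a fixed number of worker threads (assumed default: 4).
# Returns a Hash of link => true/false (broken or not).
def check_links_concurrently(links, worker_count: 4)
  queue = Queue.new
  links.each { |link| queue << link }
  queue.close # Once closed and drained, Queue#pop returns nil.

  results = {}
  mutex = Mutex.new

  workers = Array.new(worker_count) do
    Thread.new do
      while (link = queue.pop)
        broken = link_broken?(link)
        # Guard the shared Hash so concurrent writes cannot interleave.
        mutex.synchronize { results[link] = broken }
      end
    end
  end

  workers.each(&:join)
  results
end

results = check_links_concurrently(
  ['http://example.com/about', 'http://example.com/missing', 'http://example.com/how']
)
results.each { |link, broken| puts "#{link}: #{broken ? 'broken' : 'ok'}" }
```

Because link checking is dominated by waiting on network responses, several threads can make progress while others are blocked, which is where most of the speed-up comes from.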
## How It Works
Any HTML element within `<body>` with a `href` or `src` attribute is considered a link (this is [configurable](#Link-Extraction) however).
For each link on a given page, the link is considered broken if any of the following conditions is met:
- An empty HTML response body is returned.
- A response status code of `404 Not Found` is returned.
- The HTML response body doesn't contain an element ID matching that of the link's fragment e.g. `http://server.com#about` must contain an element with `id="about"` or the link is considered broken.
- The link redirects more than 5 times consecutively.

**Note**: Not all link types are supported.
In a nutshell, only HTTP(S) based links can be successfully verified by `broken_link_finder`. As a result some links on a page might be (recorded and) ignored. You should verify these links yourself manually. Examples of unsupported link types include `tel:*`, `mailto:*`, `ftp://*` etc.
See the [usage](#Usage) section below on how to check which links have been ignored during a crawl.
With that said, the usual array of HTTP URL features are supported, including anchors/fragments, query strings and IRIs (non-ASCII URLs).
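The rules in this section can be sketched in plain Ruby as follows. This is a simplified illustration under stated assumptions, not the gem's code: `Response`, `MAX_REDIRECTS`, `supported_link?`, `fragment_present?` and `broken?` are hypothetical names, the response is a hand-built value object rather than a real HTTP response, and the element ID lookup uses a regular expression where a real crawler would parse the HTML.

```ruby
# Hypothetical value object standing in for a fetched HTTP response.
Response = Struct.new(:status, :body, :redirect_count, keyword_init: true)

MAX_REDIRECTS = 5

# Only HTTP(S) links can be verified. Links without a scheme (relative
# links) are assumed to resolve against the HTTP(S) page they appear on.
def supported_link?(link)
  scheme = link[/\A([a-z][a-z0-9+.-]*):/i, 1]
  scheme.nil? || %w[http https].include?(scheme.downcase)
end

# Simplified check for an element whose id attribute equals the fragment.
def fragment_present?(body, fragment)
  body.match?(/(?<![\w-])id\s*=\s*["']#{Regexp.escape(fragment)}["']/)
end

# Applies the four "broken" conditions listed above to a single link.
def broken?(link, response)
  return true if response.redirect_count > MAX_REDIRECTS
  return true if response.status == 404
  return true if response.body.nil? || response.body.empty?

  fragment = link[/#(.+)\z/, 1]
  return false if fragment.nil?

  !fragment_present?(response.body, fragment)
end

page = Response.new(status: 200, body: '<h1 id="about">About</h1>', redirect_count: 0)

puts broken?('http://server.com#about', page) # the fragment has a matching id
puts broken?('http://server.com#team', page)  # no element with id="team"
puts supported_link?('mailto:someone@example.com')
```

Links for which `supported_link?` returns `false` correspond to the ignored links described above: they are never passed to a check like `broken?`, only recorded for manual review.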
## Made Possible By
`broken_link_finder` relies heavily on the `wgit` Ruby gem by the same author. See its [repository](https://github.com/michaeltelford/wgit) for more details.
## Installation
Only MRI Ruby is tested and supported, but `broken_link_finder` may work with other Ruby implementations.
Currently, the required MRI Ruby version is:
`ruby '>= 2.6', '< 4'`
### Using Bundler
$ bundle add broken_link_finder
### Using RubyGems
$ gem install broken_link_finder
### Verify
$ broken_link_finder version
## Usage
You can check for broken links via the executable or library.
### Executable
Installing this gem installs the `broken_link_finder` executable into your `$PATH`. The executable allows you to find broken links from your command line. For example:
$ broken_link_finder crawl http://txti.es
Adding the `--recursive` flag would crawl the entire `txti.es` site, not just its index page.
See the [output](#Output) section below for an example of a site with broken links.
You can peruse all of the available executable flags with:
$ broken_link_finder help crawl
### Library
Below is a simple script which crawls a website and outputs its broken links to `STDOUT`:
> main.rb
```ruby
require 'broken_link_finder'

finder = BrokenLinkFinder.new
finder.crawl_site 'http://txti.es' # Or use Finder#crawl_page for a single webpage.
finder.report                      # Or use Finder#broken_links and Finder#ignored_links
                                   # for direct access to the link Hashes.
```

Then execute the script with:
$ ruby main.rb
See the full source code documentation [here](https://www.rubydoc.info/gems/broken_link_finder).
## Output
If broken links are found then the output will look something like:
```text
Crawled http://txti.es
7 page(s) containing 32 unique link(s) in 6.82 seconds

Found 6 unique broken link(s) across 2 page(s):

The following broken links were found on 'http://txti.es/about':
http://twitter.com/thebarrytone
/doesntexist
http://twitter.com/nwbld
twitter.com/txties

The following broken links were found on 'http://txti.es/how':
http://en.wikipedia.org/wiki/Markdown
http://imgur.com

Ignored 3 unique unsupported link(s) across 2 page(s), which you should check manually:

The following links were ignored on 'http://txti.es':
tel:+13174562564
mailto:[email protected]

The following links were ignored on 'http://txti.es/contact':
ftp://server.com
```

You can provide the `--html` flag if you'd prefer a HTML based report.
## Link Extraction
You can customise the XPath used to extract links from each crawled page. This can be done via the executable or library.
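To get a feel for what an XPath expression such as `//img/@src` selects, the sketch below evaluates two expressions against a small, well-formed sample page using REXML, which ships with Ruby. It only illustrates XPath attribute selection; it is not how the gem parses pages, and `extract_links` and the sample markup are made up for the example.

```ruby
require 'rexml/document'

# Well-formed sample markup. REXML is an XML parser, so messy real-world
# HTML would need a more forgiving parser.
html = <<~HTML
  <html>
    <body>
      <a href="/about">About</a>
      <img src="/logo.png" alt="Logo"/>
      <img src="http://example.com/banner.png" alt="Banner"/>
    </body>
  </html>
HTML

doc = REXML::Document.new(html)

# Hypothetical helper: returns the attribute values matched by the XPath.
def extract_links(doc, xpath)
  REXML::XPath.match(doc, xpath).map(&:value)
end

image_links  = extract_links(doc, '//img/@src')
anchor_links = extract_links(doc, '//a/@href')

puts "Image links:  #{image_links.inspect}"
puts "Anchor links: #{anchor_links.inspect}"
```

An expression ending in `@src` or `@href` selects the attribute values themselves, which is why narrowing the XPath narrows the set of links that get checked.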
### Executable
Add the `--xpath` (or `-x`) flag to the crawl command e.g.
$ broken_link_finder crawl http://txti.es -x //img/@src
### Library
Set the desired XPath using the accessor methods provided:
> main.rb
```ruby
require 'broken_link_finder'

# Set your desired xpath before crawling...
BrokenLinkFinder::link_xpath = '//img/@src'

# Now crawl as normal and only your custom targeted links will be checked.
BrokenLinkFinder.new.crawl_page 'http://txti.es'

# Go back to using the default provided xpath as needed.
BrokenLinkFinder::link_xpath = BrokenLinkFinder::DEFAULT_LINK_XPATH
```

## Contributing
Bug reports and feature requests are welcome on [GitHub](https://github.com/michaeltelford/broken_link_finder). Just raise an issue.
## License
The gem is available as open source under the terms of the [MIT License](http://opensource.org/licenses/MIT).
## Development
After checking out the repo, run `bin/setup` to install dependencies. Then, run `bundle exec rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`.
To release a new gem version:
- Update the deps in the `*.gemspec`, if necessary.
- Update the version number in `version.rb` and add the new version to the `CHANGELOG`.
- Run `bundle install`.
- Run `bundle exec rake test` ensuring all tests pass.
- Run `bundle exec rake compile` ensuring no warnings.
- Run `bundle exec rake install && rbenv rehash`.
- Manually test the executable.
- Commit any changes and merge your dev branch into `master`. Push `master` to `origin`.
- Run `bundle exec rake release[origin]`.