https://github.com/freaky/title_fetcher
(Hopefully) robust website <title> extractor
- Host: GitHub
- URL: https://github.com/freaky/title_fetcher
- Owner: Freaky
- Created: 2015-09-13T23:11:31.000Z (over 10 years ago)
- Default Branch: master
- Last Pushed: 2015-09-13T23:16:22.000Z (over 10 years ago)
- Last Synced: 2025-01-13T07:32:07.886Z (over 1 year ago)
- Language: Ruby
- Size: 141 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# TitleFetcher - Robust web title tag fetcher
TitleFetcher fetches title elements from websites.

    require 'titlefetcher'
    tf = TitleFetcher.new
    tf.fetch "http://freshbsd.org" # => "FreshBSD - The latest BSD Commits"

It tries to guess the document's charset and converts/scrubs the title to
UTF-8, making some effort to correct common mojibake issues.
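
The convert-and-scrub step can be illustrated with a minimal, standard-library-only sketch. It is not TitleFetcher's implementation: the library depends on charlock_holmes and mojibake for detection and repair, whereas this stand-in simply tries a short list of candidate encodings. The helper name `to_utf8`, its `declared` parameter, and the candidate order are all assumptions made for illustration.

```ruby
# Hypothetical helper, not part of TitleFetcher: normalise raw bytes to UTF-8.
# `declared` is an assumed charset label, e.g. one taken from an HTTP header.
def to_utf8(bytes, declared = nil)
  candidates = [declared, 'UTF-8', 'Windows-1252'].compact

  candidates.each do |name|
    begin
      text = bytes.dup.force_encoding(name)
      # Only accept a candidate if the bytes are well-formed in it.
      return text.encode('UTF-8') if text.valid_encoding?
    rescue ArgumentError, EncodingError
      next # unknown charset label or unconvertible byte: try the next one
    end
  end

  # Last resort: treat the bytes as UTF-8 and replace invalid sequences.
  bytes.dup.force_encoding('UTF-8').scrub('?')
end

to_utf8("caf\xE9".b)               # Latin-1 style bytes, no declared charset
to_utf8("caf\xC3\xA9".b, 'UTF-8')  # already UTF-8
```

Trying UTF-8 before a single-byte encoding matters because nearly any byte sequence is valid in a single-byte encoding, so putting one first would mask genuine UTF-8 input.
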
The default timeout for socket operations is 10 seconds, and it gives up
searching a document for title tags after the first 128KiB. Documents
with content types other than `text/*` and `application/{xml,xhtml}` are
not scanned.
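
A rough sketch of the two limits described above, the content-type filter and the 128KiB scan cut-off, might look like the following. It uses only the standard library and a regular expression in place of a real HTML parser (the library itself depends on oga), so it is an illustration and not the library's code. The names `scannable?`, `find_title`, `SCAN_LIMIT`, and the chunk size are assumptions, as is the reading of `application/{xml,xhtml}` as a prefix match that also accepts `application/xhtml+xml`.

```ruby
require 'stringio'

SCAN_LIMIT = 128 * 1024 # 128KiB, the cut-off described above
SCANNABLE  = %r{\A(?:text/|application/(?:xml|xhtml))}i
TITLE_RE   = %r{<title[^>]*>(.*?)</title>}im

# Hypothetical helper: should a response with this Content-Type be scanned?
def scannable?(content_type)
  mime = content_type.to_s.split(';').first.to_s.strip
  !(SCANNABLE =~ mime).nil?
end

# Hypothetical helper: read `io` in chunks and return the first <title>
# text, or nil once `limit` bytes have been read without finding one.
def find_title(io, limit = SCAN_LIMIT, chunk_size = 4096)
  buffer = String.new # binary buffer, so partial multibyte characters are harmless
  while buffer.bytesize < limit
    chunk = io.read([chunk_size, limit - buffer.bytesize].min)
    break if chunk.nil? # end of stream
    buffer << chunk
    match = buffer.match(TITLE_RE)
    return match[1].strip if match
  end
  nil
end

page = StringIO.new('<html><head><title> Example </title></head></html>')
find_title(page) if scannable?('text/html; charset=utf-8')
```

Matching against the accumulated buffer instead of each chunk on its own means a title split across a chunk boundary is still found, and the byte cap bounds the total work.
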
TitleFetcher objects are expected to be safe to share across threads.
## Dependencies
* Ruby 2.0+
* charlock_holmes(-jruby)
* mojibake
* http.rb
* oga
## TODO
* Stricter per-request timeout.
* Factor in HTTP and HTML headers to charset detection.
* HEAD request to check Content-Type without inviting a bunch of other data.
* Fall back to searching for h1 tags or so.
* Tests.
* Documentation.
* Release as a gem.
* Use it in something.