https://github.com/freaky/title_fetcher

(Hopefully) robust website <title> extractor

Host: GitHub
URL: https://github.com/freaky/title_fetcher
Owner: Freaky
Created: 2015-09-13T23:11:31.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2015-09-13T23:16:22.000Z (almost 11 years ago)
Last Synced: 2025-03-02T20:30:41.712Z (over 1 year ago)
Language: Ruby
Size: 141 KB
Stars: 0
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# TitleFetcher - Robust web title tag fetcher

TitleFetcher fetches title elements from websites.

    require 'titlefetcher'

    tf = TitleFetcher.new
    tf.fetch "http://freshbsd.org" # => "FreshBSD - The latest BSD Commits"

It tries to guess the document's charset and converts/scrubs the text to
UTF-8, making some effort to correct common mojibake issues.
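
As a rough illustration of the convert/scrub step, here is a minimal sketch that uses only Ruby's standard `String` encoding methods. It is not TitleFetcher's code: the library relies on charlock_holmes and mojibake, and the helper names `to_utf8` and `repair_mojibake` are hypothetical. The sketch assumes the charset guess is supplied by the caller, and it only handles the common case of UTF-8 text that was mis-decoded as ISO-8859-1.

```ruby
# Hypothetical helpers for illustration only; not part of TitleFetcher.

# Convert raw bytes to valid UTF-8. `guessed_encoding` stands in for the
# result of charset detection; with no guess the bytes are treated as UTF-8.
def to_utf8(bytes, guessed_encoding = nil)
  str = bytes.dup.force_encoding(guessed_encoding || Encoding::UTF_8)
  # Transcode, replacing bytes that are invalid or have no UTF-8 mapping.
  str = str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  # Scrub covers the case where the input was already tagged as UTF-8.
  str.scrub
end

# Undo one round of "UTF-8 read as ISO-8859-1" mojibake, e.g. "cafÃ©".
# Returns the input unchanged when the repair does not yield valid UTF-8.
def repair_mojibake(str)
  candidate = str.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : str
rescue EncodingError
  # Characters outside ISO-8859-1 mean this was not that kind of mojibake.
  str
end
```

Returning the original string whenever the repair fails keeps the helper conservative: text that is already correct passes through untouched.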

Default timeouts for socket operations are 10 seconds, and it gives up
searching a document for title tags after the first 128KiB. Documents
with content-types other than `text/*` and `application/{xml,xhtml}` are
not scanned.
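
The content-type filter and the 128KiB cut-off can be sketched as below. This is an assumption-laden illustration, not the library's implementation: `scannable_content_type?` and `find_title` are hypothetical names, the regular expression stands in for the real HTML parsing (the library depends on oga), and the prefix match on `application/xml` and `application/xhtml` is one reading of the rule above (it accepts `application/xhtml+xml`).

```ruby
require 'stringio'

SCAN_LIMIT = 128 * 1024 # 128KiB, the documented scan limit

# Hypothetical check: accept text/* and application/{xml,xhtml}-style types,
# ignoring parameters such as "; charset=utf-8".
def scannable_content_type?(content_type)
  return false if content_type.nil?
  mime = content_type.split(';').first.to_s.strip.downcase
  mime.start_with?('text/') || !(mime =~ %r{\Aapplication/(xml|xhtml)}).nil?
end

# Hypothetical scan: read at most `limit` bytes from an IO-like object and
# return the text of the first <title> element, or nil if none is found.
def find_title(io, limit = SCAN_LIMIT)
  head = io.read(limit).to_s
  match = head.match(%r{<title[^>]*>(.*?)</title>}mi)
  match && match[1].strip
end

body = StringIO.new("<html><head><title> Example </title></head></html>")
puts find_title(body) if scannable_content_type?("text/html; charset=utf-8")
```

Reading a bounded prefix rather than the whole body caps memory and time on very large or endless responses, at the cost of missing titles that appear past the limit.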

TitleFetcher objects are expected to be safe to share across threads.

## Dependencies

* Ruby 2.0+
* charlock_holmes(-jruby)
* mojibake
* http.rb
* oga

## TODO

* Stricter per-request timeout.
* Factor in HTTP and HTML headers to charset detection.
* HEAD request to check Content-Type without inviting a bunch of other data.
* Fall back to searching for h1 tags or so.
* Tests.
* Documentation.
* Release as a gem.
* Use it in something.
