{"id":13424858,"url":"https://github.com/DataHenHQ/till","last_synced_at":"2025-03-15T18:36:02.714Z","repository":{"id":47502885,"uuid":"340102737","full_name":"DataHenHQ/till","owner":"DataHenHQ","description":"DataHen Till is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. Integrates with any scraper in 5 minutes.","archived":false,"fork":false,"pushed_at":"2021-12-05T13:47:38.000Z","size":2137,"stargazers_count":815,"open_issues_count":1,"forks_count":23,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-02-08T07:53:28.921Z","etag":null,"topics":["crawler","man-in-the-middle","mitm","proxy-server","scraper","scraping","web-scraping"],"latest_commit_sha":null,"homepage":"https://till.datahen.com","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DataHenHQ.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-02-18T16:04:02.000Z","updated_at":"2025-01-08T07:14:06.000Z","dependencies_parsed_at":"2022-08-26T05:10:36.085Z","dependency_job_id":null,"html_url":"https://github.com/DataHenHQ/till","commit_stats":null,"previous_names":[],"tags_count":37,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHenHQ%2Ftill","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHenHQ%2Ftill/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHenHQ%2Ftill/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DataHenHQ%2Ftill/manifests",
"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DataHenHQ","download_url":"https://codeload.github.com/DataHenHQ/till/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243775942,"owners_count":20346293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","man-in-the-middle","mitm","proxy-server","scraper","scraping","web-scraping"],"created_at":"2024-07-31T00:01:00.211Z","updated_at":"2025-03-15T18:35:57.700Z","avatar_url":"https://github.com/DataHenHQ.png","language":"Go","funding_links":[],"categories":["Go"],"sub_categories":[],"readme":"\n\u003cimg align=\"left\" width=\"150\" style=\"padding:6px 23px 6px 0px;\" src=\"img/till-logo.svg\"\u003e **DataHen Till** is a companion tool to your existing web scraper that instantly makes it scalable, maintainable, and more unblockable, with minimal code changes on your scraper. 
Integrates with any scraper in 5 minutes.\n\n[![Alt text](https://img.youtube.com/vi/D1VBVYTRo8g/0.jpg)](https://www.youtube.com/watch?v=D1VBVYTRo8g)\n\nTill was architected to follow best practices that [DataHen](https://www.datahen.com) has accumulated over the years of scraping at a massive scale.\n\n![How it works](img/how-it-works.png)\n\n### Till easily integrates with your existing scrapers...\nwritten in languages such as:\n\n\u003cimg align=\"left\"  height=\"50\" title=\"Python\" style=\"padding:6px 23px 6px 0px;\" src=\"img/integrations/python.png\"\u003e\n\n\u003cimg align=\"left\" height=\"50\" title=\"Nodejs\" style=\"padding:6px 23px 6px 0px; \" src= \"img/integrations/nodejs.png\"\u003e\n\u003cimg align=\"left\" height=\"50\" title=\"Ruby\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/ruby.png\"\u003e\n\n\u003cimg align=\"left\" height=\"50\" title=\"Go\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/go.png\"\u003e\n\n\u003cimg height=\"50\" title=\"Java\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/java.png\"\u003e\n\nor frameworks such as:\n\n\u003cimg align=\"left\" height=\"50\" title=\"Scrapy\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/scrapy.png\"\u003e\n\u003cimg align=\"left\" height=\"80\" title=\"Puppeteer\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/puppeteer.png\"\u003e\n\n\u003cimg align=\"left\" height=\"50\" title=\"Kimurai\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/kimurai.png\"\u003e\n\u003cimg align=\"left\" height=\"50\" title=\"Colly\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/colly.png\"\u003e\n\u003cimg align=\"left\" height=\"50\" title=\"Selenium\" style=\"padding:6px 23px 6px 0px;\" src= \"img/integrations/selenium.png\"\u003e\n\n\u003cbr clear=\"left\"/\u003e\n\n\nand many more...\n\n\n# Table of Contents\n\n* [Problems with Web Scraping](#problems-with-web-scraping)\n    * [Scaling Your 
Scraper](#scaling-your-scraper)\n    * [Blocked scraper](#blocked-scraper)\n    * [Scraper Maintenance](#scraper-maintenance)\n    * [Postmortem analysis \u0026 reproducability](#postmortem-analysis--reproducability)\n    * [Starting over from scratch when it fails mid-way](#starting-over-from-scratch-when-it-fails-mid-way)\n* [Features](#features)\n    * [User-Agent randomizer](#user-agent-randomizer)\n    * [Proxy IP address rotation](#proxy-ip-address-rotation)\n    * [Sticky Sessions](#sticky-sessions)\n    * [Managing Cookies](#managing-cookies)\n    * [Request Logging](#request-logging)\n    * [HTTP Caching](#http-caching)\n    * [Global ID (GID)](#global-id-gid)\n    * [Request Interceptions](#request-interceptions)\n* [How DataHen Till works](#how-datahen-till-works)\n* [Installation](#installation)\n* [Certificate Authority (CA) Certificates](#certificate-authority-ca-certificates)\n* [Till Integrations](#till-integrations)\n    * [Python](#python)\n        * Scrapy\n    * [Node.js](#nodejs)\n        * Plain\n        * Puppeteer\n    * [Go](#go)\n        * net/http\n        * Colly\n    * [Ruby](#ruby)\n        * Kimurai\n\n# Problems with Web Scraping\n\n\nWeb scraping is usually easy to get started with, especially on a small scale. However, as you try to scale it up, it gets exponentially more difficult. Scraping 10,000 records can easily be done with simple web scraper scripts in any programming language, but as you try to scrape millions of pages, you need to architect and build features into your web scraping script that allow you to scale, maintain, and unblock your scrapers. \n\n\n**DataHen Till** solves the following problems:\n\n\n## Scaling your scraper\nScraping millions or even billions of records requires much more planning. It's not simply a matter of running your existing web scraper script on a machine with more CPU and RAM. \nMore thought is needed on questions such as: \n\n- How to log massive amounts of HTTP requests. 
\n- How to troubleshoot HTTP requests when they fail at scale.\n- How to minimize bandwidth usage. \n- How to rotate proxy IPs.\n- How to handle anti-scrapers.\n- What happens when a scraper fails.\n- How to resume scrapers after they are fixed.\n- etc.\n\n\nTill provides a plug-and-play method of making your web scrapers scalable and maintainable, following best practices from [DataHen](https://www.datahen.com) that make web scraping a pleasant experience. \n\n## Blocked scraper\nAs you try to scale up the number of requests, quite often the target websites will detect your scraper and try to block your requests with a CAPTCHA, by throttling, or by denying your requests completely. \n\nTill helps you avoid being detected as a web scraper by identifying your scraper as a real web browser. It does this by generating random `user-agent` headers and randomizing proxy IPs (that you supply) on every HTTP request. \n\nTill also makes it easy for you to troubleshoot why the target website blocks your scraper.\n\n## Scraper Maintenance\nMaintaining high-scale scrapers is challenging due to the massive volume of requests and interactions between your scrapers and the target websites. For smooth operation, you need to think through how to maintain your scrapers regularly. \n\nYou need to know how to raise and triage errors as they occur on your scrapers. Not all web scraping errors should be treated equally: some are ignorable, and some are urgent. So, you will need to work out the details of your \"development-deployment-maintenance\" process.\n\nTill solves this by logging all your HTTP requests and categorizing them as successful (2XX statuses) or failed (non-2XX statuses). 
Till also provides a Web UI to analyze the request history and make sense of what happened during your scraping process.\n\nTill makes scraper maintenance even easier by assigning each request a unique Global ID (GID) that is derived from the request's URL, method, body, etc. You can then use this GID to find where your scraper went wrong.\n\n## Postmortem analysis \u0026 reproducability\nThe biggest difficulty facing any web scraper developer is dealing with scraping failures. Your scraper fails when fetching or parsing certain URLs, but when you look at the target website and URLs, everything looks fine. How do you troubleshoot what already happened in that scenario? How do you reproduce that failed scrape so that you can fix the issue?\n\nTill stores all HTTP requests and their responses (including the response body/content) in a local cache. If at any time your scraper encounters an error, you can use the request's GID (Till assigns a Global ID, also called GID, to every request) to find the request and the actual response and content in the cache. In this way, you can analyze what went wrong with that particular request.\n\n## Starting over from scratch when it fails mid-way\nWebsites change all the time and without notice. Imagine running your web scraper for a week and then suddenly, somewhere along the way, it fails. It is frustrating that once you've fixed the scraper, there is a high chance that you'd need to start over from scratch again. On top of this, there are additional consequences, such as time delays and further charges related to proxy usage, bandwidth, storage, VM costs, etc. \n\nTill solves this by allowing you to replay your scrapers without actually needing to resend the HTTP requests to the target server.\nTill does this by assigning each HTTP request its own unique Global ID (GID) that is generated from the request's URL, method, headers, etc. 
It then stores all HTTP responses in the cache based on their GID.\n\nWhen you restart your scraper, the scraping process can go blazingly fast because Till now serves the cached version of the HTTP responses. All of this happens without any code changes to your existing web scraper.\n\n# Features\n\n\n\n## [User-Agent randomizer](https://till.datahen.com/docs/user-agent-randomizer)\nTill automatically generates a random user-agent on every request. You can identify your scraper as a desktop browser or a mobile browser, or override it with your own custom user-agent.\n\n## [Proxy IP address rotation](https://till.datahen.com/docs/proxy-ip-address-rotation)\nSupply a list of proxy IPs, and Till will pick one at random on every request. This saves you the time of setting up a separate proxy rotation service.\n\n## [Sticky Sessions](https://till.datahen.com/docs/sticky-sessions)\nYour scraper can selectively reuse the same user-agent, proxy IP, and cookie jar for multiple requests. This allows you to easily group your requests by workflow, and helps you avoid detection by anti-scraping systems. \n\n## [Managing Cookies](https://till.datahen.com/docs/sticky-sessions#manage-cookies)\nNo need to build cookie management logic into your scraper code. Till can store the cookies for you so that you can easily reuse them on subsequent requests.\n\n\n## [Request Logging](https://till.datahen.com/docs/request-log)\nTill logs each request as either successful (2XX status code) or failed (non-2XX status code). This allows you to easily troubleshoot your scraper later. \n\nThe Till UI allows you to make sense of the HTTP request history and troubleshoot what happened during a scraping session.\n\n\n## [HTTP Caching](https://till.datahen.com/docs/http-caching)\nTill caches all of your HTTP responses (and their contents), so that, as needed, your web scraper can reuse the cache without making another HTTP request to the target server. 
\n\nYou can choose whether to use a particular cached response by specifying how fresh the cached content must be. For example, if Till holds cached content that is 1 week old, but your web scraper only wants content that is at most 1 day old, Till will only serve cached content that is no more than 1 day old.\n\n![HTTP Caching Flowchart](img/http-caching-flowchart.png)\n\n## [Global ID (GID)](https://till.datahen.com/docs/http-caching#gid)\nTill uses [DataHen Platform](https://www.datahen.com/platform)'s convention of marking every unique request with a signature (we call this the Global ID, or GID for short). Think of it as a checksum of the actual request. \n\nAny time your scraper sends a request through Till, Till returns a response with the header `X-DH-GID` that contains the GID. This GID allows you to easily troubleshoot requests when you need to look up specific requests in the log, or contents in the cache.\n\n## [Request Interceptions](https://till.datahen.com/docs/request-interception)\nTill can intercept any HTTP request of your choice and replace it with any HTTP response. \n\nThe following are some examples of useful scenarios:\n\n- Ignoring Google Analytics JavaScript\n- Ignoring images or other files\n- Replacing (stubbing) an API call with a different response\n- Restricting your scraper to only certain URL patterns\n\n\n# How DataHen Till works\n\nTill works as a Man In The Middle (MITM) proxy that listens to incoming HTTP(S) requests and forwards those requests to the target server as needed. While it does so, it enhances each request to avoid being detected by anti-scrapers. 
It also logs and caches the responses to make your scraper maintainable and scalable.\n\nConnect your scraper to Till via the `proxy` protocol that is commonly supported in any programming language.\n\nYour scraper will then continue to run as-is, and it will instantly become more unblockable, scalable, and maintainable.\n\n![How it works](img/how-it-works.png)\n\n# Installation\n\n## Step 1: Download Till\n\nThe recommended way to install DataHen Till is by downloading one of the [standalone binaries](https://github.com/DataHenHQ/till/releases) according to your OS.\n\n\n## Step 2: Get your auth Token\n\nYou need to get your auth token to run Till.\n\nGet your token for FREE by signing up for an account at [till.datahen.com](https://till.datahen.com).\n\n\n## Step 3: Start Till\n\nStart the Till server with the following command:\n```bash\n$ till serve -t \u003cyour token here\u003e \n```\nThe above starts the proxy on [http://localhost:2933](http://localhost:2933)\nand the Till UI on [http://localhost:2980](http://localhost:2980).\n\n![Request Log UI](img/request-log-ui.png)\n\n## Step 4: Connect to Till\n\nYou can connect your scraper to Till without many code changes. \n\nIf you want to connect to Till using curl, this is how:\n\n\n```bash\n$ curl -k --proxy http://localhost:2933 https://fetchtest.datahen.com/echo/request\n```\n\n\n\n# Certificate Authority (CA) Certificates\nTill decrypts and encrypts HTTPS traffic on the fly between your scraper and the target websites. In order to do so, your scraper (or browser) must be able to trust the built-in Certificate Authority (CA). This means the CA certificate that Till generates for you needs to be installed on the computer where the scraper is running.\n\n**Note:** If you do not wish to install the CA certificate, you can still have your scraper connect to the Till server by disabling/ignoring security checks in your scraper. 
Please refer to the programming language/framework/tool that your scraper uses.\n\n## Installing the generated CA certificates onto your computer\nThe first time Till runs as a server, Till generates the CA certificates in the following directory: \n\nLinux or MacOS:\n```\n~/.config/datahen/till/\n```\n\nWindows:\n```\nC:\\Users\\\u003cyour user\u003e\\.config\\datahen\\till\\\n```\nThen, please follow the following instructions to install the CA certificates:\n### MacOS\n\n[Add certificates to a keychain using Keychain Access on Mac](https://support.apple.com/en-ca/guide/keychain-access/kyca2431/mac)\n\n### Ubuntu/Debian\n[How do I install a root certificate](https://askubuntu.com/questions/73287/how-do-i-install-a-root-certificate/94861#94861)\n\n### Mozilla Firefox\n[how to import the Mozilla Root Certificate into your Firefox web browser](https://wiki.mozilla.org/MozillaRootCertificate#Mozilla_Firefox)\n\n### Chrome\n[Getting Chrome to accept self-signed localhost certificate](https://stackoverflow.com/questions/7580508/getting-chrome-to-accept-self-signed-localhost-certificate/15076602#15076602)\n\n### Windows\nUse `certutil` with the following command:\n\n```\ncertutil -addstore root \u003cpath to your CA cert file\u003e\n```\n\nRead more about [certutil](https://web.archive.org/web/20160612045445/http://windows.microsoft.com/en-ca/windows/import-export-certificates-private-keys#1TC=windows-7)\n\n\n\n# Till Integrations\n\n## Python\n\n### Scrapy\nThe [Scrapy example](examples/python/scrapy/) demonstrates how to integrate Till with Python's [Scrapy framework](https://github.com/scrapy/scrapy).\n\n\n## Node.js\n\n### Plain\nThe [Node.js example](examples/nodejs/plain/) demonstrates how to integrate Till with Node.js based scrapers.\n\n### Puppeteer\nThe [Puppeteer example](examples/nodejs/puppeteer/) demonstrates how to integrate Till with Puppeteer.\n\n## Go\n\n### net/http\nThe [Go net/http example](examples/go/standard) demonstrates how to integrate Till 
with Go's net/http standard library.\n\n### Colly\nThe [Go Colly example](examples/go/colly) demonstrates how to integrate Till with [Colly](https://github.com/gocolly/colly).\n\n## Ruby\n\n### Kimurai\nThe [Ruby's Kimurai framework example](examples/ruby/kimurai) demonstrates how to integrate Till with Ruby's [Kimurai framework](https://github.com/vifreefly/kimuraframework).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataHenHQ%2Ftill","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDataHenHQ%2Ftill","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDataHenHQ%2Ftill/lists"}