{"id":25268499,"url":"https://github.com/vmandic/tris-web-crawler","last_synced_at":"2026-05-20T10:11:28.479Z","repository":{"id":217214444,"uuid":"740714393","full_name":"vmandic/tris-web-crawler","owner":"vmandic","description":"Tris is a simple NodeJS web crawler tool to help you collect links from visited links of a website's domain.","archived":false,"fork":false,"pushed_at":"2024-02-10T00:30:04.000Z","size":1219,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-06-27T01:07:49.483Z","etag":null,"topics":["crawler","data-tools","nodejs","scraping","seo-tools","web-scraper"],"latest_commit_sha":null,"homepage":"https://tris.fly.dev","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vmandic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-01-08T23:02:58.000Z","updated_at":"2024-02-10T00:25:32.000Z","dependencies_parsed_at":"2024-02-10T01:27:37.507Z","dependency_job_id":"4b1624c0-9a2f-43cd-acc0-a9a0e93bcca9","html_url":"https://github.com/vmandic/tris-web-crawler","commit_stats":null,"previous_names":["vmandic/tris-simple-spider-scraper"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmandic%2Ftris-web-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmandic%2Ftris-web-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmandic%2Ftris-web-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/vmandic%2Ftris-web-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vmandic","download_url":"https://codeload.github.com/vmandic/tris-web-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238454781,"owners_count":19475308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","data-tools","nodejs","scraping","seo-tools","web-scraper"],"created_at":"2025-02-12T10:29:51.216Z","updated_at":"2026-05-20T10:11:23.456Z","avatar_url":"https://github.com/vmandic.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tris - A Simple Web Crawler 🕸🕷\n\n![Docker build](https://github.com/vmandic/tris-web-crawler/actions/workflows/docker-image.yml/badge.svg)\n\nDocker 🐋 images: https://hub.docker.com/repository/docker/vmandic/tris\n\n## Try it out online\n\n**Feel free to crawl the first 100 links as much as you like: [https://tris.fly.dev](https://tris.fly.dev)**\n\nThe online tool is rate limited to three parallel crawler processes and live-streams the results back to the results page. To change the target URL, edit it in the address bar query string.\n\n## About Tris web crawler\n\n_Tris_ is a NodeJS tool that is, at its core, a very simple web crawler. It recursively crawls a target domain's HTML pages for anchor elements, visits each link only once, and delivers the list of visited links with the associated HTTP response status code. 
Tris provides various customization options to tailor the crawling and scraping process according to your needs.\n\nWhether you're a developer, SEO professional, or data enthusiast, Tris provides a simple yet powerful solution to gather valuable insights from websites.\n\n\u003cimg src=\"./assets/tris-screenshot.jpg\" alt=\"Tris web browser screenshot of results page\" /\u003e\n\n## Who Can Benefit?\n\n### Developers\n\nTris is ideal for developers who need a quick and reliable way to extract links from a website, whether for indexing purposes, link analysis, or content mapping. Tris will help you find \"dead\" links, i.e. pages that do not return HTTP 200.\n\n### SEO Professionals\n\nSEO professionals can leverage Tris to gather valuable data about a website's structure, internal linking, and potential SEO opportunities. It is ideal when a domain lacks a sitemap file, because that is what Tris will create for you: Tris can be your sitemap generator.\n\n### Data Enthusiasts\n\nData enthusiasts seeking to explore and analyze the structure of websites can use Tris to collect link data and gain insights into a website's content hierarchy.\n\n## Features\n\n- **Customizable Settings**: Configure the crawler with various settings using environment variables.\n- **Timeout Handling**: Specify the timeout in milliseconds for each request.\n- **Path Depth Limitation**: Set the maximum depth of paths to be crawled.\n- **Randomized User Agents**: Provide a list of custom user-agent headers that are randomized between requests.\n- **Skip Words**: Skip links that contain specified skip words.\n- **Sorting Output**: Optionally sort the output file lines in ascending order.\n- **Delay Between Requests**: Introduce a delay between requests to avoid overloading the server.\n- **HTTP Status Codes**: Optionally include HTTP status codes in the output file.\n- **Include/Exclude Paths**: Filter links based on specified path patterns.\n- **Trim Ending Slash**: Control whether trailing slashes are removed 
from URLs.\n- **Exclude Query String and Fragment**: Optionally exclude query strings and fragments from URLs.\n- **Limit Number of Requests**: Optionally limit the total number of web requests to be sent.\n\n## Limitations\n\nWell, forget about pre-rendering SPA JavaScript based sites and forget about custom elements facilitating navigation functionality. Forget about passing advanced spam protection services like Cloudflare and similar. Those are some of the basic and usual constraints that Tris web crawler currently cannot overcome.\n\nTris web crawler expects that it won't get blocked as a bot and that the URL it requests will serve an HTML page (with HTTP response status code 200) with `\u003ca href=\"URL here\"\u003e\u003c/a\u003e` elements that can be picked up and visited.\n\n## Setup\n\nThe prerequisite is NodeJS \u003e= v14.17 (check out the related [package.json](./package.json)).\n\n1. Clone the repository:\n\n```bash\ngit clone https://github.com/vmandic/tris-web-crawler.git\n```\n\n2. Yarn install:\n\n```bash\nyarn install\n```\n\n## How to use\n\nTris is intended (so far) to be used on your own local computer so you can configure it as you like. 
To run it, complete the setup above and follow the guides below.\n\nYou can also try the online version,\n**feel free to crawl the first 100 links as much as you like: [https://tris.fly.dev](https://tris.fly.dev)**\n\nBefore running locally you can set up the .env file by looking at the [.env.example](./.env.example) file for the crawler configuration options.\n\n### From the terminal with the CLI\n\nThe results will be printed directly to the terminal standard output.\n\n```bash\nyarn start:cli https://www.index.hr\n```\n\n### As a local HTTP server\n\nBy default the web app will be served on :8080; you can specify your own port as the first parameter.\nThe following example will serve on the default 8080 port:\n\n```bash\nyarn start\n```\n\nThe server crawler link works by opening a web socket connection on the given web server port +1 (i.e. if you selected port 7777 then the socket will be on 7778). The web socket is used to transmit live scraping results back to the web page so you can see results as they come back.\n\nStart on port 7777 by explicitly specifying it:\n\n```bash\nyarn serve 7777\n```\n\nAfter starting the web server navigate to:\u003cbr\u003e`/crawl?url={specify valid URL to start scraping from here}`\n\n## Debugging\n\nYou can debug the app in either CLI or web server run mode using the configuration in [`./.vscode/launch.json`](./.vscode/launch.json), with VS Code as your IDE and debugging tool.\n\nFor continuous development and web server restart you can use the following nodemon watch command that will restart the app on any code change:\n\n```bash\nyarn serve:w\n```\n\n## Configuration (Environment Variables)\n\nSet up the .env file by copying [.env.example](./.env.example):\n\n```bash\ncp .env.example .env\n```\n\n- `WEB_REQUESTS_LIMIT`: Set to limit the number of requests (default: unlimited, i.e. 
0).\n- `TIMEOUT_MS`: Set the timeout in milliseconds (default: 10ms).\n- `PATH_DEPTH`: Set the path depth limit (default: 3).\n- `USER_AGENTS`: Provide a list of custom user-agent headers (comma-delimited).\n- `USE_RANDOM_AGENTS_COUNT`: Overrides USER_AGENTS if specified, generates N random UAs (default: 0).\n- `SKIP_WORDS`: Specify skip words to skip links (comma-delimited).\n- `SORT_FILE_OUTPUT`: Set to \"true\" to sort output lines in ascending order.\n- `DELAY_MS`: Introduce a delay between requests (default: 0).\n- `INCLUDE_PATH`: Specify a path pattern to include only matching paths.\n- `OUTPUT_HTTP_CODE`: Set to \"true\" to include HTTP status codes in the output.\n- `EXCLUDE_QUERY_STRING`: Set to \"true\" to exclude query strings from URLs (default: false).\n- `EXCLUDE_FRAGMENT`: Set to \"true\" to exclude fragments from URLs (default: false).\n\n## Author\n\nVedran Mandić\n\n## Why Tris?\n\nTris stands out as a simple yet effective solution for web scraping, providing a balance between customization and ease of use. As said at the beginning of this document, whether you're a developer, SEO professional, or data enthusiast, Tris empowers you to gather valuable insights from websites with minimal setup and maximum flexibility.\n\nStart exploring the web with Tris today!\n\nFeel free to modify or extend it further based on your preferences!\n\n## License\n\nThis project is licensed under the [ISC License](LICENSE) - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmandic%2Ftris-web-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvmandic%2Ftris-web-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmandic%2Ftris-web-crawler/lists"}