{"id":29248422,"url":"https://github.com/joshbeard/link-validator","last_synced_at":"2025-07-04T00:06:37.458Z","repository":{"id":295871543,"uuid":"991503976","full_name":"joshbeard/link-validator","owner":"joshbeard","description":"CLI and GitHub Action for checking broken links","archived":false,"fork":false,"pushed_at":"2025-06-27T16:49:53.000Z","size":250,"stargazers_count":0,"open_issues_count":4,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-30T03:49:00.736Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joshbeard.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-27T18:14:33.000Z","updated_at":"2025-05-28T04:00:59.000Z","dependencies_parsed_at":"2025-05-27T20:29:26.618Z","dependency_job_id":"1e44b074-07dd-49af-a309-1fdc35703155","html_url":"https://github.com/joshbeard/link-validator","commit_stats":null,"previous_names":["joshbeard/gh-action-link-checker","joshbeard/link-validator"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/joshbeard/link-validator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshbeard%2Flink-validator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshbeard%2Flink-validator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshbeard%2Flink-validator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshbeard%
2Flink-validator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joshbeard","download_url":"https://codeload.github.com/joshbeard/link-validator/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joshbeard%2Flink-validator/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263421922,"owners_count":23464049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-07-04T00:06:36.253Z","updated_at":"2025-07-04T00:06:37.311Z","avatar_url":"https://github.com/joshbeard.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Link Validator\n\nA tool and GitHub Action to check for broken links (4xx, 5xx status codes) on websites.\nIt supports both sitemap-based checking and web crawling.\n\n## Features\n\n- **Sitemap Support**: Check links from XML sitemaps\n- **Website Crawling**: Recursively crawl websites to discover links\n- **Concurrent Processing**: Configurable concurrent request limits for performance\n- **Flexible Configuration**: Support for both command-line flags and environment variables\n- **Pattern Exclusion**: Exclude URLs using regex patterns\n- **GitHub Action Integration**: Built-in support for GitHub Actions with proper outputs\n- **Dynamic URL Resolution**: Intelligent base URL detection using HTTP Content-Type headers\n- **Comprehensive Reporting**: Detailed results with status codes, errors, and timing information\n- **Help and Version Support**: Built-in help and version information\n\n## 
Installation \u0026 Usage\n\n### GitHub Action\n\nUse directly in your GitHub workflows:\n\n```yaml\n- name: Check links\n  uses: joshbeard/gh-action-link-checker@v1\n  with:\n    sitemap-url: 'https://example.com/sitemap.xml'\n```\n\n### Docker Image\n\nAvailable on GitHub Container Registry and Docker Hub:\n\n```bash\n# From GitHub Container Registry\ndocker run --rm ghcr.io/joshbeard/link-checker:latest \\\n  --sitemap-url https://example.com/sitemap.xml\n\n# From Docker Hub\ndocker run --rm joshbeard/link-checker:latest \\\n  --sitemap-url https://example.com/sitemap.xml\n```\n\n### Binary Releases\n\nDownload pre-built binaries from [GitHub Releases](https://github.com/joshbeard/gh-action-link-checker/releases):\n\n```bash\ncurl -L https://github.com/joshbeard/gh-action-link-checker/releases/latest/download/link-checker-linux-amd64 -o link-checker\nchmod +x link-checker\n./link-checker --sitemap-url https://example.com/sitemap.xml\n```\n\n### Getting Help\n\n```bash\n# Show help information\n./link-checker --help\n\n# Show version information\n./link-checker --version\n```\n\n## Examples\n\n### GitHub Action - Sitemap\n\n```yaml\nname: Check Links\non:\n  schedule:\n    - cron: '0 0 * * 0'  # Weekly on Sunday\n  workflow_dispatch:\n\njobs:\n  link-check:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Check links from sitemap\n        uses: joshbeard/gh-action-link-checker@v1\n        with:\n          sitemap-url: 'https://example.com/sitemap.xml'\n          timeout: 30\n          max-concurrent: 10\n          exclude-patterns: '.*\\.pdf$,.*example\\.com.*'\n```\n\n### GitHub Action - Web Crawling\n\n```yaml\nname: Check Links\non:\n  push:\n    branches: [main]\n\njobs:\n  link-check:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Check links by crawling\n        uses: joshbeard/gh-action-link-checker@v1\n        with:\n          base-url: 'https://example.com'\n          max-depth: 3\n          timeout: 30\n          max-concurrent: 5\n        
  fail-on-error: true\n```\n\n### GitLab CI\n\n```yaml\nlink-check:\n  stage: test\n  image: ghcr.io/joshbeard/link-checker:latest\n  script:\n    - link-checker --sitemap-url https://example.com/sitemap.xml --timeout 30 --max-concurrent 5\n  rules:\n    - if: $CI_PIPELINE_SOURCE == \"schedule\"\n    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH\n```\n\n### Docker with Custom Configuration\n\n```bash\ndocker run --rm ghcr.io/joshbeard/link-checker:latest \\\n  --base-url https://example.com \\\n  --max-depth 2 \\\n  --timeout 60 \\\n  --exclude-patterns \".*\\.pdf$,.*\\.zip$\" \\\n  --verbose\n```\n\n### Complete GitHub Action with Error Handling\n\n```yaml\nname: Link Checker\non:\n  schedule:\n    - cron: '0 2 * * 1'  # Weekly on Monday at 2 AM\n  workflow_dispatch:\n\njobs:\n  check-links:\n    runs-on: ubuntu-latest\n    steps:\n      - name: Checkout\n        uses: actions/checkout@v4\n\n      - name: Check website links\n        id: link-check\n        uses: joshbeard/gh-action-link-checker@v1\n        with:\n          sitemap-url: 'https://example.com/sitemap.xml'\n          timeout: 30\n          user-agent: 'MyBot/1.0'\n          exclude-patterns: '.*\\.pdf$,.*\\.zip$,.*example\\.com.*'\n          max-concurrent: 10\n          fail-on-error: false\n\n      - name: Comment on PR if broken links found\n        if: steps.link-check.outputs.broken-links-count \u003e 0\n        uses: actions/github-script@v7\n        with:\n          script: |\n            const brokenLinks = JSON.parse('${{ steps.link-check.outputs.broken-links }}');\n            const count = '${{ steps.link-check.outputs.broken-links-count }}';\n\n            let comment = `## 🔗 Link Check Results\\n\\n`;\n            comment += `Found ${count} broken link(s):\\n\\n`;\n\n            brokenLinks.forEach(link =\u003e {\n              comment += `- ❌ [${link.url}](${link.url}) - ${link.error}\\n`;\n            });\n\n            console.log(comment);\n```\n\n## Configuration\n\n### Inputs 
(GitHub Action)\n\n| Input | Description | Required | Default |\n|-------|-------------|----------|---------|\n| `sitemap-url` | URL to sitemap.xml to check links from | No | - |\n| `base-url` | Base URL to crawl for links (used if sitemap-url not provided) | No | - |\n| `max-depth` | Maximum crawl depth when using base-url | No | `3` |\n| `timeout` | Request timeout in seconds | No | `30` |\n| `user-agent` | User agent string for requests | No | `GitHub-Action-Link-Checker/1.0` |\n| `exclude-patterns` | Comma-separated list of URL patterns to exclude (regex supported) | No | - |\n| `fail-on-error` | Whether to fail the action if broken links are found | No | `true` |\n| `max-concurrent` | Maximum number of concurrent requests | No | `10` |\n| `verbose` | Show detailed output for each link checked | No | `false` |\n\n### Command Line Flags\n\nWhen using the binary or Docker image, use these flags:\n\n```bash\n-sitemap-url string       URL to sitemap.xml\n-base-url string          Base URL to crawl\n-max-depth int            Maximum crawl depth (default 3)\n-timeout int              Request timeout in seconds (default 30)\n-user-agent string        User agent string (default \"GitHub-Action-Link-Checker/1.0\")\n-exclude-patterns string  Comma-separated exclude patterns\n-max-concurrent int       Max concurrent requests (default 10)\n-fail-on-error           Exit with error code if broken links found (default true)\n-verbose                 Show detailed output\n-help                    Show help information\n-version                 Show version information\n```\n\n### Environment Variables\n\nThe tool supports environment variables (primarily for GitHub Action integration):\n\n```bash\nINPUT_SITEMAP_URL         URL of the sitemap to check\nINPUT_BASE_URL            Base URL to start crawling from\nINPUT_MAX_DEPTH           Maximum crawl depth (default: 3)\nINPUT_TIMEOUT             Request timeout in seconds (default: 30)\nINPUT_USER_AGENT          User agent 
string (default: Link-Validator/1.0)\nINPUT_EXCLUDE_PATTERNS    Comma-separated regex patterns to exclude URLs\nINPUT_FAIL_ON_ERROR       Exit with error code if broken links found (default: true)\nINPUT_MAX_CONCURRENT      Maximum concurrent requests (default: 10)\nINPUT_VERBOSE             Enable verbose output (default: false)\n```\n\n**Note**: Command line flags take precedence over environment variables.\n\n### Outputs (GitHub Action)\n\n| Output | Description |\n|--------|-------------|\n| `broken-links-count` | Number of broken links found |\n| `broken-links` | JSON array of broken links with details |\n| `total-links-checked` | Total number of links checked |\n\n## Advanced Usage\n\n### Using Environment Variables\n\nYou can use environment variables instead of command line flags:\n\n```bash\n# Check links from sitemap using environment variables\nINPUT_SITEMAP_URL=https://example.com/sitemap.xml ./link-checker\n\n# Crawl website using environment variables\nINPUT_BASE_URL=https://example.com INPUT_MAX_DEPTH=2 INPUT_VERBOSE=true ./link-checker\n```\n\n### Exclude Patterns\n\nYou can exclude URLs using regex patterns:\n\n```yaml\nwith:\n  exclude-patterns: '.*\\.pdf$,.*\\.zip$,.*example\\.com.*,.*#.*'\n```\n\nThis will exclude:\n- PDF files\n- ZIP files\n- Any URLs containing \"example.com\"\n- Any URLs with fragments (anchors)\n\n### Rate Limiting\n\nControl concurrent requests to be respectful to target servers:\n\n```yaml\nwith:\n  max-concurrent: 5  # Only 5 concurrent requests\n  timeout: 60        # 60 second timeout per request\n```\n\n### Verbose Output\n\nEnable detailed output to see each link as it's being checked:\n\n```yaml\nwith:\n  verbose: true\n```\n\nThis will show output like:\n```\n✅ [1/111] https://example.com/page1 (Status: 200, Duration: 45ms)\n❌ [2/111] https://example.com/broken (Status: 404, Duration: 23ms)\n🔄 [3/111] https://example.com/redirect (Status: 301, Duration: 67ms)\n```\n\nStatus emojis:\n- ✅ Success (2xx)\n- 🔄 Redirect 
(3xx)\n- ❌ Client Error (4xx)\n- 💥 Server Error (5xx)\n- ❓ Unknown/Error\n\n## Development\n\n### Building\n\n```bash\ngo mod tidy\ngo build -o link-checker ./cmd/link-checker\n```\n\nOr use the Makefile:\n\n```bash\nmake build    # Build the binary\nmake test     # Run tests\nmake help     # See all available targets\n```\n\n### Testing\n\nRun the test suite:\n\n```bash\ngo test ./...              # Run all tests\ngo test ./... -cover       # Run with coverage\ngo test ./... -v           # Verbose output\n```\n\n### Test Coverage\n\nThe project maintains high test coverage. To generate a coverage report:\n\n```bash\ngo test -coverprofile=coverage.out ./...\ngo tool cover -html=coverage.out -o coverage.html\n```\n\n## Dynamic URL Resolution\n\nThe link checker uses intelligent URL resolution to properly handle relative links on web pages:\n\n1. **HTML Base Tag Detection**: If a page contains a `\u003cbase href=\"...\"\u003e` tag, it uses that as the base URL for resolving relative links.\n\n2. **Dynamic Content-Type Analysis**: When no base tag is present, the tool makes HTTP HEAD requests to determine if a URL represents a file or directory based on the Content-Type header:\n   - **Directory-like content** (`text/html`, `application/json`, `application/xml`): Treats the URL as a directory for relative link resolution\n   - **File-like content** (`application/pdf`, `image/*`, `audio/*`, `video/*`, etc.): Uses the parent directory for relative link resolution\n\n3. **Extension-based Fallback**: If HTTP detection fails, the tool falls back to file extension analysis to determine the URL type.\n\n## License\n\nMIT License - see LICENSE file for details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshbeard%2Flink-validator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoshbeard%2Flink-validator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoshbeard%2Flink-validator/lists"}