https://github.com/joshbeard/link-validator
CLI and GitHub Action for checking broken links
https://github.com/joshbeard/link-validator
Last synced: 12 months ago
JSON representation
CLI and GitHub Action for checking broken links
- Host: GitHub
- URL: https://github.com/joshbeard/link-validator
- Owner: joshbeard
- License: mit
- Created: 2025-05-27T18:14:33.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-06-27T16:49:53.000Z (12 months ago)
- Last Synced: 2025-06-30T03:49:00.736Z (12 months ago)
- Language: Go
- Homepage:
- Size: 244 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 4
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Link Validator
A tool and GitHub Action to check for broken links (4xx, 5xx status codes) on websites.
It supports both sitemap-based checking and web crawling.
## Features
- **Sitemap Support**: Check links from XML sitemaps
- **Website Crawling**: Recursively crawl websites to discover links
- **Concurrent Processing**: Configurable concurrent request limits for performance
- **Flexible Configuration**: Support for both command-line flags and environment variables
- **Pattern Exclusion**: Exclude URLs using regex patterns
- **GitHub Action Integration**: Built-in support for GitHub Actions with proper outputs
- **Dynamic URL Resolution**: Intelligent base URL detection using HTTP Content-Type headers
- **Comprehensive Reporting**: Detailed results with status codes, errors, and timing information
- **Help and Version Support**: Built-in help and version information
## Installation & Usage
### GitHub Action
Use directly in your GitHub workflows:
```yaml
- name: Check links
uses: joshbeard/gh-action-link-checker@v1
with:
sitemap-url: 'https://example.com/sitemap.xml'
```
### Docker Image
Available on GitHub Container Registry and Docker Hub:
```bash
# From GitHub Container Registry
docker run --rm ghcr.io/joshbeard/link-checker:latest \
--sitemap-url https://example.com/sitemap.xml
# From Docker Hub
docker run --rm joshbeard/link-checker:latest \
--sitemap-url https://example.com/sitemap.xml
```
### Binary Releases
Download pre-built binaries from [GitHub Releases](https://github.com/joshbeard/gh-action-link-checker/releases):
```bash
curl -L https://github.com/joshbeard/gh-action-link-checker/releases/latest/download/link-checker-linux-amd64 -o link-checker
chmod +x link-checker
./link-checker --sitemap-url https://example.com/sitemap.xml
```
### Getting Help
```bash
# Show help information
./link-checker --help
# Show version information
./link-checker --version
```
## Examples
### GitHub Action - Sitemap
```yaml
name: Check Links
on:
schedule:
- cron: '0 0 * * 0' # Weekly on Sunday
workflow_dispatch:
jobs:
link-check:
runs-on: ubuntu-latest
steps:
- name: Check links from sitemap
uses: joshbeard/gh-action-link-checker@v1
with:
sitemap-url: 'https://example.com/sitemap.xml'
timeout: 30
max-concurrent: 10
exclude-patterns: '.*\.pdf$,.*example\.com.*'
```
### GitHub Action - Web Crawling
```yaml
name: Check Links
on:
push:
branches: [main]
jobs:
link-check:
runs-on: ubuntu-latest
steps:
- name: Check links by crawling
uses: joshbeard/gh-action-link-checker@v1
with:
base-url: 'https://example.com'
max-depth: 3
timeout: 30
max-concurrent: 5
fail-on-error: true
```
### GitLab CI
```yaml
link-check:
stage: test
image: ghcr.io/joshbeard/link-checker:latest
script:
- link-checker --sitemap-url https://example.com/sitemap.xml --timeout 30 --max-concurrent 5
rules:
- if: $CI_PIPELINE_SOURCE == "schedule"
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```
### Docker with Custom Configuration
```bash
docker run --rm ghcr.io/joshbeard/link-checker:latest \
--base-url https://example.com \
--max-depth 2 \
--timeout 60 \
--exclude-patterns ".*\.pdf$,.*\.zip$" \
--verbose
```
### Complete GitHub Action with Error Handling
```yaml
name: Link Checker
on:
schedule:
- cron: '0 2 * * 1' # Weekly on Monday at 2 AM
workflow_dispatch:
jobs:
check-links:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Check website links
id: link-check
uses: joshbeard/gh-action-link-checker@v1
with:
sitemap-url: 'https://example.com/sitemap.xml'
timeout: 30
user-agent: 'MyBot/1.0'
exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*'
max-concurrent: 10
fail-on-error: false
- name: Comment on PR if broken links found
if: steps.link-check.outputs.broken-links-count > 0
uses: actions/github-script@v7
with:
script: |
const brokenLinks = JSON.parse('${{ steps.link-check.outputs.broken-links }}');
const count = '${{ steps.link-check.outputs.broken-links-count }}';
let comment = `## 🔗 Link Check Results\n\n`;
comment += `Found ${count} broken link(s):\n\n`;
brokenLinks.forEach(link => {
comment += `- ❌ [${link.url}](${link.url}) - ${link.error}\n`;
});
console.log(comment);
```
## Configuration
### Inputs (GitHub Action)
| Input | Description | Required | Default |
|-------|-------------|----------|---------|
| `sitemap-url` | URL to sitemap.xml to check links from | No | - |
| `base-url` | Base URL to crawl for links (used if sitemap-url not provided) | No | - |
| `max-depth` | Maximum crawl depth when using base-url | No | `3` |
| `timeout` | Request timeout in seconds | No | `30` |
| `user-agent` | User agent string for requests | No | `GitHub-Action-Link-Checker/1.0` |
| `exclude-patterns` | Comma-separated list of URL patterns to exclude (regex supported) | No | - |
| `fail-on-error` | Whether to fail the action if broken links are found | No | `true` |
| `max-concurrent` | Maximum number of concurrent requests | No | `10` |
| `verbose` | Show detailed output for each link checked | No | `false` |
### Command Line Flags
When using the binary or Docker image, use these flags:
```bash
-sitemap-url string URL to sitemap.xml
-base-url string Base URL to crawl
-max-depth int Maximum crawl depth (default 3)
-timeout int Request timeout in seconds (default 30)
-user-agent string User agent string (default "GitHub-Action-Link-Checker/1.0")
-exclude-patterns string Comma-separated exclude patterns
-max-concurrent int Max concurrent requests (default 10)
-fail-on-error Exit with error code if broken links found (default true)
-verbose Show detailed output
-help Show help information
-version Show version information
```
### Environment Variables
The tool supports environment variables (primarily for GitHub Action integration):
```bash
INPUT_SITEMAP_URL URL of the sitemap to check
INPUT_BASE_URL Base URL to start crawling from
INPUT_MAX_DEPTH Maximum crawl depth (default: 3)
INPUT_TIMEOUT Request timeout in seconds (default: 30)
INPUT_USER_AGENT User agent string (default: Link-Validator/1.0)
INPUT_EXCLUDE_PATTERNS Comma-separated regex patterns to exclude URLs
INPUT_FAIL_ON_ERROR Exit with error code if broken links found (default: true)
INPUT_MAX_CONCURRENT Maximum concurrent requests (default: 10)
INPUT_VERBOSE Enable verbose output (default: false)
```
**Note**: Command line flags take precedence over environment variables.
### Outputs (GitHub Action)
| Output | Description |
|--------|-------------|
| `broken-links-count` | Number of broken links found |
| `broken-links` | JSON array of broken links with details |
| `total-links-checked` | Total number of links checked |
## Advanced Usage
### Using Environment Variables
You can use environment variables instead of command line flags:
```bash
# Check links from sitemap using environment variables
INPUT_SITEMAP_URL=https://example.com/sitemap.xml ./link-checker
# Crawl website using environment variables
INPUT_BASE_URL=https://example.com INPUT_MAX_DEPTH=2 INPUT_VERBOSE=true ./link-checker
```
### Exclude Patterns
You can exclude URLs using regex patterns:
```yaml
with:
exclude-patterns: '.*\.pdf$,.*\.zip$,.*example\.com.*,.*#.*'
```
This will exclude:
- PDF files
- ZIP files
- Any URLs containing "example.com"
- Any URLs with fragments (anchors)
### Rate Limiting
Control concurrent requests to be respectful to target servers:
```yaml
with:
max-concurrent: 5 # Only 5 concurrent requests
timeout: 60 # 60 second timeout per request
```
### Verbose Output
Enable detailed output to see each link as it's being checked:
```yaml
with:
verbose: true
```
This will show output like:
```
✅ [1/111] https://example.com/page1 (Status: 200, Duration: 45ms)
❌ [2/111] https://example.com/broken (Status: 404, Duration: 23ms)
🔄 [3/111] https://example.com/redirect (Status: 301, Duration: 67ms)
```
Status emojis:
- ✅ Success (2xx)
- 🔄 Redirect (3xx)
- ❌ Client Error (4xx)
- 💥 Server Error (5xx)
- ❓ Unknown/Error
## Development
### Building
```bash
go mod tidy
go build -o link-checker ./cmd/link-checker
```
Or use the Makefile:
```bash
make build # Build the binary
make test # Run tests
make help # See all available targets
```
### Testing
Run the test suite:
```bash
go test ./... # Run all tests
go test ./... -cover # Run with coverage
go test ./... -v # Verbose output
```
### Test Coverage
The project maintains high test coverage. To generate a coverage report:
```bash
go test -coverprofile=coverage.out ./...
go tool cover -html=coverage.out -o coverage.html
```
## Dynamic URL Resolution
The link checker uses intelligent URL resolution to properly handle relative links on web pages:
1. **HTML Base Tag Detection**: If a page contains a `` tag, it uses that as the base URL for resolving relative links.
2. **Dynamic Content-Type Analysis**: When no base tag is present, the tool makes HTTP HEAD requests to determine if a URL represents a file or directory based on the Content-Type header:
- **Directory-like content** (`text/html`, `application/json`, `application/xml`): Treats the URL as a directory for relative link resolution
- **File-like content** (`application/pdf`, `image/*`, `audio/*`, `video/*`, etc.): Uses the parent directory for relative link resolution
3. **Extension-based Fallback**: If HTTP detection fails, falls back to file extension analysis to determine URL type.
## License
MIT License - see LICENSE file for details.