https://github.com/dgrebb/conjob
Confluence scraper utilities for building llms.txt.
- Host: GitHub
- URL: https://github.com/dgrebb/conjob
- Owner: dgrebb
- Created: 2025-02-01T12:32:29.000Z (over 1 year ago)
- Default Branch: develop
- Last Pushed: 2025-02-04T13:20:02.000Z (over 1 year ago)
- Last Synced: 2025-03-23T23:44:33.753Z (about 1 year ago)
- Language: JavaScript
- Size: 35.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
README
# Conjob: Confluence to Markdown Scraper
A Node.js tool to scrape Confluence spaces and convert them to Markdown files while preserving the page hierarchy.
## Features
- 📚 Scrape all Confluence spaces or a single space
- 🔄 Convert Confluence storage format to Markdown
- 📁 Preserve page hierarchy in directory structure
- 🔁 Handle rate limiting with exponential backoff
- 🔗 Maintain page relationships and ordering
## Quick Start
```bash
# Install dependencies
pnpm install
# Configure your Confluence instance by editing utils/index.js
# (these are JavaScript lines to set in that file, not shell commands):
#   export const BASE_URL = "http://your-confluence-instance/rest/api";
#   export const ACCESS_TOKEN = "your-personal-access-token";
# Scrape all spaces
pnpm space:all
# Or scrape a specific space
pnpm space:single ENGINEERING
```
## Architecture Decisions
This project follows a documented decision-making process. Key architectural decisions:
1. [API Integration](docs/api-integration.md)
   - Native fetch with backoff
   - Centralized API client
   - Type-safe responses
2. [File Structure](docs/file-structure.md)
   - Feature-based organization
   - Clear separation of concerns
   - Consistent patterns
3. [Error Handling](docs/error-handling.md)
   - Centralized error handling
   - Retry mechanisms
   - Consistent error messages
4. [CLI Interface](docs/cli-interface.md)
   - Command-based interface
   - Progress feedback
   - Clear usage instructions
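As an illustration of the "clear usage instructions" point, here is a minimal sketch of how a single-space command might validate its argument. The names `parseSpaceKey` and `USAGE` are hypothetical and not taken from the project, and the key pattern is an assumption about what a valid space key looks like.

```javascript
// Hypothetical parser for a single-space command; names are illustrative only.
const USAGE = "Usage: pnpm space:single <SPACE_KEY>";

function parseSpaceKey(argv) {
  // argv[0] is the node binary and argv[1] the script path, so skip both.
  const [spaceKey] = argv.slice(2);
  if (!spaceKey) {
    return { ok: false, message: `Missing space key. ${USAGE}` };
  }
  // Assumption: keys are alphanumeric, with an optional leading "~"
  // for personal spaces. Adjust the pattern to match your instance.
  if (!/^~?[A-Za-z0-9]+$/.test(spaceKey)) {
    return { ok: false, message: `Invalid space key "${spaceKey}". ${USAGE}` };
  }
  return { ok: true, spaceKey };
}

// Example with an explicit argument vector instead of process.argv.
const parsed = parseSpaceKey(["node", "all-space-content.js", "ENGINEERING"]);
console.log(parsed.ok ? `Scraping space ${parsed.spaceKey}` : parsed.message);
```

Returning a result object rather than calling `process.exit` inside the parser keeps the function easy to test; the calling script can decide how to report the message.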
## Project Structure
```
.
├── scripts/                  # CLI commands
│   ├── all-spaces.js         # Scrape all spaces
│   └── all-space-content.js  # Scrape a single space
├── utils/                    # Shared utilities
│   └── index.js              # API client, helpers
└── docs/                     # Documentation
    ├── api-examples.md
    ├── api-integration.md
    ├── cli-interface.md
    ├── error-handling.md
    └── file-structure.md
```
## Development
```bash
# Format code
pnpm format
```
## Output Structure
The scraper creates a directory structure that mirrors the page hierarchy of each Confluence space:
```
confluence_markdown/
├── SPACE1/
│   ├── home/
│   │   ├── index.md             (Space homepage)
│   │   └── Other Root Pages.md
│   └── Parent Page/
│       ├── index.md             (Parent page content)
│       └── Child Page.md
└── SPACE2/
    └── ...
```
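The layout above can be read as a small set of placement rules. The sketch below is one possible interpretation, inferred only from the example tree and not taken from the project's code: the function `outputPath`, the page fields (`ancestors`, `hasChildren`, `isHomepage`), and the filename sanitizing are all hypothetical.

```javascript
// Assumed to match the OUTPUT_DIR value used in the configuration.
const OUTPUT_DIR = "confluence_markdown";

// Replace characters that are unsafe in file names (an assumption, not project behavior).
const safeName = (name) => name.replace(/[\/\\:*?"<>|]/g, "-").trim();

// Hypothetical mapping from a page description to a relative output path.
// `ancestors` lists parent page titles from the top down, excluding the homepage.
function outputPath({ spaceKey, title, ancestors = [], hasChildren = false, isHomepage = false }) {
  const root = [OUTPUT_DIR, spaceKey];
  if (isHomepage) {
    // The space homepage becomes home/index.md.
    return [...root, "home", "index.md"].join("/");
  }
  const dirs = ancestors.map(safeName);
  if (hasChildren) {
    // A page with children gets its own directory and an index.md.
    return [...root, ...dirs, safeName(title), "index.md"].join("/");
  }
  if (dirs.length === 0) {
    // Root-level pages without children sit next to the homepage.
    return [...root, "home", `${safeName(title)}.md`].join("/");
  }
  // Leaf pages are written into their parent's directory.
  return [...root, ...dirs, `${safeName(title)}.md`].join("/");
}

console.log(outputPath({ spaceKey: "SPACE1", title: "Child Page", ancestors: ["Parent Page"] }));
// → confluence_markdown/SPACE1/Parent Page/Child Page.md
```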
## Configuration
Configure your Confluence instance in `utils/index.js`:
```javascript
export const BASE_URL = "http://your-confluence-instance/rest/api";
export const ACCESS_TOKEN = "your-personal-access-token";
export const OUTPUT_DIR = "confluence_markdown";
```
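To show how these values might be used, here is a minimal sketch of building an authenticated request. It assumes the personal access token is sent as a `Bearer` token in the `Authorization` header; the helper `buildRequest` is hypothetical and not part of the project, and the `space` endpoint with a `limit` parameter is used only as an example.

```javascript
// Placeholder values copied from the configuration above.
const BASE_URL = "http://your-confluence-instance/rest/api";
const ACCESS_TOKEN = "your-personal-access-token";

// Hypothetical helper: builds the URL and fetch options for one API call.
function buildRequest(endpoint, params = {}) {
  // Strip leading slashes so "space" and "/space" resolve to the same URL.
  const url = new URL(`${BASE_URL}/${endpoint.replace(/^\/+/, "")}`);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, String(value));
  }
  return {
    url: url.toString(),
    options: {
      headers: {
        // Assumption: the token is accepted as a Bearer token.
        Authorization: `Bearer ${ACCESS_TOKEN}`,
        Accept: "application/json",
      },
    },
  };
}

const { url, options } = buildRequest("space", { limit: 25 });
console.log(url);
// → http://your-confluence-instance/rest/api/space?limit=25
```

The returned `url` and `options` can be passed directly to `fetch(url, options)`.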
## Error Handling
The scraper handles several error cases:
- Rate limiting (429) with exponential backoff
- Network errors with retries
- Invalid space keys
- Missing configuration
- File system errors
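The first two cases can be sketched as a retry loop around native `fetch`. This is an illustrative sketch rather than the project's actual implementation: `fetchWithBackoff`, `backoffDelay`, `maxRetries`, `baseDelayMs`, and the injectable `fetchImpl` are hypothetical names, and the default values are arbitrary.

```javascript
// Resolve after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Exponential backoff: the delay doubles with each attempt (base, 2x, 4x, ...).
function backoffDelay(attempt, baseDelayMs = 500) {
  return baseDelayMs * 2 ** attempt;
}

// Hypothetical helper: retries on HTTP 429 and on network errors.
// `fetchImpl` is injectable for testing and defaults to the global fetch (Node 18+).
async function fetchWithBackoff(url, options = {}, config = {}) {
  const { maxRetries = 5, baseDelayMs = 500, fetchImpl = fetch } = config;
  for (let attempt = 0; ; attempt++) {
    let response;
    try {
      response = await fetchImpl(url, options);
    } catch (error) {
      // Network failure: retry until attempts run out, then rethrow.
      if (attempt >= maxRetries) throw error;
      await sleep(backoffDelay(attempt, baseDelayMs));
      continue;
    }
    // Any status other than 429 is returned to the caller unchanged.
    if (response.status !== 429) return response;
    if (attempt >= maxRetries) {
      throw new Error(`Rate limited after ${maxRetries} retries: ${url}`);
    }
    await sleep(backoffDelay(attempt, baseDelayMs));
  }
}
```

A production version might also honor a `Retry-After` response header and add random jitter to the delay; both are omitted here to keep the sketch short.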
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## License
ISC
## Acknowledgments
- [markdown-it](https://github.com/markdown-it/markdown-it) for Markdown conversion
- [jsdom](https://github.com/jsdom/jsdom) for HTML parsing