https://github.com/Idee8/codecrawl

🌊 Turn entire codebases into LLM-ready data. Extract data, search, and llms.txt from any repo with a single API.

ai embeddings llm rag

Host: GitHub
URL: https://github.com/Idee8/codecrawl
Owner: Idee8
License: agpl-3.0
Created: 2025-04-02T18:33:46.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-04-30T08:54:03.000Z (8 months ago)
Last Synced: 2025-04-30T09:53:40.682Z (8 months ago)
Topics: ai, embeddings, llm, rag
Language: TypeScript
Homepage:
Size: 1.56 MB
Stars: 47
Watchers: 1
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

README

# Codecrawl

Empower your AI apps with clean data from any repository, with codebase file trees, semantic search, `llms.txt` generation, and data extraction.

## What is Codecrawl?

[Codecrawl](https://crawl.irere.dev?ref=github) is an API service that takes a repository URL, crawls it, converts it into clean markdown or structured data, generates embeddings, and stores them in a vector database. We currently support only public codebases hosted on code hosts such as GitHub and GitLab.

## How to use it?

We will provide an easy-to-use API with our hosted version. You can find the playground and documentation [here](https://crawl.irere.dev/playground). You can also self-host the backend if you'd like.

Check out the following resources to get started:
- [x] **API**: [Documentation](#cooming-soon)
- [x] **SDKs**: [Node](https://github.com/Idee8/codecrawl/blob/main/packages/sdk)
- [ ] Want an SDK or Integration? Let us know by opening an issue.

To run locally, refer to the guide [here](https://github.com/Idee8/codecrawl/blob/main/CONTRIBUTING.md).

### API Key

To use the API, you need to sign up on [Codecrawl](https://crawl.irere.dev) and get an API key.

### Features

- [**File Structure**](#filetree): Get repository file structure to feed to LLMs.
- [**LLMs.txt**](#llms.txt): Generate an `llms.txt` file to feed directly to any model
- [**Extract**](#indexing): Get structured data from a single repository or multiple repositories with the help of AI
- [**Search**](#search): Search repository content with semantic understanding
- [**Batch**](#batch-indexing-multiple-urls): Process multiple repositories simultaneously

### Powerful Capabilities
- **Multiple Output Formats**: Convert repository content to markdown, XML, or plain text
- **Structured Data**: Extract metadata like file stats, token counts, and repository info
- **Advanced Search**: Find relevant files and content with semantic search
- **Repository Analytics**: Get insights on file sizes, token counts and top files
- **Scalable Processing**: Handle large codebases with configurable limits and batch operations
- **Clean Data**: Remove comments, empty lines and get compressed output as needed

### LLMs.txt

Generate an `llms.txt` file for a repository, optimized for feeding into language model training. This endpoint starts a job that creates the `llms.txt` file and returns a job ID you can use to track its progress.

```bash
curl -X POST https://api.irere.dev/v1/llmstxt \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://github.com/irere123/run-lang"
}'
```

Returns a job ID to check the status of the `llms.txt` generation.

```json
{
"success": true,
"id": "123-456-789"
}
```

### Check LLMs.txt Job

Check the status and retrieve the content of a `llms.txt` generation job using the job ID.

```bash
curl -X GET https://api.irere.dev/v1/llmstxt/123-456-789 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'
```

Returns the status and data of the `llms.txt` generation job.

```json
{
"success": true,
"status": "completed",
"data": {
"llmstxt": "Content of the llms.txt file..."
}
}
```

### Generate FileTree

Get the file tree of an entire repository from its URL. This returns a plain tree of the given repository.

```bash cURL
curl -X POST https://api.irere.dev/v1/tree \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY' \
-d '{
"url": "https://github.com/irere123/run-lang"
}'
```

### Check FileTree Job

Check the status and retrieve the content of a `/v1/tree` generation job using the job ID.

```bash
curl -X GET https://api.irere.dev/v1/tree/123-456-789 \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer YOUR_API_KEY'
```

Returns the status and data of the `/v1/tree` generation job.

```json
{
"success": true,
"status": "completed",
"data": {
"tree": "Repository filetree..."
}
}
```

## Contributing

We love contributions! Please read our [contributing guide](CONTRIBUTING.md) before submitting a pull request. If you'd like to self-host, refer to the [self-hosting guide](SELF_HOST.md).

_It is the sole responsibility of end users to respect repositories' policies when indexing, searching, and crawling with Codecrawl. Users are advised to adhere to the applicable privacy policies and licenses before initiating any indexing activities. By default, Codecrawl respects the directives specified in a repository's `.gitignore` files when indexing. By using Codecrawl, you expressly agree to comply with these conditions._

## Credits

Built with inspiration from [Firecrawl](https://github.com/mendableai/firecrawl). Special thanks to their contributors for pioneering web crawling.
