{"id":28965958,"url":"https://github.com/devflowinc/firecrawl-to-trieve","last_synced_at":"2025-06-24T07:10:42.254Z","repository":{"id":253867142,"uuid":"844769879","full_name":"devflowinc/firecrawl-to-trieve","owner":"devflowinc","description":"Demonstration of a Firecrawl-to-Trieve crawling-to-search pipeline.","archived":false,"fork":false,"pushed_at":"2024-09-11T09:47:08.000Z","size":1113,"stargazers_count":18,"open_issues_count":3,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-22T05:17:05.333Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/devflowinc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-19T23:52:22.000Z","updated_at":"2025-03-23T08:07:31.000Z","dependencies_parsed_at":"2024-08-23T01:22:13.948Z","dependency_job_id":null,"html_url":"https://github.com/devflowinc/firecrawl-to-trieve","commit_stats":null,"previous_names":["devflowinc/firecrawl-to-trieve"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/devflowinc/firecrawl-to-trieve","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devflowinc%2Ffirecrawl-to-trieve","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devflowinc%2Ffirecrawl-to-trieve/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devflowinc%2Ffirecrawl-to-trieve/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devflowinc%2Ffirecr
awl-to-trieve/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/devflowinc","download_url":"https://codeload.github.com/devflowinc/firecrawl-to-trieve/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/devflowinc%2Ffirecrawl-to-trieve/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261624969,"owners_count":23186121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-24T07:10:41.567Z","updated_at":"2025-06-24T07:10:42.233Z","avatar_url":"https://github.com/devflowinc.png","language":"JavaScript","readme":"# firecrawl-to-trieve\nDemonstration of a Firecrawl-to-Trieve crawling-to-search pipeline.\n\nHere is the general approach:\n\n- get the results from Firecrawl\n- transform the results into chunks\n- load the chunks into Trieve\n- tentative: suggestions.py to pull suggested queries from Trieve—via [/chunk/suggestions](https://docs.trieve.ai/api-reference/chunk/generate-suggested-queries)—and explore the retrieval results and the data (not discussed in the blog)\n\n## Setup\n\n- Set up your environment variables:\n\n- Firecrawl API key\n- Trieve API key and dataset ID\n\n``` \ncp .env.dist .env\n```\n\n### Python\n\n\n- Set up your virtual environment\n\n```\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\n- Install requirements\n\n```\npip install -r requirements.txt\n```\n\n- Freeze requirements\n\n```\npip freeze \u003e requirements.txt\n```\n\n### Node\n\n- Install dependencies\n\n```\nyarn install\n```\n\n## Running the scripts\n\n### Firecrawl\n\n- requires: `FIRECRAWL_API_KEY` in `.env`\n\nUse Firecrawl to get the results of a crawl on the `crawl_url`, here: `https://signoz.io/docs/`.\n\nPython in `python/`\n```bash\npython run_firecrawl.py\n```\n\nNode in `node/`\n\n```bash\nyarn crawl\n```\n\nThis writes a JSON file (with a timestamp in the name) containing the crawl results as a list. The key fields are the markdown itself and various metadata fields, including `ogUrl`, `ogTitle`, `description`, `pageStatusCode`, etc.\n\nExample filename: `crawl_results_2024-08-20T16-35-59.json`\n\nSee the example: `example_crawl_results_2024-08-20T16-35-59.json`\n\n### Transform: Cleaning, Chunking, and Configuring\n\nSee cleaning scripts: `python/cleaners.py` and `node/cleaners.js`\n\nRun the transform scripts:\n\nIn `python/`\n\n```bash\npython transform.py\n```\n\nOr in `node/`\n\n```bash\nyarn transform\n```\n\nWarning: While exploring the data to determine the chunking approach, we noted a button click that toggles between contexts, so half of the content for the page is not in the markdown. We will just flag this for now and see whether the issue appears elsewhere.\n\n### Loading\n\nRun the load script with `-c` to create chunks or `-u` to upsert chunks (update by `tracking_id`, e.g. if you want to add chunks with a different split or revise your cleaning approach).\n\nIn `python/`\n\n```bash\npython load.py [-c | -u]\n```\n\nIn `node/`\n```bash\nyarn load [-c | -u]\n```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevflowinc%2Ffirecrawl-to-trieve","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdevflowinc%2Ffirecrawl-to-trieve","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdevflowinc%2Ffirecrawl-to-trieve/lists"}