{"id":19030162,"url":"https://github.com/statico/aspen","last_synced_at":"2025-04-23T16:01:39.894Z","repository":{"id":19594546,"uuid":"22845049","full_name":"statico/aspen","owner":"statico","description":"🔎 📖 ✨ Custom, private search engine for text documents built with NextJS/React/ES6/ES7","archived":false,"fork":false,"pushed_at":"2025-03-20T23:58:40.000Z","size":1866,"stargazers_count":31,"open_issues_count":1,"forks_count":5,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-18T01:47:32.905Z","etag":null,"topics":["corpus","docker","elasticsearch","es6","es7","javascript","nextjs","plaintext","plaintext-documents","search","search-engine"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/statico.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-08-11T15:57:02.000Z","updated_at":"2025-03-20T23:58:44.000Z","dependencies_parsed_at":"2025-04-20T06:46:40.280Z","dependency_job_id":null,"html_url":"https://github.com/statico/aspen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statico%2Faspen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statico%2Faspen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statico%2Faspen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/statico%2Faspen/manifests","owner_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/owners/statico","download_url":"https://codeload.github.com/statico/aspen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250468272,"owners_count":21435451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["corpus","docker","elasticsearch","es6","es7","javascript","nextjs","plaintext","plaintext-documents","search","search-engine"],"created_at":"2024-11-08T21:16:45.568Z","updated_at":"2025-04-23T16:01:39.874Z","avatar_url":"https://github.com/statico.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Aspen\n\nAspen lets you search a large corpus of plain text files via the browser.\n\n[![license](https://img.shields.io/github/license/statico/aspen.svg?style=flat)](https://github.com/statico/aspen/blob/master/LICENSE)\n[![build](https://github.com/statico/aspen/actions/workflows/build.yml/badge.svg)](https://github.com/statico/aspen/actions/workflows/build.yml)\n\n[![example](https://imgur.com/30X4t9A.gif)](https://imgur.com/30X4t9A)\n\n- Powerful search query support through [Elasticsearch query string syntax](https://www.elastic.co/guide/en/elasticsearch/reference/1.7/query-dsl-query-string-query.html#query-string-syntax)\n- Performs some basic cleanup of plaintext data and can extract document titles\n- Responsive UI that works on mobile\n- Runs in [Docker](https://ghcr.io/statico/aspen)\n\n## Getting Started using Docker Compose\n\n### 1. 
Collect your documents\n\nPut all your files in one place, like `~/ebooks/`:\n\n```\n$ tree ~/ebooks\n/Users/ian/ebooks\n└── Project\\ Gutenberg/\n    ├── Beowulf.txt\n    ├── Dracula.txt\n    ├── Frankenstein.txt\n```\n\n### 2. Run Aspen \u0026 Elasticsearch\n\n```\n$ docker-compose up -d\nCreating network \"aspen_default\" with the default driver\nCreating elasticsearch ... done\nCreating aspen         ... done\n```\n\n### 3. Convert any non-plaintext (PDFs, MS Word) documents to plaintext\n\nUse the included `convert` utility, which wraps [Apache Tika](https://tika.apache.org), to convert them to plaintext. Pass it a filename relative to your data directory:\n\n```\n$ ls ~/ebooks\nProject Gutenberg Test.docx\n\n$ docker-compose run aspen convert Test.docx\nStarting elasticsearch ... done\nTest.docx doesn't exist, trying /data/Test.docx\nCreating /data/Test.txt...\n...\nOK\n\n$ ls ~/ebooks\nProject Gutenberg Test.docx         Test.txt\n```\n\n### 4. Import content into Elasticsearch\n\nStart by resetting Elasticsearch to make sure everything is working:\n\n```\n$ docker-compose run aspen es-reset\nStarting elasticsearch ... done\nResults from DELETE: { acknowledged: true }\n✓ Done.\n```\n\nNow import all `.txt` documents. The `import` script will try to figure out the title of the document automatically:\n\n```\n$ docker-compose run aspen import\nStarting elasticsearch ... done\n→ Base directory is /app/public/data\n▲ Ignoring non-text path: Test.docx\n→ Test.txt → Test Document\n→ Project Gutenberg/Beowulf.txt → The Project Gutenberg EBook of Beowulf\n→ Project Gutenberg/Dracula.txt → The Project Gutenberg EBook of Dracula, by Bram Stoker\n→ Project Gutenberg/Frankenstein.txt → Project Gutenberg's Frankenstein, by Mary Wollstonecraft (Godwin) Shelley\n✓ Done!\n```\n\nYou can also run `import` with a directory or file name relative to the data directory. 
For example, `import Project\\ Gutenberg` or `import Project\\ Gutenberg/Dracula.txt`.\n\n**Sometimes plaintext documents act strangely.** Maybe `bin/import` can't extract a title or maybe the search highlights are off. The file might have the wrong line endings or one of those annoying [UTF-8 BOM headers](https://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom). Try running [dos2unix](http://dos2unix.sourceforge.net/) on your text files to fix them.\n\n### 5. Done!\n\nGo to http://localhost:3000/ and start searching!\n\n## Development Setup\n\n#### 1. Install dependencies\n\nIt's easiest to use Elasticsearch via [Docker](https://www.docker.com/).\n\nYou can get Node and Yarn via [Homebrew](https://brew.sh/) on Mac, or you can download [Node.js v8.5 or later](https://nodejs.org/en/download/) and `npm install -g yarn` to get Yarn.\n\nFor document conversion (`bin/convert`) you'll want:\n\n1. [OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF) - for turning image-only PDFs into PDFs with embedded text\n1. [Apache Tika](https://tika.apache.org/) - for converting most documents into text, like PDFs with embedded text\n1. [UnRTF](https://www.gnu.org/software/unrtf/) - better at converting RTF than Tika\n1. [Par](http://www.nicemice.net/par/) - for formatting plaintext documents\n\nOn macOS you can `brew install node tika unrtf par`.\n\n#### 2. Clone the repo\n\n```\n$ git clone git@github.com:statico/aspen.git\n$ cd aspen\n$ yarn install\n```\n\n#### 3. Set up Elasticsearch and import your data\n\nSee steps 1-4 in the \"Getting Started using Docker Compose\" section above. In short, get your text files together in one place, set up Elasticsearch, and import them with the `bin/import` command.\n\n#### 4. Start the web app\n\nAspen is built using [Next.js](https://github.com/zeit/next.js/), which is Node + ES6 + Express + React + hot reloading + lots more. 
Simply run:\n\n```\n$ yarn run dev\n```\n\n...and go to http://localhost:3000\n\nIf you are working on `server.js` and want automatic server restarting, do:\n\n```\n$ yarn global add nodemon\n$ nodemon -w server.js -w lib -x yarn -- run dev\n```\n\n## Development Notes\n\n- This started as an Angular 1 + CoffeeScript example. I recently migrated it to use Next.js, ES6 and React. You can view a full diff [here](https://github.com/statico/aspen/compare/4af174d...next).\n- I'm still using Elasticsearch 1.7 because I haven't bothered to learn the newer versions.\n\n## Links\n\n- [Elasticsearch Guide](http://www.elasticsearch.org/guide/)\n- [Elasticsearch 1.7 Reference](https://www.elastic.co/guide/en/elasticsearch/reference/1.7/index.html)\n- [`tree` command](https://www.geeksforgeeks.org/tree-command-unixlinux/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatico%2Faspen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstatico%2Faspen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstatico%2Faspen/lists"}