https://github.com/phyks/bloomysearch
A javascript search engine for static websites.
- Host: GitHub
- URL: https://github.com/phyks/bloomysearch
- Owner: Phyks
- Created: 2014-01-11T00:21:29.000Z (almost 11 years ago)
- Default Branch: master
- Last Pushed: 2014-11-08T17:42:43.000Z (about 10 years ago)
- Last Synced: 2024-06-11T17:59:04.170Z (5 months ago)
- Language: Python
- Size: 328 KB
- Stars: 2
- Watchers: 2
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
BloomJS
====
A JavaScript search engine for static websites.

Have you ever dreamt of having a search engine on your static website? BloomySearch generates a static index when you generate your web pages, and provides a client-side JavaScript script that implements all the search logic: it downloads the index and runs the search query.

To save bandwidth, the index is stored in a binary file, using Bloom filters, instead of a JSON index as Lunr.js does.
For full details about BloomySearch, please refer to this blog post.
## Basic idea
I have a static weblog, generated with [Blogit](https://github.com/phyks/blogit) (caution, this code is ugly) and, as I only want to have HTML files on my server, I needed to find a way to let users search my blog.

An index is generated by a Python script when the pages are generated, and is downloaded on demand by the client when a visitor wants to search the site.
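
To make the idea concrete, here is a minimal Python sketch of a Bloom filter index: one filter per page, and a query that keeps the pages whose filter may contain every query word. This is not the code from `generate_index.py` or `pybloom.py`; the class `TinyBloomFilter`, the helper names, the filter size and the SHA-256-based hashing are all assumptions chosen for illustration, and no stemming is applied.

```python
import hashlib
import re


class TinyBloomFilter:
    """Fixed-size Bloom filter backed by a bytearray (illustrative only)."""

    def __init__(self, size_bits=1024, num_hashes=3):
        # Assumption: sizes picked arbitrarily for this sketch.
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]


def build_index(pages):
    """Return one filter per (title, text) page, in the same order."""
    filters = []
    for _, text in pages:
        bloom = TinyBloomFilter()
        for word in tokenize(text):
            bloom.add(word)
        filters.append(bloom)
    return filters


def search(query, pages, filters):
    """Return titles of pages whose filter may contain every query word."""
    words = tokenize(query)
    return [title for (title, _), bloom in zip(pages, filters)
            if words and all(w in bloom for w in words)]


pages = [
    ("Bloom filters", "A bloom filter is a compact probabilistic set."),
    ("Static sites", "Static websites are served as plain html files."),
]
filters = build_index(pages)
```

With this sketch, `search("bloom filter", pages, filters)` returns a list that includes `"Bloom filters"`. A Bloom filter can report false positives but never false negatives, so a page may occasionally match a word it does not contain; that is the price paid for the compact index.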
## Files
### Index generation (`index_generation/` folder)
* `generate_index.py`: Python script to generate the index (runs only at page generation) in a format convenient for JavaScript
* `pybloom.py`: Library to handle bloom filters in Python
* `stemmer.py`: Implementation of the Porter stemming algorithm in Python, from Vivake Gupta.

### Example html search form
* `index.html`
* `js/bloom.js`: main JS code
* `js/bloomfilters.js`: JS library to use BloomFilters

### Examples
* `samples/`: samples for testing purpose (taken from my blog articles)
## Data storing
One of the main problems was transmitting the binary data from the Python script to the JS script. I found [an article about handling binary data in JavaScript](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest/Sending_and_Receiving_Binary_Data) which helped me a lot.
The output of the Python script is just the array of Bloom filter bitarrays, written as a binary file (`data/search_index`), which I open with JS. The list of articles is also written as JSON to a separate file (`data/pages_index.json`).
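
A minimal Python sketch of this kind of serialization, following the layout given in this section (a 16-bit article count, then for each filter a 16-bit length followed by the bits). The byte order and the unit of the length field are not stated in this README, so little-endian and a length counted in bytes are assumptions here, and `pack_index` / `unpack_index` are hypothetical names rather than functions from `generate_index.py`.

```python
import struct


def pack_index(bitarrays):
    """Serialize a list of bytes objects: count, then (length, data) pairs."""
    # Assumption: little-endian unsigned 16-bit integers ("<H").
    out = [struct.pack("<H", len(bitarrays))]
    for bits in bitarrays:
        # Assumption: the length field counts bytes, not bits.
        out.append(struct.pack("<H", len(bits)))
        out.append(bytes(bits))
    return b"".join(out)


def unpack_index(blob):
    """Inverse of pack_index: return the list of bitarrays as bytes."""
    (count,) = struct.unpack_from("<H", blob, 0)
    offset = 2
    bitarrays = []
    for _ in range(count):
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        bitarrays.append(blob[offset:offset + length])
        offset += length
    return bitarrays


example_filters = [b"\x01\x02\x03", b"\xff"]
blob = pack_index(example_filters)
```

Here `unpack_index(blob)` gives back `example_filters`. A 16-bit field caps both the number of articles and each length at 65535; `struct.pack` raises `struct.error` beyond that, so a larger site would need wider fields.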
Here's the format of the output from the Python script:
* [16 bits] : number of articles (== number of bitarrays)
* for each bitarray:
    * [16 bits] : length of the bitarray
    * […] : the bitarray itself

## Notes
* I got the idea while reading [this page](http://www.stavros.io/posts/bloom-filter-search-engine/?print) found on [Sebsauvage's shaarli](http://sebsauvage.net/links/). I searched a bit for code doing what I wanted and found these:
    * https://github.com/olivernn/lunr.js
    * https://github.com/reyesr/fullproof

    But I wasn't fully satisfied by the first one, and I found the second one too heavy and complicated for my purpose, so I ended up coding this.
* This code is mainly a proof of concept. As such, it is not fully optimized (actually, I just tweaked it until the resulting files and computation times could be considered "acceptable"). For those looking for more effective solutions, here are a few things I found while looking for information on the web:
    * The stemming algorithm used may not be the most efficient one. People wanting to work with non-English languages or to optimize the overall computation of the index can easily move to a more effective algorithm. See [Wikipedia](http://en.wikipedia.org/wiki/Stemming) and [the stemming library in Python](https://pypi.python.org/pypi/stemming/1.0), which has C wrappers for best performance.
## License
TL;DR: I don't give a damn about anything you do with this code. It would just
be nice to mention where the original code comes from. All the included libraries
(pybloom and the stemming library) have their own license.

* -----------------------------------------------------------------------------
* "THE NO-ALCOHOL BEER-WARE LICENSE" (Revision 42):
* Phyks ([email protected]) wrote this file. As long as you retain this notice
* you can do whatever you want with this stuff (and you can also do whatever
* you want with this stuff without retaining it, but that's not cool...). If we
* meet some day, and you think this stuff is worth it, you can buy me a
* ~~beer~~ soda in return.
* Phyks
* ------------------------------------------------------------------------------