https://github.com/typesense/typesense-docsearch-scraper
A fork of Algolia's awesome DocSearch Scraper, customized to index data in Typesense (an open source alternative to Algolia)
- Host: GitHub
- URL: https://github.com/typesense/typesense-docsearch-scraper
- Owner: typesense
- License: other
- Created: 2021-01-05T20:44:09.000Z (almost 4 years ago)
- Default Branch: master
- Last Pushed: 2024-12-09T20:30:52.000Z (13 days ago)
- Last Synced: 2024-12-15T15:11:22.219Z (7 days ago)
- Topics: algolia, docsearch, documentation, search, typesense
- Language: Python
- Homepage: https://typesense.org/docs/guide/docsearch.html
- Size: 893 KB
- Stars: 108
- Watchers: 5
- Forks: 39
- Open Issues: 19
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Typesense DocSearch scraper
This is a maintained fork of Algolia's awesome [DocSearch Scraper](https://github.com/algolia/docsearch-scraper), customized to index data in [Typesense](https://typesense.org).
You'd typically set up this scraper to run on your documentation site, and then use [typesense-docsearch.js](https://github.com/typesense/typesense-docsearch.js) to add a search bar to your site.
#### What is Typesense?
If you're new to Typesense, it is an **open source** search engine that is simple to use, run and scale, with clean APIs and documentation.
Think of it as an open source alternative to Algolia and an easier-to-use, batteries-included alternative to Elasticsearch. Get a quick overview from [this guide](https://typesense.org/guide/).
## Usage
Read the detailed step-by-step instructions on how to configure and set up the scraper on Typesense's dedicated documentation site: https://typesense.org/docs/guide/docsearch.html
## Changelog
We use git tags to identify every release.
So to view the changelog for a release, you can compare tags using a GitHub link like this:
[https://github.com/typesense/typesense-docsearch-scraper/compare/0.8.0...0.9.0](https://github.com/typesense/typesense-docsearch-scraper/compare/0.8.0...0.9.0).
Remember to change the version numbers in the URL as needed.
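As a small convenience, the compare link can be generated from two tag names. This is a minimal sketch: `compare_url` is a hypothetical helper name and is not part of the scraper; only the URL pattern comes from the link above.

```bash
# Print the GitHub compare URL for two release tags of this repository.
# compare_url is a hypothetical helper, not something the scraper ships.
compare_url() {
  if [ "$#" -ne 2 ]; then
    echo "usage: compare_url <from-tag> <to-tag>" >&2
    return 1
  fi
  printf 'https://github.com/typesense/typesense-docsearch-scraper/compare/%s...%s\n' "$1" "$2"
}

# Example: changes between the 0.8.0 and 0.9.0 tags.
compare_url 0.8.0 0.9.0
```

Swap in whichever two tags you want to compare; the helper does not check that the tags exist.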
## Compatibility
| typesense-docsearch-scraper | typesense-server |
| --- | --- |
| 0.5.0 | >= 0.22.1 |
| 0.4.x and below | >= 0.21.0 |

## Development Workflow
This section only applies if you're making changes to this scraper itself. If you only need to run the scraper, see Usage instructions above.
#### Releasing a new version
Basic/abbreviated instructions:
```shellsession
$ pipenv shell
$ ./docsearch docker:build
$ git tag -a 0.2.1 -m "0.2.1"
$ ./docsearch deploy:scraper
$ git push --follow-tags
```
Detailed instructions starting from a fresh Ubuntu Server 22.04:
```bash
# Install Docker:
# https://docs.docker.com/engine/install/ubuntu/
sudo apt update
sudo apt remove docker docker-engine docker.io containerd runc --yes
sudo apt install \
ca-certificates \
curl \
gnupg \
lsb-release \
--yes
sudo mkdir -m 0755 -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install \
docker-ce \
docker-ce-cli \
containerd.io \
docker-buildx-plugin \
docker-compose-plugin \
--yes
sudo docker run hello-world

# Run Docker as a non-root user:
# https://www.digitalocean.com/community/questions/how-to-fix-docker-got-permission-denied-while-trying-to-connect-to-the-docker-daemon-socket
sudo usermod -aG docker ${USER}
exit
# (Relogin.)
docker run hello-world

# Install dependencies for pyenv:
# https://github.com/pyenv/pyenv/wiki#suggested-build-environment
sudo apt update
sudo apt install \
build-essential \
curl \
libbz2-dev \
libffi-dev \
liblzma-dev \
libncursesw5-dev \
libreadline-dev \
libsqlite3-dev \
libssl-dev \
libxml2-dev \
libxmlsec1-dev \
llvm \
make \
tk-dev \
wget \
xz-utils \
zlib1g-dev \
--yes

# Install pyenv:
# https://github.com/pyenv/pyenv#automatic-installer
curl https://pyenv.run | bash

# Add pyenv to path:
echo >> ~/.bashrc
echo '# Adding pyenv' >> ~/.bashrc
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
echo 'eval "$(pyenv init -)"' >> ~/.bashrc
source ~/.bashrc

# Install Python 3.11 inside pyenv:
pyenv install 3.11

# Set the active version of Python:
pyenv local 3.11

# Upgrade pip:
pip install --upgrade pip

# Install pipenv:
pip install --user pipenv

# There will be a warning:
# "The script virtualenv-clone is installed in '/home/[username]/.local/bin' which is not on PATH."
# Fix the warning by adding it to the PATH:
echo >> ~/.bashrc
echo '# Fixing pip warning' >> ~/.bashrc
echo 'PATH=$PATH:~/.local/bin' >> ~/.bashrc
source ~/.bashrc

# Ensure that you are in the "typesense-docsearch-scraper" directory.
# Then, install the Python dependencies for this project:
pipenv --python 3.11
pipenv lock --clear
pipenv install

# Then, open a shell with the Python environment:
pipenv shell

# Enable containerd image store in Docker Engine: https://docs.docker.com/engine/storage/containerd/
# This allows building the cross-platform images below.
# Add the following to /etc/docker/daemon.json:
# {
#   "features": {
#     "containerd-snapshotter": true
#   }
# }
# Then restart Docker:
# sudo systemctl restart docker

# The following should say containerd; if not, follow the instructions above:
docker info -f '{{ .DriverStatus }}'

# Build a new version of the base Docker container - ONLY NEEDED WHEN WE CHANGE DEPENDENCIES
export SCRAPER_BASE_VERSION="0.9.0" # Only need to change this when we update dependencies
docker buildx use typesense-builder || docker buildx create --name typesense-builder --driver docker-container --use --bootstrap # use same buildx context for all containers to build
docker buildx build --platform linux/amd64,linux/arm64 --load -f ./scraper/dev/docker/Dockerfile.base -t typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION} .
docker push typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION}
docker tag typesense/docsearch-scraper-base:${SCRAPER_BASE_VERSION} typesense/docsearch-scraper-base:latest
docker push typesense/docsearch-scraper-base:latest

# Build a new version of the scraper Docker container
export SCRAPER_VERSION="0.11.0.rc1"
export SCRAPER_BASE_VERSION="latest"
docker buildx use typesense-builder || docker buildx create --name typesense-builder --driver docker-container --use --bootstrap # use same buildx context for all containers to build
docker buildx build --platform linux/amd64,linux/arm64 --load -f ./scraper/dev/docker/Dockerfile --build-arg SCRAPER_BASE_VERSION=${SCRAPER_BASE_VERSION} -t typesense/docsearch-scraper:${SCRAPER_VERSION} .
docker push typesense/docsearch-scraper:${SCRAPER_VERSION}
docker tag typesense/docsearch-scraper:${SCRAPER_VERSION} typesense/docsearch-scraper:latest
docker push typesense/docsearch-scraper:latest

# Add a new Git tag.
git tag -a "${SCRAPER_VERSION}" -m "${SCRAPER_VERSION}"

# Sync with GitHub.
git push --follow-tags
```
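Before running the release steps above, it can help to confirm that the tools they rely on are installed. The following is a minimal sketch, not part of the scraper's own tooling: `require_cmds` is a hypothetical helper name, and the list of tools (`docker`, `git`, `pipenv`) is taken from the commands used in the instructions.

```bash
# Check that every command named in the arguments is available on PATH.
# require_cmds is a hypothetical helper, not part of this repository.
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  # Return 0 only if every command was found.
  return "$missing"
}

# Tools used by the release steps above.
if require_cmds docker git pipenv; then
  echo "all release tools found"
fi
```

The check only confirms that each command exists; it does not verify versions or that the Docker daemon is running.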
## Help
If you have any questions or run into any problems, please create a GitHub issue and we'll try our best to help.