Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/technologiestiftung/odis-geoexplorer-data

All data-related scripts for Geoexplorer - an AI-driven search application for Berlin's geo data.

embeddings geospatial scraper supabase

![](https://img.shields.io/badge/Built%20with%20%E2%9D%A4%EF%B8%8F-at%20Technologiestiftung%20Berlin-blue)

[![All Contributors](https://img.shields.io/badge/all_contributors-2-orange.svg?style=flat-square)](#contributors-)

# Geoexplorer Data

This repository includes all the data-related logic for the [GeoExplorer](https://github.com/technologiestiftung/odis-geoexplorer) - an AI-driven search application for Berlin's geo data. It contains:

- A Node.js **scraper** script to collect the data. [🔗](#scraper)
- A script to create **embeddings** and write them to a DB using the OpenAI and Supabase APIs. [🔗](#embeddings)
- A Jupyter notebook to **analyze and export the embeddings**. [🔗](#notebook)

### Scraper

The scraper (located in the [scraper folder](./scraper/)) collects all WFS- and WMS-related metadata from [Berlin's Open Data Portal](https://daten.berlin.de/) and [Berlin's Geo Data Portal (FisBroker)](https://fbinter.stadt-berlin.de/fb/) and writes a markdown file (.mdx) for each dataset. The scraper runs in multiple steps, which you can control in [index.js](./scraper/index.js) by (un)commenting them (see the sketch below).
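To give a sense of the structure, the step calls in `index.js` might look roughly like this (a sketch only; the actual function names and file layout in the repository may differ):

```js
// scraper/index.js -- illustrative sketch; the real step names may differ.
// Each step is its own function; (un)comment calls to control which steps run.
import { fetchDatasetList } from "./steps/fetchDatasetList.js";
import { fetchServiceMetadata } from "./steps/fetchServiceMetadata.js";
import { writeMdxFiles } from "./steps/writeMdxFiles.js";

async function main() {
  const datasets = await fetchDatasetList("https://daten.berlin.de/");
  const metadata = await fetchServiceMetadata(datasets); // WFS/WMS capabilities, e.g. from FisBroker
  await writeMdxFiles(metadata, "./data"); // one .mdx file per dataset
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```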

Before running the scraper, you will need to install **npm** and then the dependencies:

```bash
npm i
```

Run the scraper like so:

```bash
npm run scrape
```

Or if you want to update the data:

```bash
npm run scrape:update
```

### Setting up a Supabase DB and creating embeddings

**1. Set up a local Supabase DB** (optional)

The initialization of the database, including the setup of the `pgvector` extension, is stored in the [`supabase/migrations` folder](./supabase/migrations/) and is automatically applied to your local Postgres instance when you run `npx supabase start`.

Make sure you have **Docker** installed and running locally. Then run

```bash
npx supabase start
```

This will set up a local Supabase DB for you.

**2. Provide connection details**

Duplicate the `.env.example` file and rename it to `.env`. Then provide either your local connection details or those from Supabase, depending on where you want to save your data.

- To retrieve your local `NEXT_PUBLIC_SUPABASE_ANON_KEY` and `SUPABASE_SERVICE_ROLE_KEY` run:

```bash
npx supabase status
```

You will also need to provide an **OpenAI API** key.
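Conceptually, the scripts build their API clients from these environment variables. A minimal sketch (the `SUPABASE_URL` variable name and the file name are assumptions; check `.env.example` for the exact names used in this repository):

```js
// clients.js -- hypothetical helper, shown for illustration only.
import "dotenv/config";
import { createClient } from "@supabase/supabase-js";
import OpenAI from "openai";

// Server-side writes use the service role key, not the anon key.
export const supabase = createClient(
  process.env.SUPABASE_URL, // assumed variable name; see .env.example
  process.env.SUPABASE_SERVICE_ROLE_KEY
);

export const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
```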

**3. Generate embeddings**

This script requests an embedding for each markdown file created earlier. The embedding will then be written to your Supabase DB. To run the script:

```bash
npm run embeddings
```

> Note: Make sure Supabase is running. To check, run `npx supabase status`. If it is not running, run `npx supabase start`.
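In outline, the embeddings script might do something like the following (a sketch, not the repository's exact code; the table and column names are assumptions based on the _nods_page_section_rows_ export mentioned below, and the clients come from the sketch above):

```js
import fs from "node:fs/promises";
import path from "node:path";
import { openai, supabase } from "./clients.js"; // hypothetical helper from above

// Embed every scraped .mdx file and store the vector in Supabase.
const dir = "./data"; // assumed output folder of the scraper
for (const file of await fs.readdir(dir)) {
  if (!file.endsWith(".mdx")) continue;
  const content = await fs.readFile(path.join(dir, file), "utf8");

  // One embedding per markdown file via the OpenAI embeddings endpoint.
  const res = await openai.embeddings.create({
    model: "text-embedding-ada-002",
    input: content.replaceAll("\n", " "),
  });

  // Write the vector into the pgvector column (table/column names assumed).
  const { error } = await supabase
    .from("nods_page_section")
    .insert({ content, embedding: res.data[0].embedding });
  if (error) console.error(`Failed to store ${file}:`, error.message);
}
```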

**4. Link your local development project to a hosted Supabase project** (optional)

You can do this like so (your data will not be uploaded):

```bash
npx supabase login
npx supabase link --project-ref YOUR_PROJECT_REF
npx supabase db push
```

Here `YOUR_PROJECT_REF` is the reference ID of your hosted Supabase project (visible in its dashboard URL); `supabase link` will prompt you for the database password.

### Running the Jupyter notebook to analyze and export the embeddings

Open the graphical interface of your Supabase DB (e.g., http://localhost:54323/project/default/editor) and export the rows of the _nods_page_section_ table as a .csv file. Save the file in the [createGraph](/createGraph/) folder. Then install **Jupyter Notebook** via pip if you haven't installed it yet:

```bash
pip install notebook
```

Run the notebook like so:

```bash
npm run embedgraph
```

This will open a new window in your browser.

> You can also access the notebook directly via [http://localhost:8888/notebooks/embeds.ipynb](http://localhost:8888/notebooks/embeds.ipynb).

Run the notebook. It will show you a scatterplot of the embedding vectors reduced to two dimensions.

At the bottom of the notebook, you will find a link called _tsne_data.csv_. This lets you download the 2D coordinates together with the dataset titles. The data is used to update the scatterplot displayed in the GeoExplorer.
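For illustration, the exported coordinates could then be read like this (a sketch; the column layout of _tsne_data.csv_ is an assumption, so inspect its header first):

```js
import fs from "node:fs";

// Naive CSV parse into { x, y, title } points for the scatterplot.
// Use a proper CSV parser if titles can contain commas.
const points = fs
  .readFileSync("./createGraph/tsne_data.csv", "utf8")
  .trim()
  .split("\n")
  .slice(1) // skip the header row
  .map((line) => {
    const [x, y, title] = line.split(",");
    return { x: Number(x), y: Number(y), title };
  });

console.log(`${points.length} points, e.g.`, points[0]);
```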

> The Notebook script is based on [OpenAI guides](https://platform.openai.com/docs/guides/embeddings/use-cases).

## Contributing

Before you create a pull request, please open an issue so we can discuss your changes.

## Contributors

Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):

- **Hans Hack**: 💻 🖋 🔣 📖 📆
- **alsino**: 💻

This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification. Contributions of any kind welcome!

## Content Licensing

Texts and content available as [CC BY](https://creativecommons.org/licenses/by/3.0/de/).

## Credits

Made by:

Together with:

A project by:

Supported by: