{"id":21586522,"url":"https://github.com/vida-nyu/auctus","last_synced_at":"2025-04-10T20:21:08.704Z","repository":{"id":40290983,"uuid":"227004642","full_name":"VIDA-NYU/auctus","owner":"VIDA-NYU","description":"Dataset search engine, discovering data from a variety of sources, profiling it, and allowing advanced queries on the index","archived":false,"fork":false,"pushed_at":"2023-11-12T04:58:09.000Z","size":10866,"stargazers_count":43,"open_issues_count":14,"forks_count":9,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-03-24T17:55:23.082Z","etag":null,"topics":["crawling","data-profiling","dataset","dataset-search","index","search","search-engine"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VIDA-NYU.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-10T01:44:38.000Z","updated_at":"2025-03-17T08:05:02.000Z","dependencies_parsed_at":"2022-08-09T16:24:16.894Z","dependency_job_id":null,"html_url":"https://github.com/VIDA-NYU/auctus","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fauctus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fauctus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fauctus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VIDA-NYU%2Fauctus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VIDA-NYU","download_url
":"https://codeload.github.com/VIDA-NYU/auctus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248289868,"owners_count":21078922,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","data-profiling","dataset","dataset-search","index","search","search-engine"],"created_at":"2024-11-24T15:13:57.003Z","updated_at":"2025-04-10T20:21:08.685Z","avatar_url":"https://github.com/VIDA-NYU.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Auctus\n======\n\nThis project is a web crawler and search engine for datasets, specifically meant for data augmentation tasks in machine learning. It is able to find datasets in different repositories and index them for later retrieval.\n\n[Documentation is available here](https://docs.auctus.vida-nyu.org/)\n\nIt is divided in multiple components:\n\n* Libraries\n  * [Geospatial database](https://gitlab.com/ViDA-NYU/auctus/datamart-geo) `datamart_geo`. This contains data about administrative areas extracted from Wikidata and OpenStreetMap. It lives in its own repository and is used here as a submodule.\n  * [Profiling library](lib_profiler/) `datamart_profiler`. This can be installed by clients, will allow the client library to profile datasets locally instead of sending them to the server. It is also used by the apiserver and profiler services.\n  * [Materialization library](lib_materialize/) `datamart_materialize`. This is used to materialize dataset from the various sources that Auctus supports. 
It can be installed by clients, which will allow them to materialize datasets locally instead of using the server as a proxy.\n  * [Data augmentation library](lib_augmentation/) `datamart_augmentation`. This performs the join or union of two datasets and is used by the apiserver service, but could conceivably be used stand-alone.\n  * [Core server library](lib_core/) `datamart_core`. This contains common code for services and is only used by the server components. The [filesystem locking](lib_fslock/) code is separate as `datamart_fslock` for performance reasons (it has to import fast).\n* Services\n  * [**Discovery services**](discovery/): these are responsible for discovering datasets. Each plugin can talk to a specific repository. *Materialization metadata* is recorded for each dataset, to allow future retrieval of that dataset.\n  * [**Profiler**](profiler/): this service downloads a discovered dataset and computes additional metadata that can be used for search (for example, dimensions, semantic types, value distributions). Uses the profiling and materialization libraries.\n  * **Lazo Server**: this service is responsible for indexing textual and categorical attributes using [Lazo](https://github.com/mitdbg/lazo). The code for the server and client is available [here](https://gitlab.com/ViDA-NYU/auctus/lazo-index-service).\n  * [**apiserver**](apiserver/): this service responds to requests from clients to search for datasets in the index (triggering on-demand queries by discovery services that support it), upload new datasets, profile datasets, or perform augmentation. Uses the profiling and materialization libraries. 
Implements a JSON API using the Tornado web framework.\n  * [The **cache-cleaner**](cache_cleaner/): this service makes sure the dataset cache stays under a given size limit by removing least-recently-used datasets when the configured size is reached.\n  * [The **coordinator**](coordinator/): this service collects some metrics and offers a maintenance interface for the system administrator.\n  * [The **frontend**](frontend/): this is a React app implementing a user-friendly web interface on top of the API.\n\n![Auctus Architecture](docs/architecture.png)\n\nElasticsearch is used as the search index, storing one document per known dataset.\n\nThe services exchange messages through `RabbitMQ`, which gives us queueing and retrying semantics and supports complex messaging patterns such as on-demand querying.\n\n![AMQP Overview](docs/amqp.png)\n\nDeployment\n==========\n\nThe system is currently running at https://auctus.vida-nyu.org/. You can see the system status at https://grafana.auctus.vida-nyu.org/.\n\nLocal deployment / development setup\n====================================\n\nTo deploy the system locally using docker-compose, follow these steps:\n\nSet up environment\n------------------\n\nMake sure you have checked out the submodule with `git submodule init \u0026\u0026 git submodule update`\n\nMake sure you have [Git LFS](https://git-lfs.github.com/) installed and configured (`git lfs install`)\n\nCopy env.default to .env and update the variables there. You might want to update the password for a production deployment.\n\nMake sure your node is set up for running Elasticsearch. You will probably have to [raise the mmap limit](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/vm-max-map-count.html).\n\nThe `API_URL` is the URL at which the apiserver containers will be visible to clients. In a production deployment, this is probably a public-facing HTTPS URL. 
It can be the same URL that the \"coordinator\" component will be served at if using a reverse proxy (see [nginx.conf](nginx.conf)).\n\nTo run scripts locally, you can load the environment variables into your shell by running: `. scripts/load_env.sh` (that's *dot space scripts...*)\n\nPrepare data volumes\n--------------------\n\nRun `scripts/setup.sh` to initialize the data volumes. This will set the correct permissions on the `volumes/` subdirectories.\n\nShould you ever want to start from scratch, you can delete `volumes/` but make sure to run `scripts/setup.sh` again afterwards to set permissions.\n\nBuild the containers\n--------------------\n\n```\n$ docker-compose build --build-arg version=$(git describe) apiserver\n```\n\nStart the base containers\n-------------------------\n\n```\n$ docker-compose up -d elasticsearch rabbitmq redis minio lazo\n```\n\nThese will take a few seconds to get up and running. Then you can start the other components:\n\n```\n$ docker-compose up -d cache-cleaner coordinator profiler apiserver apilb frontend\n```\n\nYou can use the `--scale` option to start more profiler or apiserver containers, for example:\n\n```\n$ docker-compose up -d --scale profiler=4 --scale apiserver=8 cache-cleaner coordinator profiler apiserver apilb frontend\n```\n\nPorts:\n* The web interface is at http://localhost:8001\n* The API at http://localhost:8002/api/v1 (behind HAProxy)\n* Elasticsearch is at http://localhost:8020\n* The Lazo server is at http://localhost:8030\n* The RabbitMQ management interface is at http://localhost:8010\n* The RabbitMQ metrics are at http://localhost:8012\n* The Minio interface is at http://localhost:8050 (if you use that)\n* The HAProxy statistics are at http://localhost:8004\n* Prometheus is at http://localhost:8040\n* Grafana is at http://localhost:8041\n\nImport a snapshot of our index (optional)\n-----------------------------------------\n\n```\n$ scripts/docker_import_snapshot.sh\n```\n\nThis will download an 
Elasticsearch dump from auctus.vida-nyu.org and import it into your local Elasticsearch container.\n\nStart discovery plugins (optional)\n----------------------------------\n\n```\n$ docker-compose up -d socrata zenodo\n```\n\nStart metric dashboard (optional)\n---------------------------------\n\n```\n$ docker-compose up -d elasticsearch_exporter prometheus grafana\n```\n\nPrometheus is configured to automatically find the containers (see [prometheus.yml](docker/prometheus.yml))\n\nA custom RabbitMQ image is used, with added plugins (management and prometheus).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvida-nyu%2Fauctus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvida-nyu%2Fauctus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvida-nyu%2Fauctus/lists"}