https://github.com/eran-gil/covid19-data-lake
- Host: GitHub
- URL: https://github.com/eran-gil/covid19-data-lake
- Owner: eran-gil
- License: apache-2.0
- Created: 2021-03-01T20:22:56.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2023-02-04T15:57:38.000Z (almost 2 years ago)
- Last Synced: 2023-07-08T20:42:24.326Z (over 1 year ago)
- Language: C#
- Size: 44 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# covid19-data-lake
This project aims to create a data lake on AWS S3 with the following indices:
* File-based content index based on the following [article](https://www.semanticscholar.org/paper/Needle-in-a-haystack-queries-in-cloud-data-lakes-Weintraub-Gudes/1f5b9de16302525ab07e50056f3a8af565fd131a).
* HyperLogLog metadata index to indicate the approximate number of unique values of a metadata field
* Count-Min-Sketch metadata index to indicate the approximate number of occurrences of each value of a metadata field

The data lake is accompanied by an API to do the following:
* Upload a file to the data lake and start the indexing process
* Query for content based on the content-index
* Query for metadata statistics based on the metadata indices

### Project Structure
|Project Name|Purpose|
|---|---|
|CovidDataLake.Cloud|Common code to access the cloud resources of the data lake|
|CovidDataLake.Common|Common code that is shared between all of the services|
|CovidDataLake.ContentIndexer|The engine that indexes the contents of files in the data lake|
|CovidDataLake.MetadataIndexer|The engine that indexes the metadata of files in the data lake|
|CovidDataLake.Pubsub|Common code to publish and subscribe to events in the ETL process|
|CovidDataLake.Queries|The business logic of the queries performed on the data lake|
|CovidDataLake.Storage|Common code to handle usage of local disk storage|
|CovidDataLake.WebAPI|The API for the data lake, includes updates and queries|

### Getting Started Requirements
* [.NET 6.0](https://dotnet.microsoft.com/en-us/download/dotnet/6.0) installed
* [Redis server](https://redis.io/download/) running and configured correctly in all relevant `appsettings.json` files in the following way:
```json
{
  "Redis": "[HOSTNAME]:[PORT],connectTimeout=15000,syncTimeout=15000"
}
```
* [Kafka](https://kafka.apache.org/downloads) cluster running with all instances configured correctly in all relevant `appsettings.json` files in the following way:
```json
{
  "Kafka": {
    "Instances": [
      {
        "Host": "[HOST_NAME]",
        "Port": 9092
      }
    ],
    "Topic": "[TOPIC_NAME]",
    "GroupId": "[CONSUMER_GROUP_ID]" // used only by consuming projects (the indexing engines)
  }
}
```
* Project-specific requirements are listed inside each project's folder
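
As a rough illustration of how the `Kafka` section shown above could be consumed, the sketch below parses it with `System.Text.Json` and joins the instances into the comma-separated `host:port` list that Kafka clients conventionally accept as bootstrap servers. The `KafkaSettingsSketch` class, its `Parse` method, and the sample broker names are hypothetical and not taken from the repository; the actual projects may bind this section differently.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical helper for illustration only; it is not part of the repository.
public static class KafkaSettingsSketch
{
    public static (string BootstrapServers, string Topic, string GroupId) Parse(string json)
    {
        var options = new JsonDocumentOptions
        {
            // Strict JSON rejects // comments, so skip them explicitly.
            CommentHandling = JsonCommentHandling.Skip
        };

        using var document = JsonDocument.Parse(json, options);
        var kafka = document.RootElement.GetProperty("Kafka");

        // Join every configured instance as "host:port".
        var servers = string.Join(",", kafka.GetProperty("Instances")
            .EnumerateArray()
            .Select(i => i.GetProperty("Host").GetString() + ":" + i.GetProperty("Port").GetInt32()));

        var topic = kafka.GetProperty("Topic").GetString() ?? "";

        // GroupId is treated as optional here: only consuming projects need it.
        var groupId = kafka.TryGetProperty("GroupId", out var g) ? (g.GetString() ?? "") : "";

        return (servers, topic, groupId);
    }

    public static void Main()
    {
        // Placeholder values, not real endpoints.
        const string sample = @"{
          ""Kafka"": {
            ""Instances"": [
              { ""Host"": ""broker-1"", ""Port"": 9092 },
              { ""Host"": ""broker-2"", ""Port"": 9093 }
            ],
            ""Topic"": ""files"",
            ""GroupId"": ""content-indexer"" // only read by consuming projects
          }
        }";

        var settings = Parse(sample);
        Console.WriteLine(settings.BootstrapServers); // broker-1:9092,broker-2:9093
        Console.WriteLine(settings.GroupId);          // content-indexer
    }
}
```

Skipping comments during parsing matters because the sample configuration carries a `//` note on `GroupId`, which a strict JSON parser would reject.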