https://github.com/expediadotcom/haystack-tables
This is an EXPERIMENTAL project - not ready for production use.
- Host: GitHub
- URL: https://github.com/expediadotcom/haystack-tables
- Owner: ExpediaDotCom
- Created: 2019-02-19T16:34:29.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2019-03-03T10:49:47.000Z (over 7 years ago)
- Last Synced: 2025-03-04T13:47:21.268Z (over 1 year ago)
- Language: Java
- Homepage:
- Size: 130 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
## Architecture

### Getting Started
Launch the table-allocator Dropwizard app, which exposes endpoints for creating, listing, and deleting views.
By default, the allocator runs the parquet-writers on Kubernetes. If you are using minikube, make sure it is running and that the current k8s context points to it.
##### Create a new view:
```
curl -XPOST -H "Content-Type: application/json" -d '
{
  "view": "oms",
  "select": [
    "tags[errorcode]",
    "operationname"
  ],
  "where": {
    "servicename": "oms"
  }
}' "http://localhost:8080/view"
```
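The same request can be sent from Java. The following is a minimal sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+), not code from this repository. The class and method names (`CreateViewExample`, `viewJson`, `createViewRequest`) are hypothetical, and the JSON is assembled by hand on the assumption that the values are plain identifiers that need no escaping.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of haystack-tables.
final class CreateViewExample {

    // Builds a body in the same shape as the curl example above.
    // Assumption: names and values contain no characters that need JSON escaping.
    static String viewJson(String view, List<String> select, Map<String, String> where) {
        String selectJson = select.stream()
                .map(s -> "\"" + s + "\"")
                .collect(Collectors.joining(", ", "[", "]"));
        String whereJson = where.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
                .collect(Collectors.joining(", ", "{", "}"));
        return "{\"view\": \"" + view + "\", \"select\": " + selectJson
                + ", \"where\": " + whereJson + "}";
    }

    // POST <baseUrl>/view with a JSON content type, as in the curl example.
    static HttpRequest createViewRequest(String baseUrl, String json) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/view"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json))
                .build();
    }

    public static void main(String[] args) throws Exception {
        String json = viewJson("oms",
                List.of("tags[errorcode]", "operationname"),
                Map.of("servicename", "oms"));
        HttpRequest request = createViewRequest("http://localhost:8080", json);
        // Requires the table-allocator to be running locally on port 8080.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

For anything beyond simple identifiers, a JSON library such as Jackson would be a safer way to build the body than string concatenation.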
##### List all views:
```
curl "http://localhost:8080/views"

Response:
[
  {
    "createTimestamp": "2019-03-03T10:17:50.000Z",
    "lastUpdatedTimestamp": "2019-03-03T10:17:50.866Z",
    "query": {
      "view": "oms-test",
      "select": [
        "tags[errorcode]",
        "operationname"
      ],
      "where": {
        "servicename": "oms"
      }
    },
    "running": true
  }
]
```
##### Delete a view:
```
curl -XDELETE "http://localhost:8080/view/oms"
```
### S3 Data
A parquet writer runs independently for each requested view. Each writer puts its Parquet data under a configured bucket name, using the following partitioning strategy:
`s3://bucket-name/views/{view-name}/year=2019/month=02/day=03/hour=12/..`
Each Parquet file is named after the Kafka offset of the last record it contains.
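To make the partition layout concrete, here is a small sketch that derives the hourly prefix from a timestamp. It is not taken from the parquet-writer code: the class and method names are hypothetical, and the use of UTC for the partition values is an assumption, since the time zone is not stated here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of haystack-tables.
final class ViewPartitionPath {

    // Quoted sections are literals, so this renders e.g. year=2019/month=02/day=03/hour=12.
    // Assumption: partitions are computed in UTC.
    private static final DateTimeFormatter PARTITION = DateTimeFormatter
            .ofPattern("'year='uuuu'/month='MM'/day='dd'/hour='HH")
            .withZone(ZoneOffset.UTC);

    // Returns the S3 prefix under which a view's files for the given hour would live.
    static String partitionPrefix(String bucket, String view, Instant timestamp) {
        return "s3://" + bucket + "/views/" + view + "/" + PARTITION.format(timestamp) + "/";
    }

    public static void main(String[] args) {
        System.out.println(
                partitionPrefix("bucket-name", "oms", Instant.parse("2019-02-03T12:34:56Z")));
    }
}
```

The `key=value` directory names follow the Hive partitioning convention, which is what allows query engines to discover the partitions from the path alone.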
### Athena Tables
The allocator provides an endpoint, `/athena/refresh`, that takes the following actions for all running views:
* Creates a partitioned table in Athena under the haystack_tables database
* Repairs the already existing table to add new S3 partitions
We run a cron job that hits this endpoint every few minutes to keep the tables up to date.
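The two actions above map naturally onto two Athena DDL statements. The sketch below only builds those statements as strings and is not the allocator's actual implementation. The class name, the column list passed in as `columnsDdl`, the replacement of hyphens in view names, and the choice of `string` for the partition columns are all assumptions.

```java
// Hypothetical helper, not part of haystack-tables.
final class AthenaRefreshSketch {

    private static final String DATABASE = "haystack_tables";

    // Assumption: view names such as "oms-test" are mapped to table names
    // by replacing hyphens with underscores.
    static String tableName(String view) {
        return view.replace('-', '_');
    }

    // Creates an external table over the view's S3 prefix, partitioned the same way
    // as the directory layout. columnsDdl is supplied by the caller, e.g. "operationname string".
    static String createTableSql(String view, String bucket, String columnsDdl) {
        return "CREATE EXTERNAL TABLE IF NOT EXISTS " + DATABASE + "." + tableName(view)
                + " (" + columnsDdl + ") "
                + "PARTITIONED BY (`year` string, `month` string, `day` string, `hour` string) "
                + "STORED AS PARQUET "
                + "LOCATION 's3://" + bucket + "/views/" + view + "/'";
    }

    // MSCK REPAIR TABLE scans the table location and registers any
    // Hive-style partitions (year=.../month=.../...) not yet in the catalog.
    static String repairTableSql(String view) {
        return "MSCK REPAIR TABLE " + DATABASE + "." + tableName(view);
    }

    public static void main(String[] args) {
        System.out.println(createTableSql("oms", "bucket-name", "operationname string"));
        System.out.println(repairTableSql("oms"));
    }
}
```

Because the repair statement rescans the whole table location, running it every few minutes is simple but may get slower as the number of partitions grows.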