https://github.com/derlin/doodlechallenge

My solution to Doodle's Data Engineering Hiring Challenge (March, 2020)

Host: GitHub
URL: https://github.com/derlin/doodlechallenge
Owner: derlin
Created: 2020-03-22T10:41:34.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2020-03-22T10:49:42.000Z (over 6 years ago)
Last Synced: 2023-04-06T23:58:13.116Z (over 3 years ago)
Language: Go
Homepage:
Size: 33.2 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Doodle Data Engineering Hiring Challenge - derlin

This repository contains the code and the report for the Doodle Hiring Challenge described here:
https://github.com/tamediadigital/hiring-challenges/tree/master/data-engineer-challenge

**⇨ see [report.md](report.md) for details ⇦**

## Structure

Folders:

* `cmd` contains the main program,
* `baseline` contains a basic baseline program that just reads frames from Kafka,
* `benchmarks` contains the script and results of a small benchmark experiment.

Files:

* `report.md`: explains and discusses the current solution and what else we could do,
* `docker-compose.yml`: what I used to run Kafka on my Mac.

## Prerequisites

1. have a working Go installation
2. have Kafka running on `localhost:9092`
3. have two Kafka topics, `doodle` and `doodle-out` (see the constants in `cmd/main.go`)
4. have the content of `stream.jsonl` in the Kafka topic `doodle` (see the challenge's readme); do this only once!
5. (if using my docker-compose) have a folder `data` containing the `stream.jsonl` file (see below)

## Install and run

1. clone this repo
2. install dependencies: `go get github.com/segmentio/kafka-go` or `go get ./cmd/...`
3. build and run using `go run cmd/*.go`, or build *then* run using `go build -o doodlechallenge cmd/*.go && ./doodlechallenge`

## Using docker for Kafka

Get the data:
```bash
mkdir data
cd data
wget http://tx.tamedia.ch.s3.amazonaws.com/challenge/data/stream.jsonl.gz
gunzip -k stream.jsonl.gz
cd ..
```

Launch the containers:
```bash
docker-compose up -d
```

Open a bash shell in the Kafka container:
```bash
docker exec -it doodlechallenge_kafka_1 bash
```

Set up the topics and load the data (still inside the Kafka container):

```bash
/opt/bitnami/kafka/bin/kafka-topics.sh --create \
--zookeeper zookeeper:2181 \
--replication-factor 1 --partitions 1 \
--topic doodle

/opt/bitnami/kafka/bin/kafka-topics.sh --create \
--zookeeper zookeeper:2181 \
--replication-factor 1 --partitions 1 \
--topic doodle-out

# optionally, set a very short retention on the out topic, in case the program is run many times
/opt/bitnami/kafka/bin/kafka-configs.sh \
--zookeeper zookeeper:2181 \
--alter \
--entity-type topics --entity-name doodle-out \
--add-config retention.ms=1000

cat /data/stream.jsonl | /opt/bitnami/kafka/bin/kafka-console-producer.sh --broker-list kafka:9092 --topic doodle
```
Note that the last command may take a while: there are 1M records to ingest!

----------------------------------

Derlin @ 2020 (during the corona virus outbreak)
