https://github.com/zerohertz/yolo-serving-cookbook

📸 YOLO Serving Cookbook based on Triton Inference Server 📸

docker docker-compose fastapi gradio k8s kubernetes mlops model-serving onnx pytorch triton-inference-server yolo yolov5



# 📸 YOLO Serving Cookbook 📸






## [1. Docker](https://github.com/Zerohertz/YOLO-Serving/tree/1.Docker)

Architecture

Figure: Docker

## [2. Docker Compose](https://github.com/Zerohertz/YOLO-Serving/tree/2.Docker-Compose)

Architecture

Figure: Docker-Compose

## 3. Kubernetes

Architecture (without Ensemble)

|Number of Replicas = 1|Number of Replicas = 5|
|:-:|:-:|
|Figure: Kubernetes-Rep=1|Figure: Kubernetes-Rep=5|

Architecture (with Ensemble)

|Number of Replicas = 1|Number of Replicas = 5|
|:-:|:-:|
|Figure: Kubernetes-Ensemble-Rep=1|Figure: Kubernetes-Ensemble-Rep=5|

### Experimental Setup

+ Server
  + `Sync`: requests are handled synchronously in FastAPI
  + `Async`: requests are handled asynchronously in FastAPI
  + `Rep`: number of replicas of `fastapi` and `triton-inference-server`
  + `Ensemble`: preprocessing, postprocessing, and visualization run inside `triton-inference-server` using an [ensemble](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) (`fastapi` runs asynchronously)
+ Client (FastAPI is called 100 times per run, and each experiment is run 10 times)
  + `Serial`: sequential calls in a `for` loop
  + `Concurrency`: simultaneous calls using `ThreadPoolExecutor`
  + `Random`: calls using `ThreadPoolExecutor`, each sent after a random delay of 0 to 20 seconds
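The three client patterns above can be sketched roughly as follows. This is a minimal illustration, not the repository's client script: the helper names (`timed_call`, `run_serial`, `run_concurrent`, `measure`) are hypothetical, `call` stands in for whatever function sends one request to the FastAPI endpoint, and it assumes that `Mean` is the average per-request latency and `End` is the wall-clock time until all calls finish.

```python
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def timed_call(call: Callable[[], object], max_delay: float = 0.0) -> float:
    """Optionally wait a random delay, then return the latency of one call in seconds."""
    if max_delay > 0:
        time.sleep(random.uniform(0, max_delay))  # Random pattern: 0 to max_delay seconds
    start = time.perf_counter()
    call()
    return time.perf_counter() - start


def run_serial(call: Callable[[], object], n: int = 100) -> List[float]:
    """Serial pattern: one call after another in a for loop."""
    return [timed_call(call) for _ in range(n)]


def run_concurrent(call: Callable[[], object], n: int = 100,
                   max_delay: float = 0.0) -> List[float]:
    """Concurrency pattern (max_delay=0) or Random pattern (max_delay=20)."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(timed_call, call, max_delay) for _ in range(n)]
        return [f.result() for f in futures]


def measure(run: Callable[[], List[float]]) -> Tuple[float, float]:
    """Return (mean latency per call, total wall-clock time) for one run."""
    start = time.perf_counter()
    latencies = run()
    end = time.perf_counter() - start
    return statistics.mean(latencies), end
```

For example, `measure(lambda: run_concurrent(send_request, n=100, max_delay=20))` would correspond to one `Random` run, where `send_request` is a placeholder for the actual HTTP call.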

### Results

Unit: [Sec]

|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[Sync&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-1.Sync)|0.69|78.01|41.93|129.61|40.05|128.63|
|[Sync&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-1.Sync)|0.60|68.99|25.57|61.38|26.88|81.69|
|[Async&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.68|77.02|0.80|82.22|0.78|80.34|
|[Async&Rep=1-5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.61|69.07|0.60|62.11|-|-|
|[Async&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.62|69.77|1.84|39.77|1.91|41.84|
|[Ensemble&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-3.Ensemble)|0.70|78.02|0.77|78.50|-|-|
|[Ensemble&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-3.Ensemble)|0.66|74.52|1.90|42.03|-|-|

Figures: EACH-SERIAL, TOTAL-SERIAL, EACH-CONCURRENCY, EACH-CONCURRENCY-ASYNC, TOTAL-CONCURRENCY, EACH-RANDOM, TOTAL-RANDOM

### Discussion

#### Sync, Async, Ensemble

Unit: [Sec]

|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Sync|0.647|73.499|33.752|95.496|33.460|105.160|
|Async|0.652|73.395|1.320|60.991|1.345|61.094|
|Ensemble|0.680|76.270|1.332|60.269|-|-|

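With serial calls, there is no difference between the synchronous and asynchronous approaches.

Compared with the synchronous approach, however, the asynchronous approach responds about 36.51% faster under concurrent calls and about 41.90% faster under random calls.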
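As a rough check, the relative improvement can be recomputed from the table above. This sketch assumes the percentages compare the `End` times of the `Sync` and `Async` rows; the helper name `speedup_percent` is hypothetical.

```python
def speedup_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return (baseline - improved) / baseline * 100


# End times in seconds, copied from the Sync and Async rows of the table above.
sync_end = {"Concurrency": 95.496, "Random": 105.160}
async_end = {"Concurrency": 60.991, "Random": 61.094}

for condition in sync_end:
    pct = speedup_percent(sync_end[condition], async_end[condition])
    print(f"{condition}: {pct:.2f}% shorter End time")
```

With the rounded table values this prints about 36.13% for `Concurrency` and 41.90% for `Random`. The concurrent figure is slightly below the quoted 36.51%, which may come from unrounded measurements.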

The ensemble approach, on the other hand, showed no significant advantage, although this may be a limitation of this experiment (resources, data scale, ...).

Error under the Random condition when FastAPI is defined with async def

```python
Traceback (most recent call last):
  File "anaconda3\lib\site-packages\requests\models.py", line 972, in json
    return complexjson.loads(self.text, **kwargs)
  File "anaconda3\lib\site-packages\simplejson\__init__.py", line 514, in loads
    return _default_decoder.decode(s)
  File "anaconda3\lib\site-packages\simplejson\decoder.py", line 386, in decode
    obj, end = self.raw_decode(s)
  File "anaconda3\lib\site-packages\simplejson\decoder.py", line 416, in raw_decode
    return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "Downloads\curl.py", line 70, in <module>
    main(i)
  File "Downloads\curl.py", line 53, in main
    responses = list(
  File "anaconda3\lib\concurrent\futures\_base.py", line 609, in result_iterator
    yield fs.pop().result()
  File "anaconda3\lib\concurrent\futures\_base.py", line 439, in result
    return self.__get_result()
  File "anaconda3\lib\concurrent\futures\_base.py", line 391, in __get_result
    raise self._exception
  File "anaconda3\lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "Downloads\curl.py", line 24, in send_request
    res = response.json()
  File "anaconda3\lib\site-packages\requests\models.py", line 976, in json
    raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

μ΄λŠ” `Random` μ‘°κ±΄μ—μ„œ λ°œμƒν•˜λŠ” 였λ₯˜μΈλ°, `Concurrency` 쑰건에선 λ°œμƒν•˜μ§€ μ•ŠλŠ”κ²Œ μ΄μƒν•˜λ‹€.

λͺ¨λ“  pod에 λŒ€ν•΄ λ‘œκ·Έκ°€ λ‚¨μ•„μžˆλ‚˜ μ‚΄νŽ΄λ΄€μ§€λ§Œ λ°œκ²¬ν•˜μ§€ λͺ»ν–ˆλ‹€.

μ•„λ§ˆ λΉ„λ™κΈ°μ μœΌλ‘œ `triton-inference-server`의 serviceλ₯Ό ν˜ΈμΆœν•˜λ‹€κ°€ 생긴 였λ₯˜λ‘œ 보인닀.

> ChatGPT's answer

When errors occur only for a particular request pattern in FastAPI, there can be various causes. Some possibilities are:

1. Resource limits: the server, or the database in use, may reach its connection limit or another resource limit. When 100 requests are sent at once, there may be no problem if each request is handled quickly and its resources are released quickly. When 100 requests are sent continuously over 20 seconds, however, resources may not be released in the meantime, which can cause problems.
2. Rate limiting: some systems use rate limiting to avoid handling too many requests in a short time. Sending requests steadily for 20 seconds may therefore hit the limit.
3. Memory issues: server memory usage can grow as requests accumulate. If memory runs short, errors can occur while requests are being handled.
4. Overlapping request handling: when requests are sent continuously for 20 seconds, the result or state of an earlier request can affect the handling of the next one. Duplicate requests or state changes may be the cause.
5. Logging or monitoring tools: if logs are written or monitoring tools are used while requests are handled, a higher request frequency can add load.

+ Solutions
  + Logging: check the logs of the error to identify the specific cause.
  + Profiling: monitor resource usage such as the server's CPU, memory, and network usage to identify bottlenecks or resource shortages.
  + Testing: test various scenarios for sending requests to the server to identify the cause.

These checks can help identify the cause of the problem and take appropriate action.

#### Replicas

Unit: [Sec]

|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Rep=1|0.691|77.682|14.501|96.777|20.415|104.487|
|Rep=5|0.629|71.094|9.767|47.726|14.391|61.767|

Increasing the number of replicas was confirmed to speed up API responses. ([Pods can be configured to talk to the Service, and communication to the Service is automatically load-balanced out to some pod that is a member of the Service.](https://kubernetes.io/ko/docs/tutorials/services/connect-applications-service/#%EC%84%9C%EB%B9%84%EC%8A%A4-%EC%83%9D%EC%84%B1%ED%95%98%EA%B8%B0))

The improvement is especially large under concurrent calls.

WORKER TIMEOUT

An error that did not occur with 1 `fastapi` replica and 5 `triton-inference-server` replicas occurred, as shown below, with 5 `fastapi` replicas and 5 `triton-inference-server` replicas.

This was resolved by adding `"--timeout", "120"` to the `Dockerfile`.

```bash
[1] [CRITICAL] WORKER TIMEOUT (pid:8)
[1] [WARNING] Worker with pid 8 was terminated due to signal 6
[379] [INFO] Booting worker with pid: 379
[379] [INFO] Started server process [379]
[379] [INFO] Waiting for application startup.
[379] [INFO] Application startup complete.
```
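The `WORKER TIMEOUT` message and the `--timeout` flag suggest the app is served by gunicorn, whose default worker timeout is 30 seconds. Assuming that setup, the same setting can be expressed in a gunicorn configuration file, which is plain Python. This is only a sketch: the file name, `bind`, `workers`, and `worker_class` values are assumptions, and only `timeout = 120` comes from the fix described above.

```python
# gunicorn.conf.py (hypothetical file; only the timeout value is taken from the fix above)

# Address and port to listen on (assumed).
bind = "0.0.0.0:80"

# Number of worker processes per pod (assumed).
workers = 1

# ASGI worker for FastAPI (assumed from the uvicorn-style startup lines in the log).
worker_class = "uvicorn.workers.UvicornWorker"

# Workers silent for longer than this many seconds are killed and restarted.
# Equivalent to passing "--timeout", "120" on the command line.
timeout = 120
```

Such a file would be loaded with `gunicorn -c gunicorn.conf.py main:app`, where `main:app` is a placeholder for the actual module and application object.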

#### Autoscaling

With `HPA`, when 100 requests arrive at once, they all reach the single `fastapi` pod before any replicas are created, so autoscaling has no effect.

For autoscaling to work well, new `metrics` other than `Resource`-based ones are therefore needed.

Example: hpa.yaml

```yaml
# autoscaling/v2beta2 is no longer served as of Kubernetes 1.26; newer clusters need autoscaling/v2.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    # Scale out when average CPU utilization exceeds 80% of the requested CPU.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    # Scale out when average memory utilization exceeds 80% of the requested memory.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
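The scaling rule documented for the Horizontal Pod Autoscaler is `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`, evaluated periodically (every 15 seconds by default) from metrics that have already been observed. The sketch below applies that rule to the `minReplicas`, `maxReplicas`, and 80% target from the example above. The function name and the sample utilization values are illustrative, and the default 10% tolerance is included as a simplification of the controller's behavior.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 5,
                     tolerance: float = 0.1) -> int:
    """Apply the documented HPA rule, clamped to [min_replicas, max_replicas]."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One pod observed at 400% average CPU utilization against an 80% target.
print(desired_replicas(1, 400, 80))  # 5
```

Because the new replica count is computed only after high utilization has been measured, a burst of 100 simultaneous requests has already landed on the single existing pod by the time the extra replicas start, which is consistent with the observation above.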

### [3.4. Gradio](https://github.com/Zerohertz/YOLO-Serving-Cookbook/tree/3.Kubernetes-4.Gradio)

Architecture

![](https://github.com/Zerohertz/YOLO-Serving-Cookbook/assets/42334717/fa647b85-9716-4fd8-933a-bb92ebbda62f)

![Gradio](https://github.com/Zerohertz/Zerohertz/assets/42334717/816ec0eb-7ba4-49d4-8302-6a720aba91d4)