https://github.com/zerohertz/yolo-serving-cookbook
YOLO Serving Cookbook based on Triton Inference Server
- Host: GitHub
- URL: https://github.com/zerohertz/yolo-serving-cookbook
- Owner: Zerohertz
- License: agpl-3.0
- Archived: true
- Created: 2023-10-23T07:09:59.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-15T13:12:02.000Z (almost 2 years ago)
- Last Synced: 2025-02-12T05:02:34.012Z (about 1 year ago)
- Topics: docker, docker-compose, fastapi, gradio, k8s, kubernetes, mlops, model-serving, onnx, pytorch, triton-inference-server, yolo, yolov5
- Language: Python
- Homepage:
- Size: 1.49 MB
- Stars: 4
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# YOLO Serving Cookbook
## [1. Docker](https://github.com/Zerohertz/YOLO-Serving/tree/1.Docker)
Architecture
## [2. Docker Compose](https://github.com/Zerohertz/YOLO-Serving/tree/2.Docker-Compose)
Architecture
## 3. Kubernetes
Architecture (without Ensemble)
Number of Replicas = 1
Number of Replicas = 5

Architecture (with Ensemble)
Number of Replicas = 1
Number of Replicas = 5

### Experimental Setup
+ Server
  + `Sync`: synchronous request handling in FastAPI
  + `Async`: asynchronous request handling in FastAPI
  + `Rep`: number of replicas of `fastapi` and `triton-inference-server`
  + `Ensemble`: pre-processing, post-processing, and visualization performed inside `triton-inference-server` via [ensemble](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) (`fastapi` runs asynchronously)
+ Client (100 calls to FastAPI per run, 10 runs)
  + `Serial`: sequential calls in a `for` loop
  + `Concurrency`: concurrent calls via `ThreadPoolExecutor`
  + `Random`: concurrent calls via `ThreadPoolExecutor`, each delayed by a random 0 ~ 20 s offset
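The three client call patterns above can be sketched roughly as follows; the endpoint URL and request payload are illustrative placeholders, not the repository's actual benchmark script:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8000/infer"  # hypothetical FastAPI endpoint


def call(delay: float = 0.0) -> float:
    """Optionally wait, then time a single API call."""
    time.sleep(delay)
    start = time.time()
    requests.post(URL, json={"image": "base64-payload"})  # placeholder body
    return time.time() - start


def serial(n: int = 100) -> list[float]:
    # `Serial`: one request at a time in a for loop
    return [call() for _ in range(n)]


def concurrency(n: int = 100) -> list[float]:
    # `Concurrency`: all requests fired at once via ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda _: call(), range(n)))


def random_calls(n: int = 100) -> list[float]:
    # `Random`: same pool, each request delayed by a random 0-20 s offset
    delays = [random.uniform(0, 20) for _ in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(call, delays))
```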
### Results
Unit: [sec]
|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|[Sync&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-1.Sync)|0.69|78.01|41.93|129.61|40.05|128.63|
|[Sync&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-1.Sync)|0.60|68.99|25.57|61.38|26.88|81.69|
|[Async&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.68|77.02|0.80|82.22|0.78|80.34|
|[Async&Rep=1-5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.61|69.07|0.60|62.11|-|-|
|[Async&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-2.Async)|0.62|69.77|1.84|39.77|1.91|41.84|
|[Ensemble&Rep=1](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-3.Ensemble)|0.70|78.02|0.77|78.50|-|-|
|[Ensemble&Rep=5](https://github.com/Zerohertz/YOLO-Serving/tree/3.Kubernetes-3.Ensemble)|0.66|74.52|1.90|42.03|-|-|
Figures




### Discussion
#### Sync, Async, Ensemble
Unit: [sec]
|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Sync|0.647|73.499|33.752|95.496|33.460|105.160|
|Async|0.652|73.395|1.320|60.991|1.345|61.094|
|Ensemble|0.680|76.270|1.332|60.269|-|-|
Serial calls show no meaningful difference between the synchronous and asynchronous setups.
Under concurrent calls, however, the asynchronous setup responds about 36.51% faster than the synchronous one, and about 41.90% faster under random calls.
The ensemble setup showed no clear additional benefit, though this may be a limitation of the experiment (available resources, data scale, ...).
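The sync/async gap in the table comes down to whether a blocking inference call ties up the server while other requests wait. A stdlib-only toy simulation (the 0.1 s `fake_infer` is a stand-in for the Triton round trip, not the repository's code):

```python
import asyncio
import time


def fake_infer() -> str:
    """Stand-in for a blocking ~0.1 s Triton inference round trip."""
    time.sleep(0.1)
    return "ok"


async def handle_sync() -> str:
    # Sync-style: the blocking call runs on the event loop thread,
    # so overlapping requests are effectively serialized
    return fake_infer()


async def handle_async() -> str:
    # Async-style: the blocking call is pushed to a worker thread,
    # so the event loop can serve other requests in the meantime
    return await asyncio.to_thread(fake_infer)


async def bench(handler, n: int = 10) -> float:
    """Wall-clock time to serve n overlapping requests with one handler."""
    start = time.time()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.time() - start
```

With `n = 10`, `bench(handle_sync)` takes roughly `n * 0.1` s while `bench(handle_async)` stays near 0.1 s, mirroring the Mean(Concurrency) gap above.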
Error in the `Random` condition with FastAPI endpoints defined via `async def`
```python
Traceback (most recent call last):
File "anaconda3\lib\site-packages\requests\models.py", line 972, in json
return complexjson.loads(self.text, **kwargs)
File "anaconda3\lib\site-packages\simplejson\__init__.py", line 514, in loads
return _default_decoder.decode(s)
File "anaconda3\lib\site-packages\simplejson\decoder.py", line 386, in decode
obj, end = self.raw_decode(s)
File "anaconda3\lib\site-packages\simplejson\decoder.py", line 416, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.errors.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "Downloads\curl.py", line 70, in <module>
main(i)
File "Downloads\curl.py", line 53, in main
responses = list(
File "anaconda3\lib\concurrent\futures\_base.py", line 609, in result_iterator
yield fs.pop().result()
File "anaconda3\lib\concurrent\futures\_base.py", line 439, in result
return self.__get_result()
File "anaconda3\lib\concurrent\futures\_base.py", line 391, in __get_result
raise self._exception
File "anaconda3\lib\concurrent\futures\thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "Downloads\curl.py", line 24, in send_request
res = response.json()
File "anaconda3\lib\site-packages\requests\models.py", line 976, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
This error appears only in the `Random` condition, which is odd given that it never appears in the `Concurrency` condition.
The logs of every pod were inspected, but nothing relevant was found.
It most likely occurred while calling the `triton-inference-server` service asynchronously.
> ChatGPT's answer

When using FastAPI and an error occurs only for a specific request pattern, several causes are possible:
1. Resource limits: the server, its database connections, or other resources may hit their limits. If 100 requests arrive at a single instant and each is processed and releases its resources quickly, there may be no problem; but if 100 requests are sent continuously over 20 seconds, resources may not be released in between, causing failures.
2. Rate limiting: some systems apply rate limiting to avoid handling too many requests in a short window, so sending requests steadily for 20 seconds may hit such a limit.
3. Memory issues: as requests accumulate, the server's memory usage grows; if memory runs short, errors can occur mid-request.
4. Duplicate request handling: when requests are sent continuously over 20 seconds, the results or state of earlier requests can affect the handling of later ones; duplicated requests or state changes could be the cause.
5. Logging or monitoring tools: if logs or monitoring hooks run per request, a higher request frequency adds extra load.

+ Possible remedies
  + Logging: inspect the error logs to identify the specific cause.
  + Profiling: monitor server resource usage (CPU, memory, network, etc.) to spot bottlenecks or resource shortages.
  + Testing: exercise various request scenarios against the server to narrow down the cause.

These checks should reveal the cause and suggest the appropriate fix.
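Whatever the root cause, the traceback shows the client compounding it: calling `response.json()` unconditionally turns any non-JSON error body (e.g. a gateway-timeout page) into a bare `JSONDecodeError`, hiding the real status code. A small defensive wrapper, sketched here with a placeholder usage, would surface the underlying failure:

```python
import requests


def safe_json(response: requests.Response) -> dict:
    """Parse a JSON response, raising a descriptive error on failure."""
    # Fail first on HTTP-level errors (e.g. 502/504 from an overloaded pod)
    response.raise_for_status()
    try:
        return response.json()
    except requests.exceptions.JSONDecodeError:
        # Keep the raw body so the server-side failure stays visible
        raise RuntimeError(
            f"Non-JSON response ({response.status_code}): {response.text[:200]}"
        )


# usage sketch (URL is a placeholder):
# res = requests.post("http://localhost:8000/infer", json={...})
# data = safe_json(res)
```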
#### Replicas
Unit: [sec]
|Server Arch.|Mean(Serial)|End(Serial)|Mean(Concurrency)|End(Concurrency)|Mean(Random)|End(Random)|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|Rep=1|0.691|77.682|14.501|96.777|20.415|104.487|
|Rep=5|0.629|71.094|9.767|47.726|14.391|61.767|
Increasing the number of replicas was confirmed to speed up the API's responses. ([A Deployment can be exposed through a single Service, and the Service automatically load-balances traffic across its member pods.](https://kubernetes.io/ko/docs/tutorials/services/connect-applications-service/#%EC%84%9C%EB%B9%84%EC%8A%A4-%EC%83%9D%EC%84%B1%ED%95%98%EA%B8%B0))
The improvement is especially large for concurrent calls.
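Changing `Rep` is just a matter of the Deployment's `replicas` field; a minimal sketch, with labels and image tag assumed rather than taken from the repository's manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 5  # Rep=5; the fronting Service load-balances across all ready pods
  selector:
    matchLabels:
      app: triton-inference-server
  template:
    metadata:
      labels:
        app: triton-inference-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:23.08-py3  # illustrative tag
          args: ["tritonserver", "--model-repository=/models"]
```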
WORKER TIMEOUT
An error that never occurred with 1 `fastapi` replica and 5 `triton-inference-server` replicas appeared as below once both `fastapi` and `triton-inference-server` ran with 5 replicas.
It was resolved by adding `"--timeout", "120"` to the `Dockerfile`.
```bash
[1] [CRITICAL] WORKER TIMEOUT (pid:8)
[1] [WARNING] Worker with pid 8 was terminated due to signal 6
[379] [INFO] Booting worker with pid: 379
[379] [INFO] Started server process [379]
[379] [INFO] Waiting for application startup.
[379] [INFO] Application startup complete.
```
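The fix works because gunicorn kills any worker that does not respond within its default 30 s timeout, producing the `CRITICAL WORKER TIMEOUT` / signal 6 lines above. A sketch of the relevant `Dockerfile` command, with the base image and module path assumed:

```dockerfile
FROM python:3.8-slim
# ... install dependencies, copy sources ...
# Raise the worker timeout from the 30 s default so long-running
# inference requests are not aborted mid-flight
CMD ["gunicorn", "main:app", \
     "--worker-class", "uvicorn.workers.UvicornWorker", \
     "--bind", "0.0.0.0:8000", \
     "--timeout", "120"]
```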
#### Autoscaling
With `HPA`, when 100 requests arrive at essentially the same instant, they all land on the single existing `fastapi` pod before any new replica can be created, so no autoscaling effect is visible.
Performing effective autoscaling therefore requires new `metrics` beyond the `Resource`-based criteria.
Example: hpa.yaml
```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-inference-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
---
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
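One way to satisfy the "non-`Resource` metrics" requirement is a `Pods`-type custom metric served through an adapter such as prometheus-adapter; a sketch against the now-stable `autoscaling/v2` API, where the metric name `http_requests_per_second` is hypothetical and must actually exist in the cluster's metrics pipeline:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: fastapi-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: fastapi
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second  # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "10"  # scale out above 10 req/s per pod
```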
### [3.4. Gradio](https://github.com/Zerohertz/YOLO-Serving-Cookbook/tree/3.Kubernetes-4.Gradio)
Architecture

