https://github.com/nomnomnonono/ml-pipeline-of-paper-category-classification
ML pipeline for paper category classification, with a scheduler that automatically retrieves paper information from arXiv and a single command that trains, runs inference with, and deploys the classification model.
- Host: GitHub
- URL: https://github.com/nomnomnonono/ml-pipeline-of-paper-category-classification
- Owner: nomnomnonono
- Created: 2023-08-26T04:43:53.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-14T02:32:05.000Z (about 2 years ago)
- Last Synced: 2025-02-07T07:32:06.604Z (8 months ago)
- Topics: arxiv, docker, google-cloud-platform, kubeflow, machine-learning, mlops, poetry, python
- Language: Python
- Homepage:
- Size: 305 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# ML-Pipeline-of-Paper-Category-Classification
Trains and deploys a classifier that predicts a paper's primary category from its title, using paper data retrieved via the arXiv API.
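The model itself is not described in this README; purely as a hypothetical illustration of a title-to-primary-category classifier (not necessarily what this pipeline trains), a TF-IDF plus logistic-regression baseline could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the scraped arXiv data: (title, primary category) pairs.
titles = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
categories = ["cs.CL", "cs.CV"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, categories)
print(model.predict(["A Survey of Large Language Models"]))
```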
## Requirements

- Poetry
- gcloud CLI
- docker compose

## Setup
### GCP Authentication
```bash
$ gcloud auth login
$ gcloud components install pubsub-emulator
```
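How the project uses Pub/Sub is not spelled out in the README; the emulator installed above is typically targeted from local code roughly as follows (the host/port, project, and topic names are placeholders):

```python
import os
from google.cloud import pubsub_v1

# Point the client library at the local emulator instead of real GCP
# (8085 is the emulator's default port; adjust if you started it elsewhere).
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "your-topic-id")
publisher.create_topic(request={"name": topic_path})

future = publisher.publish(topic_path, data=b"hello from the emulator")
print(future.result())  # prints the published message ID
```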
### Install Dependencies

```bash
$ make install
```

### Environment Variables
```bash
$ vi .env
```

- Fill in the following values and export them as environment variables:
```bash
GCP_PROJECT_ID=your project id
TOPIC_ID=your topic id
AR_REPOSITORY_NAME=artifact registry repository name
LOCATION=asia-northeast1
DATA_BUCKET=gs://xxx
SOURCE_CSV_URI=gs://xxx/data.csv
CONFIG_FILE_URI=gs://xxx/config.json
ROOT_BUCKET=gs://yyy
JOB_NAME=cloud run job name
SCHEDULER_NAME=cloud scheduler name
DATASET_NAME=dataset name
TABLE_NAME=table name
BQ_FUNC_NAME=cloud functions name to use bigquery
PIPELINE_NAME=vertex ai pipelines name
```
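These values are read from the process environment at runtime. A minimal sketch of loading them in Python (the use of python-dotenv is an assumption; a plain `export` before running works just as well):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # read .env into the process environment if the file exists

PROJECT_ID = os.environ["GCP_PROJECT_ID"]
DATA_BUCKET = os.environ["DATA_BUCKET"]
PIPELINE_NAME = os.environ["PIPELINE_NAME"]
print(PROJECT_ID, DATA_BUCKET, PIPELINE_NAME)
```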
## Boot MLflow Server

```bash
$ make mlflow
```
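Training code can then log parameters and metrics to this server; a minimal MLflow sketch, assuming the server started by `make mlflow` listens on localhost port 5000 (the port is an assumption):

```python
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # assumed address of the local server
mlflow.set_experiment("paper-category-classification")

with mlflow.start_run():
    mlflow.log_param("model", "baseline")
    mlflow.log_metric("val_accuracy", 0.85)  # placeholder value
```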
## Build & Push Docker Image

```bash
$ gcloud auth configure-docker asia-northeast1-docker.pkg.dev
$ gcloud artifacts repositories create $AR_REPOSITORY_NAME --location=$LOCATION --repository-format=docker
$ docker compose build
$ docker compose push
```

## Deploy Cloud Functions to Use BigQuery
Deploys a Cloud Function that automatically updates BigQuery whenever the dataset is updated.
```bash
$ make deploy_bq_func
```
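The deployed function itself is not shown in the README; a minimal sketch of such a function, assuming a Cloud Storage finalize trigger and the environment variables defined above:

```python
import os

import functions_framework
from google.cloud import bigquery


@functions_framework.cloud_event
def load_csv_to_bigquery(cloud_event):
    """Load the updated CSV from Cloud Storage into the BigQuery table."""
    data = cloud_event.data  # GCS event payload: bucket and object name
    uri = f"gs://{data['bucket']}/{data['name']}"
    table_id = (
        f"{os.environ['GCP_PROJECT_ID']}"
        f".{os.environ['DATASET_NAME']}.{os.environ['TABLE_NAME']}"
    )

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bigquery.Client().load_table_from_uri(uri, table_id, job_config=job_config).result()
```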
## Cloud Run Job to Scrape Paper Data

### Deploy
```bash
$ make deploy_job
```

### Exec
Running the command below fetches the paper information added since the previous run (i.e., only the diff).
```bash
$ make exec_job
```
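As an illustration of what the scraping job does, here is a minimal sketch of pulling recent paper metadata with the `arxiv` Python package (the query and result limit are placeholders; the real job keeps track of its last run and fetches only the diff):

```python
import arxiv

search = arxiv.Search(
    query="cat:cs.LG",  # hypothetical category filter
    max_results=100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for paper in arxiv.Client().results(search):
    # Title and primary category are the fields the classifier cares about.
    print(paper.published, paper.primary_category, paper.title)
```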
### Create Scheduler

To run the Cloud Run Job on a schedule, run the command below (by default, once a month).
```bash
$ make create_scheduler
```

## Exec Pipeline
```bash
$ make pipeline
```
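`make pipeline` presumably compiles the Kubeflow pipeline and submits it to Vertex AI Pipelines; a hedged sketch of what that submission looks like with the Vertex AI SDK (the compiled spec filename `pipeline.json` is an assumption):

```python
import os

from google.cloud import aiplatform

aiplatform.init(
    project=os.environ["GCP_PROJECT_ID"],
    location=os.environ["LOCATION"],
    staging_bucket=os.environ["ROOT_BUCKET"],
)

job = aiplatform.PipelineJob(
    display_name=os.environ["PIPELINE_NAME"],
    template_path="pipeline.json",  # compiled KFP pipeline spec
    pipeline_root=os.environ["ROOT_BUCKET"],
)
job.run()  # use job.submit() instead to return without waiting for completion
```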