https://github.com/nomnomnonono/ml-pipeline-of-paper-category-classification
ML pipeline for paper category classification, with a scheduler that automatically retrieves paper information from arXiv and a single command that trains, runs inference with, and deploys the classification model.
- Host: GitHub
- URL: https://github.com/nomnomnonono/ml-pipeline-of-paper-category-classification
- Owner: nomnomnonono
- Created: 2023-08-26T04:43:53.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-14T02:32:05.000Z (about 2 years ago)
- Last Synced: 2025-02-07T07:32:06.604Z (8 months ago)
- Topics: arxiv, docker, google-cloud-platform, kubeflow, machine-learning, mlops, poetry, python
- Language: Python
- Homepage:
- Size: 305 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
# ML-Pipeline-of-Paper-Category-Classification
Trains and deploys a classifier that predicts a paper's primary category from its title, using paper data retrieved via the arXiv API.
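The model itself is not described in this README; purely as a hypothetical illustration of a title-to-primary-category classifier (not necessarily what this pipeline trains), a TF-IDF plus logistic-regression baseline could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the scraped arXiv data: (title, primary category) pairs.
titles = ["Attention Is All You Need", "Deep Residual Learning for Image Recognition"]
categories = ["cs.CL", "cs.CV"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(titles, categories)
print(model.predict(["A Survey of Large Language Models"]))
```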
## Requirements

- Poetry
- gcloud CLI
- docker compose

## Setup
### GCP Authentication
```bash
$ gcloud auth login
$ gcloud components install pubsub-emulator
```
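How the project uses Pub/Sub is not spelled out in the README; the emulator installed above is typically targeted from local code roughly as follows (the host/port, project, and topic names are placeholders):

```python
import os
from google.cloud import pubsub_v1

# Point the client library at the local emulator instead of real GCP
# (8085 is the emulator's default port; adjust if you started it elsewhere).
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("your-project-id", "your-topic-id")
publisher.create_topic(request={"name": topic_path})

future = publisher.publish(topic_path, data=b"hello from the emulator")
print(future.result())  # prints the published message ID
```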
### Install Dependencies

```bash
$ make install
```

### Environment Variables
```bash
$ vi .env
```

- Fill in the following values and export them as environment variables:
```bash
GCP_PROJECT_ID=your project id
TOPIC_ID=your topic id
AR_REPOSITORY_NAME=artifact registry repository name
LOCATION=asia-northeast1
DATA_BUCKET=gs://xxx
SOURCE_CSV_URI=gs://xxx/data.csv
CONFIG_FILE_URI=gs://xxx/config.json
ROOT_BUCKET=gs://yyy
JOB_NAME=cloud run job name
SCHEDULER_NAME=cloud scheduler name
DATASET_NAME=dataset name
TABLE_NAME=table name
BQ_FUNC_NAME=cloud functions name to use bigquery
PIPELINE_NAME=vertex ai pipelines name
```
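These values are read from the process environment at runtime. A minimal sketch of loading them in Python (the use of python-dotenv is an assumption; a plain `export` before running works just as well):

```python
import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

load_dotenv()  # read .env into the process environment if the file exists

PROJECT_ID = os.environ["GCP_PROJECT_ID"]
DATA_BUCKET = os.environ["DATA_BUCKET"]
PIPELINE_NAME = os.environ["PIPELINE_NAME"]
print(PROJECT_ID, DATA_BUCKET, PIPELINE_NAME)
```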
## Boot MLflow Server

```bash
$ make mlflow
```
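Training code can then log parameters and metrics to this server; a minimal MLflow sketch, assuming the server started by `make mlflow` listens on localhost port 5000 (the port is an assumption):

```python
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # assumed address of the local server
mlflow.set_experiment("paper-category-classification")

with mlflow.start_run():
    mlflow.log_param("model", "baseline")
    mlflow.log_metric("val_accuracy", 0.85)  # placeholder value
```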
## Build & Push Docker Image

```bash
$ gcloud auth configure-docker asia-northeast1-docker.pkg.dev
$ gcloud artifacts repositories create $AR_REPOSITORY_NAME --location=$LOCATION --repository-format=docker
$ docker compose build
$ docker compose push
```

## Deploy Cloud Functions to Use BigQuery
Deploys a Cloud Function that automatically updates BigQuery whenever the dataset is updated.
```bash
$ make deploy_bq_func
```
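The deployed function itself is not shown in the README; a minimal sketch of such a function, assuming a Cloud Storage finalize trigger and the environment variables defined above:

```python
import os

import functions_framework
from google.cloud import bigquery


@functions_framework.cloud_event
def load_csv_to_bigquery(cloud_event):
    """Load the updated CSV from Cloud Storage into the BigQuery table."""
    data = cloud_event.data  # GCS event payload: bucket and object name
    uri = f"gs://{data['bucket']}/{data['name']}"
    table_id = (
        f"{os.environ['GCP_PROJECT_ID']}"
        f".{os.environ['DATASET_NAME']}.{os.environ['TABLE_NAME']}"
    )

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    bigquery.Client().load_table_from_uri(uri, table_id, job_config=job_config).result()
```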
## Cloud Run Job to Scrape Paper Data

### Deploy
```bash
$ make deploy_job
```

### Exec
Running the command below fetches the paper information added since the previous run (i.e., only the diff).
```bash
$ make exec_job
```
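As an illustration of what the scraping job does, here is a minimal sketch of pulling recent paper metadata with the `arxiv` Python package (the query and result limit are placeholders; the real job keeps track of its last run and fetches only the diff):

```python
import arxiv

search = arxiv.Search(
    query="cat:cs.LG",  # hypothetical category filter
    max_results=100,
    sort_by=arxiv.SortCriterion.SubmittedDate,
)

for paper in arxiv.Client().results(search):
    # Title and primary category are the fields the classifier cares about.
    print(paper.published, paper.primary_category, paper.title)
```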
### Create Scheduler

To run the Cloud Run Job on a schedule, run the command below (by default, once a month).
```bash
$ make create_scheduler
```

## Exec Pipeline
```bash
$ make pipeline
```
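`make pipeline` presumably compiles the Kubeflow pipeline and submits it to Vertex AI Pipelines; a hedged sketch of what that submission looks like with the Vertex AI SDK (the compiled spec filename `pipeline.json` is an assumption):

```python
import os

from google.cloud import aiplatform

aiplatform.init(
    project=os.environ["GCP_PROJECT_ID"],
    location=os.environ["LOCATION"],
    staging_bucket=os.environ["ROOT_BUCKET"],
)

job = aiplatform.PipelineJob(
    display_name=os.environ["PIPELINE_NAME"],
    template_path="pipeline.json",  # compiled KFP pipeline spec
    pipeline_root=os.environ["ROOT_BUCKET"],
)
job.run()  # use job.submit() instead to return without waiting for completion
```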