{"id":25010118,"url":"https://github.com/kyopark2014/aws-sagemaker","last_synced_at":"2025-10-03T20:33:29.822Z","repository":{"id":60226074,"uuid":"478510673","full_name":"kyopark2014/aws-sagemaker","owner":"kyopark2014","description":"It shows how to develop a ML model for an actual case using AWS SageMaker.","archived":false,"fork":false,"pushed_at":"2023-07-19T07:34:00.000Z","size":2295,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-26T13:46:03.777Z","etag":null,"topics":["aws","machine-learning","pytorch","sagemaker","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kyopark2014.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-04-06T10:27:34.000Z","updated_at":"2022-11-20T10:12:34.000Z","dependencies_parsed_at":"2023-01-20T17:17:00.393Z","dependency_job_id":null,"html_url":"https://github.com/kyopark2014/aws-sagemaker","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyopark2014%2Faws-sagemaker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyopark2014%2Faws-sagemaker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyopark2014%2Faws-sagemaker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kyopark2014%2Faws-sagemaker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kyopark2014","download_url":"https://codeload.github.
com/kyopark2014/aws-sagemaker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248621400,"owners_count":21134838,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","machine-learning","pytorch","sagemaker","xgboost"],"created_at":"2025-02-05T04:52:40.892Z","updated_at":"2025-10-03T20:33:24.780Z","avatar_url":"https://github.com/kyopark2014.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS SageMaker\n\n\nSageMaker is a fully managed AWS machine learning service that helps data scientists develop and train models quickly and easily. \n\n\n## SageMaker Training\n\nWith the Jupyter notebooks that SageMaker provides, you can preprocess training data and develop models. You can also train a model on the notebook instance itself, but scaling up the notebook instance whenever training needs a more powerful CPU/GPU is not cost-efficient. Instead, training runs on separately launched instances; this is called SageMaker Training.\n\nTo train, first upload the training data to S3. SageMaker then pulls the data from S3 into the training cluster and runs the training job. The training code is loaded from the notebook and used on the training cluster. \n\n## Training Container \n\n[Folder paths and environment variables used when creating a training container](https://github.com/kyopark2014/aws-sagemaker/blob/main/training-container.md) describes how a training container is laid out in SageMaker. \n\n## Defining the Training Cluster\n\nDefine the IAM role and hyperparameters that the training cluster will use as shown below. Calling sagemaker.get_execution_role() returns the role of the current notebook. To use a different role, pass that role's ARN instead. \n\n```python\nimport sagemaker \n\nsagemaker_session = sagemaker.Session()\t \t# Define the SageMaker session\nrole = sagemaker.get_execution_role()\t\t# Use the role attached to the SageMaker notebook\n```\n\nDefine the hyperparameters. 
\n\n```python\nhyperparameters = {\"batch_size\": 32,\n\t\t   \"lr\": 1e-4,\n\t\t   \"image_size\": 128}\t\t# Argument values passed to the training script\n```\n\n\nUse an Estimator to define the instance type and count of the training cluster, the training code to run, and the training environment container. \n\n```python\nfrom sagemaker.pytorch import PyTorch \n\nestimator = PyTorch( \n\tsource_dir=\"code\",                                   \t# Folder containing the training code\n\tentry_point=\"train_pytorch_smdataparallel_mnist.py\",\t# Training script to run\n\trole=role, \t\t\t\t\t\t# Role used by the training cluster\n\tframework_version=\"1.10\",\t\t\t\t# PyTorch version\n\tpy_version=\"py38\", \t\t\t\t\t# Python version\n\tinstance_count=1,        \t\t\t\t# Number of training instances\n\tinstance_type=\"ml.p4d.24xlarge\",             \t\t# Training instance type\n\tsagemaker_session=sagemaker_session,\t\t\t# SageMaker session\n\thyperparameters=hyperparameters,\t\t\t# Hyperparameters\n)\n```\n\nYou can also pass additional parameters such as the following to the estimator. \n\n```python\nestimator = PyTorch( \n\t… ,\n\tmax_run=5*24*60*60,\t\t\t# Maximum training run time (seconds)\n\tuse_spot_instances=True, \t\t# Whether to use spot instances\n\tmax_wait=5*24*60*60 + 3*60*60, \t# Maximum total time (seconds), including waits for spot capacity; must be >= max_run\n\tcheckpoint_s3_uri=checkpoint_s3_uri,\t# S3 location where checkpoints are saved\n\t…\t\t\n)\n```\n\n\n## Data path\n\nThree types of data path can be used for training: S3, EFS, and FSx for Lustre.\n\n```python\nfrom sagemaker.inputs import FileSystemInput\n\n# S3 \ndata_path = \"s3://my_bucket/my_training_data/\"\n\n# EFS\ndata_path = FileSystemInput(file_system_id=\"fs-1\", file_system_type=\"EFS\",\n\t\t\t   directory_path=\"/dataset\", file_system_access_mode=\"ro\")\n\n# FSx for Lustre\ndata_path = FileSystemInput(file_system_id=\"fs-2\", file_system_type=\"FSxLustre\", \n\t\t\t   directory_path=\"/\u003cmount-id\u003e/dataset\", \n\t\t\t   file_system_access_mode=\"ro\")\n```\n\nIf copying files from S3 takes a long time (datasets of tens of GB or more), the GPU instances sit idle while they wait, so FSx for Lustre is worth considering. \n\nWith FileSystemInput, the data is not copied; the file system is mounted and read directly. 
\n\n\n## Starting Training\n\nDeclare the data path and channel_name that the training cluster will use, then start the job.\n\n```python\nchannel_name = \"training\"\njob_name = \"pytorch-mnist-training\"\t# Hypothetical name; must be unique within the account and region\n\nestimator.fit(\n\tinputs={channel_name : data_path},\n\tjob_name=job_name\n)\n```\n\n## Local Mode Debugging\n\nYou can use Local Mode on the SageMaker notebook you created to develop and debug training code.\n\nFor distributed deep learning, create the notebook instance with a GPU instance type. Note that SageMaker's data parallel and model parallel libraries can only be tested on ml.p3.16xlarge or larger. This is a temporary setup; to save cost, switch back to a CPU instance type after testing.\n\n- Local Mode \n\u003cimg width=\"658\" alt=\"image\" src=\"https://user-images.githubusercontent.com/52392004/190835603-a4ae3ab8-efeb-4d46-8312-4772ca49a675.png\"\u003e\n\n- Actual training mode\n\u003cimg width=\"658\" alt=\"image\" src=\"https://user-images.githubusercontent.com/52392004/190835617-68cf5d32-cb0f-436f-b9f0-5684302521fc.png\"\u003e\n\n## Metric\n\n### Metric definition\n\nSuppose the training code prints a log line like the one below, and you want to track Train_Loss as a metric.\n\n```text\nEpoch : [2][6/10] Train_Time = 0.355 : (3.134) , Train_Speed = 1803.360 (204.190), Train_Loss = 1.0813320875 : (1.3528) , Train_Prec@1 = 76.250 : (68.958)\n```\n\nYou can capture the metric from the log with a regex, as follows. \n\n```python\nmetric_definitions = [{\"Name\": \"train:Loss\", \"Regex\": \"Train_Loss = (.*?) :\"}, ...]\n```\n\nThen add metric_definitions when defining the estimator. 
\n\n```python\nestimator = PyTorch( \n\tsource_dir=\"code\",                                   \t# Folder containing the training code\n\tentry_point=\"train_pytorch_smdataparallel_mnist.py\",\t# Training script to run\n\trole=role, \t\t\t\t\t\t# Role used by the training cluster\n\tframework_version=\"1.10\",\t\t\t\t# PyTorch version\n\tpy_version=\"py38\", \t\t\t\t\t# Python version\n\tinstance_count=1,        \t\t\t\t# Number of training instances\n\tinstance_type=\"ml.p4d.24xlarge\",             \t\t# Training instance type\n\tsagemaker_session=sagemaker_session,\t\t\t# SageMaker session\n\thyperparameters=hyperparameters,\t\t\t# Hyperparameters\n\tmetric_definitions=metric_definitions,       \t\t# Metric definitions\n)\n```\n\n## SageMaker Basic\n\n[SageMaker Training](https://github.com/kyopark2014/aws-sagemaker/tree/main/training-basic) walks through an example that detects insurance fraud using XGBoost. \n\n\n## SageMaker Experiment\n\nWith [SageMaker Experiment and Trial](https://github.com/kyopark2014/aws-sagemaker/blob/main/sagemaker-experiment.md), you can record and track hyperparameters, evaluation metrics, and more across multiple runs. \n\n\n## SageMaker Processing \n\n[SageMaker Processing](https://github.com/kyopark2014/aws-sagemaker/blob/main/sagemaker-processing.md) provides an environment for running preprocessing, postprocessing, and model evaluation. It takes data from S3 as input, runs your processing logic, and writes the output back to S3, where SageMaker can use it as a dataset. 
\n\n## Workshop\n\n[SageMaker Immersion Day Workshop](https://github.com/kyopark2014/aws-sagemaker/tree/main/workshop) describes the SageMaker Immersion Day workshop.\n\n## Monitoring\n\nYou can check GPU/CPU resource usage through CloudWatch, as shown below.\n\n\u003cimg width=\"725\" alt=\"image\" src=\"https://user-images.githubusercontent.com/52392004/190836077-464e9d89-8188-4814-8d8c-f8026ae55a5c.png\"\u003e\n\n\n\n## Reference\n\n[Introduction to Amazon SageMaker Model Training - AWS AIML Special Webinar](https://www.youtube.com/watch?v=oQ7glJfD-BQ\u0026list=PLORxAVAC5fUULZBkbSE--PSY6bywP7gyr)\n\n[SageMaker Special Webinar - GitHub](https://github.com/aws-samples/aws-ai-ml-workshop-kr/tree/master/sagemaker/sm-special-webinar)\n\n[Direct Marketing with Amazon SageMaker XGBoost and Hyperparameter Tuning (SageMaker SDK)](https://sagemaker-examples.readthedocs.io/en/latest/hyperparameter_tuning/xgboost_direct_marketing/hpo_xgboost_direct_marketing_sagemaker_APIs.html)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyopark2014%2Faws-sagemaker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkyopark2014%2Faws-sagemaker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkyopark2014%2Faws-sagemaker/lists"}