{"id":26239636,"url":"https://github.com/cornerstone-ondemand/modelkit-imdb","last_synced_at":"2026-04-20T10:05:39.886Z","repository":{"id":48929599,"uuid":"380006265","full_name":"Cornerstone-OnDemand/modelkit-imdb","owner":"Cornerstone-OnDemand","description":"NLP sample project leveraging modelkit and the imdb reviews dataset","archived":false,"fork":false,"pushed_at":"2021-07-07T09:01:51.000Z","size":15631,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-10-25T21:38:46.310Z","etag":null,"topics":["fastapi","imdb-dataset","machine-learning","mlops","modelkit","natural-language-processing","nlp","python","rest-api","spacy","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Cornerstone-OnDemand.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-06-24T17:42:07.000Z","updated_at":"2025-08-25T16:59:25.000Z","dependencies_parsed_at":"2022-09-14T08:50:21.274Z","dependency_job_id":null,"html_url":"https://github.com/Cornerstone-OnDemand/modelkit-imdb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Cornerstone-OnDemand/modelkit-imdb","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cornerstone-OnDemand%2Fmodelkit-imdb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cornerstone-OnDemand%2Fmodelkit-imdb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cornerstone-OnDemand%2Fmodelkit-imdb/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/Cornerstone-OnDemand%2Fmodelkit-imdb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Cornerstone-OnDemand","download_url":"https://codeload.github.com/Cornerstone-OnDemand/modelkit-imdb/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Cornerstone-OnDemand%2Fmodelkit-imdb/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32042311,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","imdb-dataset","machine-learning","mlops","modelkit","natural-language-processing","nlp","python","rest-api","spacy","tensorflow"],"created_at":"2025-03-13T07:16:43.567Z","updated_at":"2026-04-20T10:05:39.849Z","avatar_url":"https://github.com/Cornerstone-OnDemand.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/clustree/modelkit\"\u003e\n    \u003cimg src=\"https://raw.githubusercontent.com/clustree/modelkit/main/.github/resources/logo.svg\" alt=\"Logo\" width=\"80\" height=\"80\" /\u003e\n\u003c/a\u003e\u003cspan style=\"font-size:30px; margin: 0px 20px 0px 10px; padding-bottom: 100px\"\u003ex\u003c/span\u003e\n\u003cimg 
src=\"https://upload.wikimedia.org/wikipedia/commons/6/69/IMDB_Logo_2016.svg\" width=\"100\" height=\"80\"/\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eHow to deploy an NLP pipeline\u003c/h1\u003e\n\u003ch3 align=\"center\"\u003eleveraging \u003ca href=\"https://github.com/clustree/modelkit\"\u003emodelkit\u003c/a\u003e and the IMDB reviews dataset\u003c/h3\u003e\n\n\u003ch4 align=\"center\"\u003e\n  \u003cem\u003eFeatures Covered: Installation, Project Organization, Assets Management, CLIs, REST API Serving\u003c/em\u003e\n\u003c/h4\u003e\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/clustree/modelkit-imdb/actions?query=branch%3Amain+\"\u003e\u003cimg src=\"https://img.shields.io/github/workflow/status/clustree/modelkit-imdb/CI/main\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/clustree/modelkit-imdb/actions/workflows/main.yml?query=branch%3Amain+\"\u003e\u003cimg src=\"docs/badges/tests.svg\" /\u003e\u003c/a\u003e\n  \u003ca href=\"https://clustree.github.io/modelkit-imdb/coverage/index.html\"\u003e\u003cimg src=\"docs/badges/coverage.svg\" /\u003e\u003c/a\u003e\n\u003cimg src=\"https://img.shields.io/static/v1?label=python\u0026message=3.7\u0026color=blue\" /\u003e\n  \u003ca href=\"https://github.com/clustree/modelkit-imdb/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/github/license/clustree/modelkit-imdb\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nThis sample project illustrates modelkit's powerful features, based on the documentation tutorial: [NLP x Sentiment Analysis](https://clustree.github.io/modelkit/examples/nlp_sentiment/intro.html).\n\nIt also serves as a sandbox for any developer who wants to try out modelkit in _real conditions_, from package organization and the CLIs to HTTP serving and the GitHub CI.\n\n## TL;DR\n\nJump to the result hosted on `Heroku` at https://modelkit-imdb.herokuapp.com/docs/.\n\n## Installation\n\nFirst, please `source .env` 
or run the following:\n\n```bash\nexport MODELKIT_ASSETS_DIR=.local_storage\nexport MODELKIT_STORAGE_BUCKET=.\nexport MODELKIT_STORAGE_PREFIX=.remote_storage\nexport MODELKIT_STORAGE_PROVIDER=local\nexport MODELKIT_DEFAULT_PACKAGE=modelkit_imdb  # shortcut for CLIs\n```\n\nOnce done, let's create a new Python virtual environment and install the dev requirements:\n\n```bash\npip install -r requirements-dev.txt\n```\n\n## Models listing\n\nBefore going further, let's list the models available in the `modelkit_imdb` package with the following CLI:\n\n```bash\nmodelkit describe\n```\n\n\u003cimg src=\".github/resources/describe.gif\" style=\"margin: 20px\" /\u003e\n\n## Project Organization\n\nmodelkit encourages you to organize your project as a Python package to benefit from Python's software tooling and a clear structure.\n\nHence, this sample project was first arranged following this minimal Python package structure:\n```\nmodelkit-imdb  # project directory\n├── modelkit_imdb  # the python package name\n│   ├── __init__.py  # the different models implemented\n│   ├── classifiers.py\n│   ├── tokenizers.py\n│   └── vectorizers.py\n├── requirements.txt\n├── setup.cfg  # setup configurations for our package\n├── setup.py\n└── tests  # tests outside the modelkit_imdb package\n    ├── __init__.py\n    ├── conftest.py\n    └── test_auto_testing.py\n```\n\nThis way, your package can easily be shared and set up across your organization using pip, via an artifactory or git:\n```\npip install git+https://github.com/clustree/modelkit-imdb.git\n```\n\nIt can then be used as a regular Python package, with modelkit's support:\n```python\nimport modelkit_imdb\n\nfrom modelkit import ModelLibrary\n\n# use the ModelLibrary to automatically discover and load models from modelkit_imdb\n\nlibrary = ModelLibrary(models=modelkit_imdb)\ntokenizer = 
library.get(\"imdb_tokenizer\")\ntokenizer.predict(\"I love this movie!\")\n```\n\nIn addition to this minimal Python package structure, several files were also added for deployment and the GitHub CI:\n- `Dockerfile` \u0026 `heroku.yml`: to automatically deploy the `main` branch on Heroku once tests pass\n- `Makefile`: to remind us how to lint, test or compute the coverage rate\n- `.gitignore`: to keep unwanted files out of version control\n- `noxfile.py`: to run tests on multiple Python environments, useful for the CI\n- etc.\n\nYou are now all set to write your machine learning models following software engineering's and modelkit's best practices.\n\n## Assets Management\n\nFor this sample project, we emulated a remote storage locally at `.remote_storage`, which contains all the different artifacts created so far:\n- `vocabulary.txt`: for the `imdb_vectorizer`\n- `model.h5`: for the `imdb_classifier`\n\nIn production, you may want to use `AWS S3`, `GCS` or any other (safely) configured remote storage: modelkit provides drivers to read from and write to those providers directly (you can also write your own!).\n\nWhen using a model, modelkit automatically retrieves its different assets _from_ the remote storage, and safely caches them _to_ a local storage.\n\nTo better understand how it all works (modelkit's versioning and Assets Management), let's restart from scratch.\n\nMake sure the `.env` file was sourced to set the environment variables needed by modelkit.\n\nFirst, let's grab the different assets from the current local storage before dropping both it and the remote storage:\n\n```\nmkdir -p tmp/classifier tmp/vectorizer\ncp .local_storage/imdb/classifier/0.0/model.h5 tmp/classifier\ncp .local_storage/imdb/vectorizer/0.0/vocabulary.txt tmp/vectorizer\nrm -rf .local_storage/imdb .local_storage/.cache .remote_storage/imdb .remote_storage/.cache\n```\n\nNow, let's use the `modelkit` assets management CLI to version our two assets along with 
their directories, under the `imdb` namespace:\n```\n# modelkit assets new [PATH] [NAMESPACE/NAME]\n\nmodelkit assets new tmp/vectorizer imdb/vectorizer\n\n# Current assets manager:\n#  - storage provider = `\u003cLocalStorageDriver bucket=.\u003e`\n#  - bucket = `.`\n#  - prefix = `.remote_storage`\n# Current asset: `imdb/vectorizer`\n#  - name = `imdb/vectorizer`\n# Push a new asset `imdb/vectorizer` with version `0.0`?\n# [y/N]: y\n```\n\n```\nmodelkit assets new tmp/classifier imdb/classifier\n# ...\n```\n\nThe newly versioned assets are now located under the `.remote_storage` directory, along with their version and metadata.\n\nThey can now be used as part of modelkit models, in the `CONFIGURATIONS` map:\n\n```python\nclass Classifier(modelkit.Model[MovieReviewItem, MovieSentimentItem]):\n    CONFIGURATIONS = {\n        \"imdb_classifier\": {\n            \"asset\": \"imdb/classifier:0.0[/model.h5]\",  # namespace/name:version[subfile]\n            \"model_dependencies\": {\n                \"tokenizer\": \"imdb_tokenizer\",\n                \"vectorizer\": \"imdb_vectorizer\",\n            },\n        }\n    }\n```\n\nAs you can see, they are *pinned* to a given version so that you can freely update them without hurting production:\n\n```\nmodelkit assets update tmp/classifier imdb/classifier\n\n# Current assets manager:\n#  - storage provider = `\u003cLocalStorageDriver bucket=.\u003e`\n#  - bucket = `.`\n#  - prefix = `.remote_storage`\n# Current asset: `imdb/classifier`\n#  - name = `imdb/classifier`\n#  - major version = `None`\n#  - minor version (ignored) = `None`\n# Found a total of 1 versions (1 major versions) for `imdb/classifier`\n#  - major `0` = 0.0\n# Push a new asset version `0.1` for `imdb/classifier`?\n# [y/N]: \n```\n\nThey are then retrieved and cached in `.local_storage` once requested by a model:\n\n```python\nfrom modelkit import ModelLibrary\n\nlib = ModelLibrary(models=\"modelkit_imdb\")  # package or path to the package\nclassifier = 
lib.get(\"imdb_classifier\")\nclassifier.predict({\"text\": \"I love this movie so much\"})\n# MovieSentimentItem(label='good', score=0.6999041438102722)\n```\n\nThat's it! Make sure to clean up the `./tmp` folder before leaving.\n\n## Tests\n\nLet's make sure everything works as intended by running some tests.\n```bash\npytest\n```\n\nAs you can see in the `tests/` subfolder, two things were added:\n- in `conftest.py`: a pytest fixture `model_library` which creates a `ModelLibrary` with all the models found in the package\n```python\nimport modelkit_imdb\nfrom modelkit.testing.fixtures import modellibrary_fixture\n\nmodellibrary_fixture(\n    models=modelkit_imdb,\n    fixture_name=\"model_library\",\n)\n```\n- in `test_auto_testing.py`: a test which iterates through all `modelkit_imdb` models to find tests and run them, using the just-defined `model_library` fixture:\n```python\nimport modelkit_imdb\nfrom modelkit.testing import modellibrary_auto_test\n\nmodellibrary_auto_test(\n    models=modelkit_imdb,\n    fixture_name=\"model_library\"\n)\n```\n\nFor more info, head over to [Testing](https://clustree.github.io/modelkit/library/models/testing.html).\n\n## HTTP serving\n\nThe following CLI starts a single worker that exposes all the models found under the `modelkit_imdb` package, leveraging [uvicorn](https://www.uvicorn.org/) and [FastAPI](https://fastapi.tiangolo.com/):\n\n```bash\nmodelkit serve\n```\n\nVoilà: the [uvicorn](https://www.uvicorn.org/) worker is now running at `http://localhost:8000`.\n\nmodelkit also provides out-of-the-box support for [gunicorn](https://docs.gunicorn.org/en/stable/):\n```bash\ngunicorn --workers 4 \\\n         --bind 0.0.0.0:8000 \\\n         --preload \\\n         --worker-class=uvicorn.workers.UvicornWorker \\\n         'modelkit.api:create_modelkit_app()'\n```\n\nCheck out the generated `SwaggerUI` at http://localhost:8000/docs to see all the endpoints and try them out:\n\n\u003cimg src=\".github/resources/swagger.gif\" alt=\"modelkit swagger\" style=\"margin: 
20px\"\u003e\n\nOf course, you can also `POST` your request to the endpoint of your choice:\n\n```bash\ncurl -X 'POST' \\\n  'http://localhost:8000/predict/imdb_classifier' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n  \"text\": \"This movie sucks! It is the worst I have ever seen in my entire life\"\n}'\n# {\"label\":\"bad\",\"score\": 0.1530771553516388}\n```\n\n## Deployment example\n\nTo conclude this sample project, a minimal `Dockerfile` and a `heroku.yml` file were written to host our different models on `Heroku` at https://modelkit-imdb.herokuapp.com/docs/.\n\nYou can run it locally using [Docker](https://www.docker.com/) and enjoy the Swagger UI at http://localhost:8000/docs:\n\n```bash\ndocker build -t modelkit-imdb .\ndocker run -p 8000:8000 -e PORT=8000 modelkit-imdb\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcornerstone-ondemand%2Fmodelkit-imdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcornerstone-ondemand%2Fmodelkit-imdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcornerstone-ondemand%2Fmodelkit-imdb/lists"}