{"id":14970680,"url":"https://github.com/rclement/datasette-ml","last_synced_at":"2025-10-26T13:31:10.533Z","repository":{"id":153483256,"uuid":"618792846","full_name":"rclement/datasette-ml","owner":"rclement","description":"A Datasette plugin providing an MLOps platform to train, eval and predict machine learning models","archived":false,"fork":false,"pushed_at":"2024-09-10T13:18:23.000Z","size":653,"stargazers_count":15,"open_issues_count":11,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-09-28T13:23:04.645Z","etag":null,"topics":["ai","datasette","datasette-plugin","machine-learning","mlops","python","scikit-learn","sql","sqlite"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rclement.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-25T11:18:48.000Z","updated_at":"2024-07-02T17:18:20.000Z","dependencies_parsed_at":"2023-05-27T15:15:31.746Z","dependency_job_id":"c1830ccf-811e-4147-93f1-92f4d4cc44dc","html_url":"https://github.com/rclement/datasette-ml","commit_stats":{"total_commits":92,"total_committers":3,"mean_commits":"30.666666666666668","dds":0.4565217391304348,"last_synced_commit":"6f0077e3c877e58736d5574def6ea9f3a4a56ca2"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rclement%2Fdatasette-ml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rclement%2Fdatasette-ml/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rclement%2Fdatasette-ml/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rclement%2Fdatasette-ml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rclement","download_url":"https://codeload.github.com/rclement/datasette-ml/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219862887,"owners_count":16555951,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","datasette","datasette-plugin","machine-learning","mlops","python","scikit-learn","sql","sqlite"],"created_at":"2024-09-24T13:43:58.949Z","updated_at":"2025-10-26T13:31:10.131Z","avatar_url":"https://github.com/rclement.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Datasette ML\n\n\u003e Bringing Machine Learning models near your data, not the other way around!\n\nDatasette ML is a [Datasette](https://datasette.io) plugin providing an MLOps\nplatform to train, evaluate and make predictions from machine learning models.\n\nAll the underlying features are provided by [`sqlite-ml`](https://github.com/rclement/sqlite-ml).\n\n[![PyPI](https://img.shields.io/pypi/v/datasette-ml.svg)](https://pypi.org/project/datasette-ml/)\n[![CI/CD](https://github.com/rclement/datasette-ml/actions/workflows/ci-cd.yml/badge.svg)](https://github.com/rclement/datasette-ml/actions/workflows/ci-cd.yml)\n[![Coverage 
Status](https://img.shields.io/codecov/c/github/rclement/datasette-ml)](https://codecov.io/gh/rclement/datasette-ml)\n[![License](https://img.shields.io/github/license/rclement/datasette-ml)](https://github.com/rclement/datasette-ml/blob/master/LICENSE)\n\n\u003c!-- Try out a live demo at [https://datasette-ml-demo.vercel.app](https://datasette-ml-demo.vercel.app/-/dashboards) --\u003e\n\n**WARNING**: this plugin is still experimental and not ready for production.\nBreaking changes might happen between releases before a stable version is reached.\nUse it at your own risk!\n\n\u003c!-- ![Datasette ML Demo](https://raw.githubusercontent.com/rclement/datasette-ml/master/demo/datasette-ml-demo.png) --\u003e\n\n## Installation\n\nInstall this plugin in the same environment as Datasette:\n\n```bash\n$ datasette install datasette-ml\n```\n\n## Usage\n\nDefine configuration within `metadata.yml` / `metadata.json`:\n\n```yaml\nplugins:\n  datasette-ml:\n    db: sqml\n```\n\nA new menu entry is now available, pointing at `/-/ml` to access the MLOps dashboard.\n\n### Configuration properties\n\n| Property | Type     | Description                                     |\n| -------- | -------- | ----------------------------------------------- |\n| `db`     | `string` | Database to store ML models (default is `sqml`) |\n\n## Tutorial\n\nUsing `datasette-ml` you can start training Machine Learning models directly\nalongside your data, simply by using custom SQL functions! Let's get started by\ntraining a classifier against the famous \"Iris Dataset\" to predict flower types.\n\n### Loading the dataset\n\nFirst let's load our data. For a real world project, your data may live in its\nown table or be accessed through an SQL view. 
For the purpose of this tutorial,\nwe can use the `sqml_load_dataset` function to load\n[standard Scikit-Learn datasets](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets):\n\n```sql\nSELECT sqml_load_dataset('iris') AS dataset;\n```\n\nIt will return the following data:\n\n| dataset |\n| --- |\n| {\"table\": \"dataset_iris\", \"feature_names\": [\"sepal length (cm)\", \"sepal width (cm)\", \"petal length (cm)\", \"petal width (cm)\"], \"target_names\": [\"setosa\", \"versicolor\", \"virginica\"], \"size\": 150} |\n\nThe Iris dataset is loaded into a table named `dataset_iris`,\ncontaining 150 examples, 4 features and 3 classes to be predicted.\n\n### Training a classifier\n\nNow that our dataset is ready, let's train a first machine learning model to\nperform a classification task using the `sqml_train` function:\n\n```sql\nSELECT sqml_train(\n  'Iris prediction',\n  'classification',\n  'logistic_regression',\n  'dataset_iris',\n  'target'\n) AS training;\n```\n\nIt will return the following data:\n\n| training |\n| --- |\n| {\"experiment_name\": \"Iris prediction\", \"prediction_type\": \"classification\", \"algorithm\": \"logistic_regression\", \"deployed\": true, \"score\": 0.9473684210526315} |\n\nWe have just trained our first machine learning model! 
The output data informs us\nthat our model has been trained, yields a score of about 0.947 and has been deployed.\n\n### Performing predictions\n\nNow that we have trained our classifier, let's use it to make predictions!\n\nPredict the target label for the first row of `dataset_iris` using the\n`sqml_predict` function:\n\n```sql\nSELECT\n  dataset_iris.*,\n  sqml_predict(\n    'Iris prediction',\n    json_object(\n      'sepal length (cm)', [sepal length (cm)],\n      'sepal width (cm)', [sepal width (cm)],\n      'petal length (cm)', [petal length (cm)],\n      'petal width (cm)', [petal width (cm)]\n    )\n  ) AS prediction\nFROM dataset_iris\nLIMIT 1;\n```\n\nThis will output the following data:\n\n| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | prediction |\n| --- | --- | --- | --- | --- | --- |\n| 5.1 | 3.5 | 1.4 | 0.2 | 0.0 | 0.0 |\n\nYay! Our prediction matches the target label!\n\nLet's see if we can find some predictions that do not match the target label.\nTo perform lots of predictions, we will use `sqml_predict_batch` which is more\nefficient than `sqml_predict`:\n\n```sql\nSELECT\n  dataset_iris.*,\n  batch.value AS prediction,\n  dataset_iris.target = batch.value AS match\nFROM\n  dataset_iris\n  JOIN json_each (\n    (\n      SELECT\n        sqml_predict_batch(\n          'Iris prediction',\n          json_group_array(\n            json_object(\n              'sepal length (cm)', [sepal length (cm)],\n              'sepal width (cm)', [sepal width (cm)],\n              'petal length (cm)', [petal length (cm)],\n              'petal width (cm)', [petal width (cm)]\n            )\n          )\n        )\n      FROM\n        dataset_iris\n    )\n  ) batch ON (batch.rowid + 1) = dataset_iris.rowid\nWHERE match = FALSE;\n```\n\nThis will yield the following output data:\n\n| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | prediction | match |\n| --- | --- | --- | --- | --- | --- | 
--- |\n| 5.9 | 3.2 | 4.8 | 1.8 | 1.0 | 2.0 | 0 |\n| 6.7 | 3.0 | 5.0 | 1.7 | 1.0 | 2.0 | 0 |\n| 6.0 | 2.7 | 5.1 | 1.6 | 1.0 | 2.0 | 0 |\n| 4.9 | 2.5 | 4.5 | 1.7 | 2.0 | 1.0 | 0 |\n\nOh no! 4 predictions do not match the correct target label!\n\nLet's see if we can train a better algorithm to enhance the prediction quality.\n\n### Training a new model\n\nLet's use a Support Vector Machine algorithm, which often yields better results\nthan the simpler Logistic Regression:\n\n```sql\nSELECT sqml_train(\n  'Iris prediction',\n  'classification',\n  'svc',\n  'dataset_iris',\n  'target'\n) AS training;\n```\n\nThis will yield the following data:\n\n| training |\n| --- |\n| {\"experiment_name\": \"Iris prediction\", \"prediction_type\": \"classification\", \"algorithm\": \"svc\", \"deployed\": true, \"score\": 0.9736842105263158} |\n\nWe can already see that the score of this new model is higher than the previous one and that it has been deployed.\n\nLet's try our new classifier on the same dataset:\n\n```sql\nSELECT\n  dataset_iris.*,\n  batch.value AS prediction,\n  dataset_iris.target = batch.value AS match\nFROM\n  dataset_iris\n  JOIN json_each (\n    (\n      SELECT\n        sqml_predict_batch(\n          'Iris prediction',\n          json_group_array(\n            json_object(\n              'sepal length (cm)', [sepal length (cm)],\n              'sepal width (cm)', [sepal width (cm)],\n              'petal length (cm)', [petal length (cm)],\n              'petal width (cm)', [petal width (cm)]\n            )\n          )\n        )\n      FROM\n        dataset_iris\n    )\n  ) batch ON (batch.rowid + 1) = dataset_iris.rowid\nWHERE match = FALSE;\n```\n\nThis will lead to the following results:\n\n| sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | prediction | match |\n| --- | --- | --- | --- | --- | --- | --- |\n| 5.9 | 3.2 | 4.8 | 1.8 | 1.0 | 2.0 | 0 |\n| 6.7 | 3.0 | 5.0 | 1.7 | 1.0 | 2.0 | 0 |\n| 6.0 | 2.7 | 5.1 
| 1.6 | 1.0 | 2.0 | 0 |\n\nYay! We managed to predict one more target label correctly with this new model!\n\nAlso note that we did not have to do anything to switch to the better model:\nexactly the same query is used to perform the prediction without having to\nspecify anything about the new model! This is because new models are deployed\nautomatically for the current experiment only if their score outperforms the\nscore of the previously deployed model.\n\n### SQL functions\n\nThis plugin registers a few SQL functions to perform machine learning model training and predictions:\n\n`sqml_load_dataset(name, table)`\n- `name: str`: name of the dataset to load\n- `table: str`: (optional) custom table name destination for the dataset\n\n`sqml_train(experiment_name, prediction_type, algorithm, dataset, target, test_size, split_strategy)`\n- `experiment_name: str`: name of the experiment to train the model within\n- `prediction_type: str`: prediction task type to be performed for this experiment (`regression`, `classification`)\n- `algorithm: str`: algorithm type to be trained\n- `dataset: str`: name of the table or view containing the dataset\n- `target: str`: name of the column to be treated as target label\n- `test_size: float`: (optional) dataset test size ratio (default is `0.25`)\n- `split_strategy: str`: (optional) dataset train/test split strategy (default is `shuffle`)\n\n`sqml_predict(experiment_name, features)`\n- `experiment_name: str`: name of the experiment whose deployed model is used for the prediction\n- `features: json object`: JSON object containing the features\n\n`sqml_predict_batch(experiment_name, features)`\n- `experiment_name: str`: name of the experiment whose deployed model is used for the predictions\n- `features: json list`: JSON list containing all feature objects\n\n## Development\n\nTo set up this plugin locally, first check out the code.\nThen create a new virtual environment and install the required dependencies:\n\n```bash\npoetry shell\npoetry install\n```\n\nTo run the QA suite:\n\n```bash\nblack 
--check datasette_ml tests\nflake8 datasette_ml tests\nmypy datasette_ml tests\npytest -v --cov=datasette_ml --cov=tests --cov-branch --cov-report=term-missing tests\n```\n\n## Demo\n\nWith the development environment set up, you can run the demo locally:\n\n```bash\npython demo/generate.py\ndatasette --metadata demo/metadata.yml demo/sqml.db\n```\n\n## Inspiration\n\nAll the things on the internet that have inspired this project:\n\n- [PostgresML](https://postgresml.org)\n- [MLFlow](https://mlflow.org)\n- [SQLite Run-Time Loadable Extensions](https://www.sqlite.org/loadext.html)\n- [Alex Garcia's `sqlite-loadable-rs`](https://github.com/asg017/sqlite-loadable-rs)\n- [Alex Garcia's SQLite extensions](https://github.com/asg017)\n- [Alex Garcia, \"Making SQLite extensions pip install-able\"](https://observablehq.com/@asg017/making-sqlite-extensions-pip-install-able)\n- [Max Halford, \"Online gradient descent written in SQL\"](https://maxhalford.github.io/blog/ogd-in-sql/)\n- [Ricardo Anderegg, \"Extending SQLite with Rust\"](https://ricardoanderegg.com/posts/extending-sqlite-with-rust/)\n\n## License\n\nLicensed under Apache License, Version 2.0\n\nCopyright (c) 2023 - present Romain Clement\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frclement%2Fdatasette-ml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frclement%2Fdatasette-ml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frclement%2Fdatasette-ml/lists"}