{"id":18743077,"url":"https://github.com/kxsystems/ml","last_synced_at":"2025-08-22T06:43:30.257Z","repository":{"id":46087564,"uuid":"151331160","full_name":"KxSystems/ml","owner":"KxSystems","description":"Machine-learning toolkit","archived":false,"fork":false,"pushed_at":"2024-12-24T00:57:09.000Z","size":61494,"stargazers_count":65,"open_issues_count":6,"forks_count":30,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-08-18T13:46:00.424Z","etag":null,"topics":["fresh","kdb","machine-learning","python","q"],"latest_commit_sha":null,"homepage":"https://code.kx.com/q/ml","language":"q","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KxSystems.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-02T22:21:54.000Z","updated_at":"2025-08-17T07:14:59.000Z","dependencies_parsed_at":"2024-06-26T16:07:38.078Z","dependency_job_id":"3ef450a1-8c27-4a88-a5a3-5390fd622df4","html_url":"https://github.com/KxSystems/ml","commit_stats":{"total_commits":319,"total_committers":17,"mean_commits":"18.764705882352942","dds":0.7178683385579938,"last_synced_commit":"5509fa6cfc454c68bf3441672fe1a26cb5a19088"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"purl":"pkg:github/KxSystems/ml","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KxSystems%2Fml","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KxSystems%2Fml/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KxSystems%2Fml/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KxSystems%2Fml/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KxSystems","download_url":"https://codeload.github.com/KxSystems/ml/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KxSystems%2Fml/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271599738,"owners_count":24787801,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fresh","kdb","machine-learning","python","q"],"created_at":"2024-11-07T16:09:56.690Z","updated_at":"2025-08-22T06:43:30.196Z","avatar_url":"https://github.com/KxSystems.png","language":"q","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Machine Learning Toolkit\n\nThe Machine Learning Toolkit is a comprehensive suite designed to empower kdb+/q users with advanced machine learning capabilities. It offers a robust and flexible framework for addressing a wide range of tasks, including time series analysis, natural language processing, and automated machine learning. 
By integrating seamlessly with kdb+/q, the toolkit facilitates efficient data handling and processing, leveraging both traditional machine learning techniques and modern NLP models.\n\nThe repository is structured as three modules: ml and nlp can each be used independently for their respective feature sets [as further described below](#components); automl builds upon ml and nlp to deliver automated machine learning capabilities.\n\n\u003c!-- ## Getting started\n\nTo get up and running quickly, start by pulling the Docker image, which comes pre-installed with all dependencies specified in requirements_pinned.txt. This allows you to dive straight into trying out our [examples](examples/) and exploring the toolkit's capabilities without the need for additional setup.\n\n```bash\ngit clone https://github.com/KxSystems/ml.git ml\ndocker pull \u003cimage\u003e\ndocker run -itv ./ml:/ml -e QLIC_K4=$(cat $QHOME/k4.lic | base64 -w0) --entrypoint /bin/bash \u003cimage\u003e\n\n# Now within the container, source the initial environment setup script\ncd /ml\nsource scripts/setup.sh\nsource scripts/pykx.sh # Switch from embedpy to pykx (optionally continue with embedpy)\nsource scripts/link.sh # Install the toolkit into your selected QHOME\n\n# Now simply start q Load and work with the desired components in q\nrlwrap q\nq)\\l nlp/nlp.q\nq).nlp.loadfile`:init.q\nLoading init.q\nLoading code/utils.q\nLoading code/regex.q\nLoading code/sent.q\nLoading code/parser.q\nLoading code/time.q\nLoading code/date.q\nLoading code/email.q\nLoading code/cluster.q\nLoading code/nlp_code.q\nq).nlp.findTimes\"I went to work at 9:00am and had a coffee at 10:20\"  # See examples/ for more advanced usage.\n09:00:00.000 \"9:00am\" 18 24\n10:20:00.000 \"10:20\"  45 50\nq)\n``` --\u003e\n\n### Requirements\n\n- kdb+ \u003e= 3.5 64-bit\n\nThe Python packages required to allow successful execution of all functions within the machine learning toolkit can be installed via:\n\npip:\n```bash\npip install -r 
requirements.txt\n```\n\nor via conda:\n```bash\nconda install --file requirements.txt\n```\n\nAlternatively, use `requirements_pinned.txt` for a fully resolved, pinned \u0026 known working set of dependencies, or a module-specific requirements.txt (e.g. ml/requirements.txt) when only utilizing a subset of the toolkit.\n\nWhile the nlp framework may be used with other models, automl and the nlp tests use en_core_web_sm. You can download this after installing the Python requirements like so:\n```bash\npython -m spacy download en_core_web_sm\n```\n\n\u003c!-- //! optional reqs for automl --\u003e\n\n\n### Installation\n\nTo install, simply copy or link the desired components to your `$QHOME` directory, for example: `cp -r {ml,nlp,automl} $QHOME/`.\n\nTo load all functionality into the `.automl`, `.ml`, and `.nlp` namespaces, run the following from q:\n```q\n\\l automl/automl.q\n.automl.loadfile`:init.q\n```\n\n* To load only specific modules, replace automl with ml or nlp in the commands above.\n\nOnce installed, you can explore the toolkit's capabilities by trying out our [examples](examples/).\n\n\n\u003c!-- ### Examples   //! currently outdated\n\nExamples showing implementations of several components of this toolkit can be found [here](https://github.com/KxSystems/mlnotebooks/). 
These notebooks include examples of the following sections of the toolkit.\n\n*  Pre-processing functions\n*  Implementations of the FRESH algorithm\n*  Cross validation and grid search capabilities\n*  Results Scoring functionality\n*  Clustering methods applied to datasets\n*  Timeseries modeling examples --\u003e\n\n\n## Components\n### ml\nThis library contains functions that cover the following areas:\n- An implementation of the FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm for use in the extraction of features from time series data and the reduction in the number of features through statistical testing.\n- Cross-validation and grid-search functions allowing for testing of the stability of models to changes in the volume of data or the specific subsets of data used in training.\n- Clustering algorithms used to group data points and to identify patterns in their distributions. The algorithms make use of a k-dimensional tree to store points and scoring functions to analyze how well they performed.\n- Statistical timeseries models and feature-extraction techniques used for the application of machine learning to timeseries problems. 
These models allow for the forecasting of the future behavior of a system under various conditions.\n- Numerical techniques for calculating the optimal parameters for an objective function.\n- A graphing and pipeline library for the creation of modularized executable workflows based on a structure described by a mathematical directed graph.\n- Utility functions relating to areas including statistical analysis, data preprocessing and array manipulation.\n- A multi-processing framework to parallelize work across many cores or nodes.\n- Functions for integration with PyKX or EmbedPy, which ensure seamless interoperability between Python and kdb+/q in either environment.\n- A location for the storage and versioning of ML models on-prem along with a common model retrieval API allowing models regardless of underlying requirements to be retrieved and used on kdb+ data. This allows for enhanced team collaboration opportunities and management oversight by centralising team work to a common storage location.\n\nThese sections are explained in greater depth within the [FRESH](ml/docs/fresh.md), [cross validation](ml/docs/xval.md), [clustering](ml/docs/clustering/algos.md), [timeseries](ml/docs/timeseries/README.md), [optimization](ml/docs/optimize.md), [graph/pipeline](ml/docs/graph/README.md), [utilities](ml/docs/utilities/metric.md) and [registry](ml/docs/registry/README.md) documentation.\n\n\n### nlp\n\nThe natural language processing (NLP) module allows users to parse datasets using a spaCy model from Python, which runs tokenisation, sentence detection, part-of-speech tagging and lemmatization. In addition to parsing, users can cluster text documents using different clustering algorithms such as MCL, K-means and radix. You can also run sentiment analysis, which indicates whether a word has a positive or negative sentiment.\n\n\u003c!-- //! docs? 
old link is dead: Documentation is available on the [nlp](https://code.kx.com/v2/ml/nlp/) homepage.--\u003e\n\n\n### automl\n\nThe automated machine learning library described here is built on top of ml \u0026 nlp. The purpose of this framework is to help you automate the process of applying machine learning techniques to real-world problems. In the absence of expert machine-learning engineers, this handles the following processes within a traditional workflow.\n\n- Data preprocessing\n- Feature engineering and feature selection\n- Model selection\n- Hyperparameter tuning\n- Report generation and model persistence\n\nEach of these steps is outlined in depth within the [documentation](automl/docs).\n\n\u003c!--\n## Building the docker images\n\n### preflight\nYou will need [Docker installed](https://www.docker.com/community-edition) on your workstation; make sure it is a recent version.\n\nCheck out a copy of the project with `git clone https://github.com/KxSystems/ml.git`.\n\n### building\n\nTo build the project locally:\n\n```bash //! improve\ndocker build -t registry.gitlab.com/kxdev/kxinsights/data-science/ml-tools/automl:embedpy-gcc-deb12 -f docker/Dockerfile .\ndocker build -t myimage:mytag -f docker/Dockerfile .\n``` --\u003e\n\n\u003c!-- **N.B.** if you wish to use an alternative source for [embedPy](https://github.com/KxSystems/embedPy) then you can append `--build-arg embedpy_img=embedpy` to your argument list. --\u003e\n\n\u003c!-- Other build arguments are supported and you should browse the `Dockerfile` to see what they are. --\u003e\n\n\u003c!-- Once built, you should have a local image which you can run with as shown in the \"Getting started\" section above. --\u003e\n\n\u003c!-- ### Deploy //! 
outdated\n\n[travisCI](https://travis-ci.org/) is configured to monitor when tags of the format `/^[0-9]+\\./` are added to the [GitHub hosted project](https://github.com/KxSystems/ml), a corresponding Docker image is generated and made available on [Docker Cloud](https://cloud.docker.com/)\n\nThis is all done server side as the resulting image is large.\n\nTo do a deploy, you simply tag and push your releases as usual:\n```bash\ngit push\ngit tag 0.7\ngit push --tag\n``` --\u003e\n\n\n## Status\n\nThe Machine Learning Toolkit is provided here under an Apache 2.0 license.\n\nIf you find issues with the interface or have feature requests, please [raise an issue](https://github.com/KxSystems/ml/issues).\n\nTo contribute to this project, please follow the [contributing guide](CONTRIBUTING.md).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkxsystems%2Fml","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkxsystems%2Fml","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkxsystems%2Fml/lists"}