{"id":13574379,"url":"https://github.com/oneapi-src/intelligent-indexing","last_synced_at":"2025-04-04T15:30:45.787Z","repository":{"id":66145925,"uuid":"506429927","full_name":"oneapi-src/intelligent-indexing","owner":"oneapi-src","description":"AI Starter Kit for Intelligent Indexing of Incoming Correspondence using Intel® Extension for Scikit-learn*","archived":true,"fork":false,"pushed_at":"2024-02-01T23:51:25.000Z","size":267,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-05T09:44:32.595Z","etag":null,"topics":["machine-learning","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oneapi-src.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-06-22T23:00:25.000Z","updated_at":"2024-04-22T14:15:07.000Z","dependencies_parsed_at":"2024-02-13T00:49:35.001Z","dependency_job_id":null,"html_url":"https://github.com/oneapi-src/intelligent-indexing","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fintelligent-indexing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fintelligent-indexing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-src%2Fintelligent-indexing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oneapi-sr
c%2Fintelligent-indexing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oneapi-src","download_url":"https://codeload.github.com/oneapi-src/intelligent-indexing/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247202538,"owners_count":20900797,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","scikit-learn"],"created_at":"2024-08-01T15:00:51.074Z","updated_at":"2025-04-04T15:30:40.776Z","avatar_url":"https://github.com/oneapi-src.png","language":"Python","readme":"PROJECT NOT UNDER ACTIVE MANAGEMENT\n\nThis project will no longer be maintained by Intel.\n\nIntel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.  \n\nIntel no longer accepts patches to this project.\n\nIf you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.  \n\nContact: webadmin@linux.intel.com\n# **Intelligent Indexing**\n\n## Introduction\n\nMany industries ingest massive volumes of complex documents and must utilize manual processes to both understand the contents of and route them to the relevant parties. 
AI-based Natural Language Processing (NLP) solutions for classifying documents are one way to automate this process, saving substantial labor, time, and cost while still maintaining human-level performance.\n\nThis example demonstrates one way of building an NLP pipeline for classifying documents by topic and describes how we can leverage the [Intel® oneAPI AI Analytics Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html) (oneAPI) to accelerate the pipeline.\n\n## Solution Technical Overview\n\nMethodologically, the use case trains a Support Vector Classifier (SVC) for multiclass classification that ingests a body of text and outputs the predicted topic of the document. At deployment, natural text is first mapped into Term Frequency-Inverse Document Frequency (TFIDF) vectors, which are then fed into the trained SVC to predict the topic of the original text. SVC is a commonly and historically used algorithm for building powerful NLP classifiers using Machine Learning (ML) due to its ability to tackle the highly non-linear and complex relationships often found in text documents [[1]](#joachims_1998)[[2]](#manning_2010). Given recent advancements in NLP-based solutions, it can be seen as a starting point before considering more advanced Deep Learning (DL) based NLP algorithms.\n\n![workflow](assets/workflow.png)\n\nThe savings gained from using [Intel® oneAPI Data Analytics Library](https://www.intel.com/content/www/us/en/developer/tools/oneapi/onedal.html) (oneDAL) can result in more efficient model training and inference, leading to more robust Artificial Intelligence (AI) powered systems.\n\noneDAL is used to achieve quick results even when the data for a model are huge. 
It provides the capability to reuse the code present in different languages so that the hardware utilization is optimized to provide these results.\n\nThe solution contained in this repo uses the following Intel® packages:\n\n* ***Intel® Distribution for Python****\n\n    The [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html#gs.52te4z) provides:\n\n  * Scalable performance using all available CPU cores on laptops, desktops, and powerful servers\n  * Support for the latest CPU instructions\n  * Near-native performance through acceleration of core numerical and machine learning packages with libraries like the Intel® oneAPI Math Kernel Library (oneMKL) and Intel® oneAPI Data Analytics Library\n  * Productivity tools for compiling Python code into optimized instructions\n  * Essential Python bindings for easing integration of Intel® native tools with your Python* project\n\n* ***Intel® Distribution of Modin\\****\n\n    The [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html) is a performant, parallel, and distributed dataframe system that is designed around enabling data scientists to be more productive with the tools that they love. This library is fully compatible with the pandas API. It is powered by OmniSci* in the back end and provides accelerated analytics on Intel® platforms.\n\n    Top Benefits:\n    1. Drop-in acceleration to your existing pandas workflows.\n    2. No upfront cost to learning a new API.\n    3. Integrates with the Python* ecosystem.\n    4. 
Seamlessly scales across multicores with Ray* and Dask* clusters (run on and with what you have).\n\n* ***Intel® Extension for Scikit-learn\\****\n\n    Designed for data scientists, [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html) is a seamless way to speed up your Scikit-learn applications for machine learning to solve real-world problems. This extension package dynamically patches scikit-learn estimators to use Intel® oneAPI Data Analytics Library (oneDAL) as the underlying solver, speeding up your machine learning algorithms out of the box.\n\nFor more details, visit [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html), [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html) and [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html).\n\n## Solution Technical Details\n\nIn this section, we describe the data and how to replicate the results. The included code demonstrates a complete framework for:\n\n1. Setting up a virtual environment for Intel®-accelerated ML.\n2. Preprocessing data using Intel® Distribution of Modin* and NLTK*.\n3. Training an NLP model for text classification using Intel® Extension for Scikit-learn*.\n4. Predicting from the trained model on new data using Intel® Extension for Scikit-learn*.\n\n### Dataset\n\nThe dataset used for this demo is a set of ~200k news articles, with their respective topics, mined from the Huffington Post website and originally obtained from https://www.kaggle.com/datasets/rmisra/news-category-dataset.\n\n\u003e *Please see this data set's applicable license for terms and conditions. 
Intel does not own the rights to this data set and does not confer any rights to it.*\n\nThe included dataset is lightly preprocessed from the source above and split into train and test sets at an 85:15 ratio. To download and set up this dataset for benchmarking, follow the instructions listed [here](#download-the-dataset).\n\n## Validated Hardware Details\n\nThere are workflow-specific hardware and software setup requirements to run this use case.\n\n| Recommended Hardware\n| ----------------------------\n| CPU: Intel® 2nd Gen Xeon® Platinum 8280 CPU @ 2.70GHz or higher\n| RAM: 187 GB\n| Recommended Free Disk Space: 20 GB or more\n\n### Minimal Requirements\n\n* RAM: 32 GB total memory\n* CPUs: 8\n* Storage: 20GB\n* Operating system: Ubuntu\\* 22.04 LTS\n\n## How it Works\n\nTo demonstrate multiclass document classification using the `News Category Dataset`, we will build a model to predict the `category` of each news article based entirely on the `headline`, `short_description`, and `URL` of the given news article. In total, there are 42 unique categories, which are described [here](https://www.kaggle.com/datasets/rmisra/news-category-dataset).\n\n## Get Started\n\nStart by **defining an environment variable** that will store the workspace path. The path can be an existing directory or one created in later steps. 
This ENVVAR will be used for all the commands executed using absolute paths.\n\n[//]: # (capture: baremetal)\n\n```bash\nexport WORKSPACE=$PWD/intelligent-indexing\n```\n\nDefine `DATA_DIR` and `OUTPUT_DIR` as follows:\n\n[//]: # (capture: baremetal)\n\n```bash\nexport DATA_DIR=$WORKSPACE/data\nexport OUTPUT_DIR=$WORKSPACE/output\n```\n\n### Download the Workflow Repository\n\nCreate a working directory for the workflow and clone the [Intelligent Indexing](https://github.com/oneapi-src/intelligent-indexing) repository into your working directory.\n\n[//]: # (capture: baremetal)\n\n```bash\nmkdir -p $WORKSPACE \u0026\u0026 cd $WORKSPACE\n```\n\n```bash\ngit clone https://github.com/oneapi-src/intelligent-indexing.git $WORKSPACE\n```\n\n### Set up conda\n\n1. Download the appropriate Miniconda Installer for linux.\n\n    ```bash\n    wget -q https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\n    ```\n\n2. In your terminal window, run.\n\n    ```bash\n    bash Miniconda3-latest-Linux-x86_64.sh\n    ```\n\n3. Delete downloaded file.\n\n    ```bash\n    rm Miniconda3-latest-Linux-x86_64.sh\n    ```\n\nTo learn more about conda installation, see the [Conda Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).\n\n### Set Up Environment\n\nInstall and set the libmamba solver as default solver. 
Run the following commands:\n\n```bash\nconda install -n base conda-libmamba-solver -y\nconda config --set solver libmamba\n```\n\nThe `$WORKSPACE/env/intel_env.yml` file contains all dependencies needed to create the Intel environment for running the workflow.\n\n| **Packages required in YAML file**| **Version**\n| :---                              | :--\n| python                            | 3.10\n| intelpython3_core                 | 2024.0.0\n| scikit-learn-intelex              | 2024.0.1\n| modin-all                         | 0.24.1\n| nltk                              | 3.8.1\n| kaggle                            | 3.8.1\n\nExecute the following command to create the conda environment.\n\n```bash\nconda env create -f $WORKSPACE/env/intel_env.yml\n```\n\nEnvironment setup is required only once. This step does not clean up an existing environment with the same name, so make sure that no conda environment with that name exists. During this setup, the `intelligent_indexing_intel` conda environment is created with the dependencies listed in the YAML configuration.\n\nFinally, activate the `intelligent_indexing_intel` environment using the following command:\n\n```bash\nconda activate intelligent_indexing_intel\n```\n\n### Download the Dataset\n\nTo set up the data for benchmarking, do the following:\n\n1. Configure your [credentials](https://github.com/Kaggle/kaggle-api#api-credentials) and [proxies](https://github.com/Kaggle/kaggle-api#set-a-configuration-value).\n\n2. Download the data from https://www.kaggle.com/datasets/rmisra/news-category-dataset, save it to the data directory, and unzip it. 
This should produce a file called `News_Category_Dataset_v3.json`, which we will need to split and save into the required files.\n\n    ```bash\n    cd $DATA_DIR\n    kaggle datasets download -d rmisra/news-category-dataset\n    unzip news-category-dataset.zip \u0026\u0026 rm news-category-dataset.zip\n    ```\n\n\u003e *Please see this data set's applicable license for terms and conditions. Intel does not own the rights to this data set and does not confer any rights to it.*\n\n## Supported Runtime Environment\n\nYou can execute the reference pipelines using the following environments:\n\n* Bare Metal\n* Jupyter Notebook\n\n### Run Using Bare Metal\n\nFollow these instructions to set up and run this workflow on your own development system.\n\n#### Set Up System Software\n\nOur examples use the ``conda`` package and environment manager on your local computer. If you don't already have ``conda`` installed, go to [Set up conda](#set-up-conda) or see the [Conda Linux installation instructions](https://docs.conda.io/projects/conda/en/stable/user-guide/install/linux.html).\n\n#### Run Workflow\n\nTo run the benchmarks with ***Intel® oneAPI technologies***, the environment `intelligent_indexing_intel` should be activated using:\n\n```bash\nconda activate intelligent_indexing_intel\n```\n\n##### Setting up the data\n\nThe benchmarking script expects two files to be present in `data/huffpost`.\n\n* `data/huffpost/train_all.csv`: training data\n* `data/huffpost/test.csv`: testing data\n\nAfter downloading the data for benchmarking under these requirements, do the following:\n\n* Use the `process_data.py` script to generate the `huffpost/train_all.csv` and `huffpost/test.csv` files for benchmarking. 
This script expects `News_Category_Dataset_v3.json` to be present in the same directory.\n\n    [//]: # (capture: baremetal)\n\n    ```bash\n    cd $DATA_DIR\n    python process_data.py\n    ```\n\nAll of the benchmarking can be run using the Python script `src/run_benchmarks.py`.\n\nThe script **reads and preprocesses the data**, **trains an SVC model**, and **predicts on unseen test data** using the trained model, while also reporting the execution time for these 3 steps.\n\n\u003e Before running the script, we need to ensure that the appropriate conda environment is activated.\n\nThe benchmark script takes the following arguments:\n\n```bash\nusage: run_benchmarks.py [-h] [-l LOGFILE] [-p] [-s SAVE_MODEL_DIR]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -l LOGFILE, --logfile LOGFILE\n                        log file to output benchmarking results to\n  -p, --preprocessing_only\n                        only perform preprocessing step\n  -s SAVE_MODEL_DIR, --save_model_dir SAVE_MODEL_DIR\n                        directory to save model to\n```\n\nTo run with Intel® technologies and log the performance to `$OUTPUT_DIR/logs/intel.log`, run the following from the `src` directory (after creating the appropriate environment as above):\n\n[//]: # (capture: baremetal)\n\n```shell\ncd $WORKSPACE/src\nmkdir -p $OUTPUT_DIR/logs  # create logs dir in the OUTPUT_DIR dir if not present\npython run_benchmarks.py -l $OUTPUT_DIR/logs/intel.log\n```\n\nInspect the generated log to check `Test Accuracy`, `Training Time`, `Inference Time` and `Total time` data:\n\n[//]: # (capture: baremetal)\n\n```bash\ntail $OUTPUT_DIR/logs/intel.log\n```\n\n#### Clean Up Bare Metal\n\nFollow these steps to restore your ``$WORKSPACE`` directory to its initial state. Please note that all downloaded dataset files, the conda environment, and logs created by the workflow will be deleted. 
Before executing next steps back up your important files.\n\n```bash\nconda deactivate\nconda remove --name intelligent_indexing_intel --all -y\ncd $DATA_DIR\nrm -r huffpost News_Category_Dataset_v3.json\nrm -r $OUTPUT_DIR/logs\n```\n\n### Run Using Jupyter Notebook\n\nYou can directly access the Jupyter Notebook shared in this repo [here](./IntelligentIndexing.ipynb).\n\n1. Follow the instructions described on [Get Started](#get-started) to set required environment variables.\n\nTo launch Jupyter Notebook, execute the next commands:\n\n1. Execute [Set Up Conda](#set-up-conda) and [Set Up environment](#set-up-environment) steps.\n\n2. Activate Intel environment.\n\n    ```bash\n    conda activate intelligent_indexing_intel\n    ```\n\n3. Install the IPython Kernel Package.\n\n    ```bash\n    conda install -c intel ipykernel -y\n    ```\n\n4. Create a virtual environment and Install Jupyter Notebook.\n\n    ```shell\n    conda create -n jupyter_server -c intel nb_conda_kernels notebook -y\n    ```\n\n5. Activate Jupyter Server environment.\n\n    ```shell\n    conda activate jupyter_server\n    ```\n\n6. Change to working directory.\n\n    ```shell\n    cd $WORKSPACE\n    ```\n\n7. 
Execute Jupyter command.\n\n    ```shell\n    jupyter notebook\n    ```\n\n#### Connect to Jupyter Notebook Server\n\nThe above command prints some information about the notebook server in your terminal, including the URL of the web application (by default, http://localhost:8888), for example:\n\n```shell\nTo access the notebook, open this file in a browser: \nfile:///path/to/jupyter/notebook/server/open.html\nOr copy and paste one of these URLs: \nhttp://*********:8888/?token=***************************************** or \nhttp://127.0.0.1:8888/?token=*****************************************\n```\n\nCopy and paste one of the URLs into a web browser to open the Jupyter Notebook Dashboard.\n\nOnce in Jupyter, click on **IntelligentIndexing.ipynb** to get an interactive demo of the workflow.\n\n#### Clean Up Jupyter Notebook\n\nClean up the environment by executing the following commands:\n\n```bash\nconda activate base\nconda remove --name intelligent_indexing_intel --all -y\nconda remove --name jupyter_server --all -y\ncd $DATA_DIR\nrm -r huffpost News_Category_Dataset_v3.json\nrm -r $OUTPUT_DIR/logs\n```\n\n## Expected Output\n\nBenchmark results are stored in the `$OUTPUT_DIR/logs/intel.log` file.\n\nCheck out the `Test Accuracy`, `Training Time`, `Inference Time` and `Total time` of the workflow. For example:\n\n```bash\nINFO:root:=======\u003e Test Accuracy : 0.63\nINFO:root:=======\u003e Training Time : 229.324 secs\nINFO:root:=======\u003e Inference Time : 128.583 secs\nINFO:root:=======\u003e Total time : 374.083 secs\n```\n\n## Summary and Next Steps\n\nThis workflow breaks down into the 3 primary tasks of the ML pipeline:\n\n1. Preprocessing data using Intel® Distribution of Modin* with the Ray* Backend.\n2. Training an NLP model for text classification using Intel® Extension for Scikit-learn*.\n3. 
Predicting from the trained model on new data using Intel® Extension for Scikit-learn*.\n\nThis text categorization exercise can be used as a reference implementation across similar use cases, with Intel AI optimizations enabled to accelerate the End-to-End (E2E) process.\n\n## Learn More\n\nFor more information, or to read about other relevant workflow examples, see these guides and software resources:\n\n* [Intel® AI Analytics Toolkit (AI Kit)](https://www.intel.com/content/www/us/en/developer/tools/oneapi/ai-analytics-toolkit.html)\n* [Intel® Distribution for Python*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-for-python.html)\n* [Intel® Distribution of Modin*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/distribution-of-modin.html)\n* [Intel® Extension for Scikit-Learn*](https://www.intel.com/content/www/us/en/developer/tools/oneapi/scikit-learn.html)\n\n## Support\n\nIf you have questions or issues about this use case, want help with troubleshooting, or want to report a bug or submit enhancement requests, please submit a GitHub issue.\n\n## Appendix\n\n### References\n\n\u003ca id=\"joachims_1998\"\u003e[1]\u003c/a\u003e Joachims, Thorsten. \"Text categorization with support vector machines: Learning with many relevant features.\" European conference on machine learning. Springer, Berlin, Heidelberg, 1998.\n\n\u003ca id=\"manning_2010\"\u003e[2]\u003c/a\u003e Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. \"Introduction to information retrieval. Chapter 15.\" Natural Language Engineering 16.1 (2010): 100-103.\n\n\\*Other names and brands may be claimed as the property of others. 
[Trademarks](https://www.intel.com/content/www/us/en/legal/trademarks.html).\n\n### Disclaimers\n\nTo the extent that any public or non-Intel datasets or models are referenced by or accessed using tools or code on this site those datasets or models are provided by the third party indicated as the content source. Intel does not create the content and does not warrant its accuracy or quality. By accessing the public content, or using materials trained on or with such content, you agree to the terms associated with that content and that your use complies with the applicable license.\n\nIntel expressly disclaims the accuracy, adequacy, or completeness of any such public content, and is not liable for any errors, omissions, or defects in the content, or for any reliance on the content. Intel is not liable for any liability or damages relating to your use of public content.\n","funding_links":[],"categories":["Table of Contents"],"sub_categories":["AI - Frameworks and Toolkits"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fintelligent-indexing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foneapi-src%2Fintelligent-indexing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foneapi-src%2Fintelligent-indexing/lists"}