{"id":15425129,"url":"https://github.com/eisber/sarplus","last_synced_at":"2025-08-08T13:12:10.738Z","repository":{"id":95102460,"uuid":"154530010","full_name":"eisber/sarplus","owner":"eisber","description":"pronounced sUrplus as it's simply better if not best!","archived":false,"fork":false,"pushed_at":"2021-04-20T20:25:54.000Z","size":62,"stargazers_count":12,"open_issues_count":1,"forks_count":3,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-12T20:12:17.293Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eisber.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-24T16:03:40.000Z","updated_at":"2022-02-13T12:20:37.000Z","dependencies_parsed_at":"2023-03-08T22:00:53.324Z","dependency_job_id":null,"html_url":"https://github.com/eisber/sarplus","commit_stats":{"total_commits":56,"total_committers":2,"mean_commits":28.0,"dds":0.1964285714285714,"last_synced_commit":"ebc19a0a2297565c41e24413a0d33fbfab93aef3"},"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eisber%2Fsarplus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eisber%2Fsarplus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eisber%2Fsarplus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eisber%2Fsarplus/manifests","owner_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/owners/eisber","download_url":"https://codeload.github.com/eisber/sarplus/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249735225,"owners_count":21318006,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T17:49:58.220Z","updated_at":"2025-04-19T16:14:52.759Z","avatar_url":"https://github.com/eisber.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# sarplus (preview)\r\npronounced sUrplus as it's simply better if not best!\r\n\r\n[![Build Status](https://dev.azure.com/marcozo-sarplus/sarplus/_apis/build/status/eisber.sarplus)](https://dev.azure.com/marcozo-sarplus/sarplus/_build/latest?definitionId=1)\r\n[![PyPI version](https://badge.fury.io/py/pysarplus.svg)](https://badge.fury.io/py/pysarplus)\r\n\r\nFeatures\r\n* Scalable PySpark-based [implementation](python/pysarplus/SARPlus.py)\r\n* Fast C++-based [predictions](python/src/pysarplus.cpp)\r\n* Reduced memory consumption: similarity matrix cached in-memory once per worker, shared across Python executors\r\n* Easy setup using [Spark Packages](https://spark-packages.org/package/eisber/sarplus)\r\n\r\n# Benchmarks\r\n\r\n| # Users | # Items | # Ratings | Runtime | Environment | Dataset | \r\n|---------|---------|-----------|---------|-------------|---------|\r\n| 2.5mio  | 35k     | 100mio    | 1.3h    | Databricks, 8 workers, [Azure Standard DS3 v2](https://azure.microsoft.com/en-us/pricing/details/virtual-machines/linux/) (4 core machines) | |\r\n\r\n# Top-K 
Recommendation Optimization\r\n\r\nThere are several key optimizations:\r\n\r\n* map item ids (e.g. strings) to a contiguous set of indexes to optimize storage and simplify access\r\n* convert the similarity matrix to exactly the representation the C++ component needs, thus enabling simple, shared memory mapping of the cache file and avoiding parsing. This requires a custom formatter, written in Scala\r\n* shared read-only memory mapping allows us to re-use the same memory from multiple Python executors on the same worker node\r\n* partition the input test users and past seen items by user, allowing for scale out\r\n* perform as much of the work as possible in PySpark (way simpler)\r\n* top-k computation\r\n  * reverse the join by reverse-joining the user's past seen items with any related items and summing\r\n  * always keep only the top-k items in memory\r\n  * use a standard join with binary search between the user's past seen items and the related items\r\n\r\n![Image of sarplus top-k recommendation optimization](images/sarplus_udf.svg) \r\n\r\n# Usage\r\n\r\n```python\r\nimport pandas as pd\r\nfrom pysarplus import SARPlus\r\n\r\n# spark dataframe with user/item/rating/optional timestamp tuples\r\ntrain_df = spark.createDataFrame(\r\n      pd.DataFrame({\r\n        'user_id': [1, 1, 2, 3, 3],\r\n        'item_id': [1, 2, 1, 1, 3],\r\n        'rating':  [1, 1, 1, 1, 1],\r\n    }))\r\n   \r\n# spark dataframe with user/item tuples\r\ntest_df = spark.createDataFrame(\r\n      pd.DataFrame({\r\n        'user_id': [1, 3],\r\n        'item_id': [1, 3],\r\n        'rating':  [1, 1],\r\n    }))\r\n    \r\nmodel = SARPlus(spark, col_user='user_id', col_item='item_id', col_rating='rating', col_timestamp='timestamp')\r\nmodel.fit(train_df, similarity_type='jaccard')\r\n\r\n\r\nmodel.recommend_k_items(test_df, 'sarplus_cache', top_k=3).show()\r\n\r\n# For databricks\r\n# model.recommend_k_items(test_df, 'dbfs:/mnt/sarpluscache', top_k=3).show()\r\n```\r\n\r\n## Jupyter 
Notebook\r\n\r\nInsert this cell prior to the code above.\r\n\r\n```python\r\nimport os\r\n\r\nSUBMIT_ARGS = \"--packages eisber:sarplus:0.2.5 pyspark-shell\"\r\nos.environ[\"PYSPARK_SUBMIT_ARGS\"] = SUBMIT_ARGS\r\n\r\nfrom pyspark.sql import SparkSession\r\n\r\nspark = (\r\n    SparkSession.builder.appName(\"sample\")\r\n    .master(\"local[*]\")\r\n    .config(\"memory\", \"1G\")\r\n    .config(\"spark.sql.shuffle.partitions\", \"1\")\r\n    .config(\"spark.sql.crossJoin.enabled\", True)\r\n    .config(\"spark.ui.enabled\", False)\r\n    .getOrCreate()\r\n)\r\n```\r\n\r\n## PySpark Shell\r\n\r\n```bash\r\npip install pysarplus\r\npyspark --packages eisber:sarplus:0.2.5 --conf spark.sql.crossJoin.enabled=true\r\n```\r\n\r\n## Databricks\r\n\r\nOne must set the crossJoin property to enable calculation of the similarity matrix (Clusters / \u0026lt; Cluster \u0026gt; / Configuration / Spark Config)\r\n\r\n```\r\nspark.sql.crossJoin.enabled true\r\n```\r\n\r\n1. Navigate to your workspace \r\n2. Create library\r\n3. Under 'Source' select 'Maven Coordinate'\r\n4. Enter 'eisber:sarplus:0.2.5' or 'eisber:sarplus:0.2.6' if you're on Spark 2.4.1\r\n5. Hit 'Create Library'\r\n6. Attach to your cluster\r\n7. Create 2nd library\r\n8. Under 'Source' select 'Upload Python Egg or PyPI'\r\n9. Enter 'pysarplus'\r\n10. Hit 'Create Library'\r\n\r\nThis will install C++, Python and Scala code on your cluster.\r\n\r\nYou'll also have to mount shared storage\r\n\r\n1. Create [Azure Storage Blob](https://ms.portal.azure.com/#create/Microsoft.StorageAccount-ARM)\r\n2. Create storage account (e.g. \u003cyourcontainer\u003e)\r\n3. Create container (e.g. sarpluscache)\r\n\r\n1. Navigate to User / User Settings\r\n2. Generate new token: enter 'sarplus'\r\n3. Use databricks shell (installation here)\r\n4. databricks configure --token\r\n4.1. Host: e.g. https://westus.azuredatabricks.net\r\n5. databricks secrets create-scope --scope all --initial-manage-principal users\r\n6. 
databricks secrets put --scope all --key sarpluscache\r\n6.1. enter Azure Storage Blob key of Azure Storage created before\r\n7. Run mount code\r\n\r\n\r\n```pyspark\r\ndbutils.fs.mount(\r\n  source = \"wasbs://sarpluscache@\u003caccountname\u003e.blob.core.windows.net\",\r\n  mount_point = \"/mnt/sarpluscache\",\r\n  extra_configs = {\"fs.azure.account.key.\u003caccountname\u003e.blob.core.windows.net\":dbutils.secrets.get(scope = \"all\", key = \"sarpluscache\")})\r\n```\r\n\r\nDisable annoying logging\r\n\r\n```pyspark\r\nimport logging\r\nlogging.getLogger(\"py4j\").setLevel(logging.ERROR)\r\n```\r\n\r\n\r\n# Packaging\r\n\r\nFor [databricks](https://databricks.com/) to properly install a [C++ extension](https://docs.python.org/3/extending/building.html), one must take a detour through [pypi](https://pypi.org/).\r\nUse [twine](https://github.com/pypa/twine) to upload the package to [pypi](https://pypi.org/).\r\n\r\n```bash\r\ncd python\r\n\r\npython setup.py sdist\r\n\r\ntwine upload dist/pysarplus-*.tar.gz\r\n```\r\n\r\nOn [Spark](https://spark.apache.org/) one can install all 3 components (C++, Python, Scala) in one pass by creating a [Spark Package](https://spark-packages.org/). Documentation is rather sparse. Steps to install\r\n\r\n1. Package and publish the [pip package](python/setup.py) (see above)\r\n2. Package the [Spark package](scala/build.sbt), which includes the [Scala formatter](scala/src/main/scala/eisber/sarplus) and references the [pip package](scala/python/requirements.txt) (see below)\r\n3. Upload the zipped Scala package to [Spark Package](https://spark-packages.org/) through a browser. [sbt spPublish](https://github.com/databricks/sbt-spark-package) has a few [issues](https://github.com/databricks/sbt-spark-package/issues/31) so it always fails for me. 
Don't use spPublishLocal as the packages are not created properly (names don't match up, [issue](https://github.com/databricks/sbt-spark-package/issues/17)) and furthermore fail to install if published to [Spark-Packages.org](https://spark-packages.org/).  \r\n\r\n```bash\r\ncd scala\r\nsbt spPublish\r\n```\r\n\r\n# Testing\r\n\r\nTo test the python UDF + C++ backend\r\n\r\n```bash\r\ncd python \r\npython setup.py install \u0026\u0026 pytest -s tests/\r\n```\r\n\r\nTo test the Scala formatter\r\n\r\n```bash\r\ncd scala\r\nsbt test\r\n```\r\n\r\n(use ~test and it will automatically check for changes in source files, but not build.sbt)\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feisber%2Fsarplus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feisber%2Fsarplus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feisber%2Fsarplus/lists"}