{"id":19283025,"url":"https://github.com/david-cortes/hpfrec","last_synced_at":"2025-05-07T15:23:14.960Z","repository":{"id":57437594,"uuid":"135206877","full_name":"david-cortes/hpfrec","owner":"david-cortes","description":"Python implementation of 'Scalable Recommendation with Hierarchical Poisson Factorization'.","archived":false,"fork":false,"pushed_at":"2025-01-06T19:53:46.000Z","size":322,"stargazers_count":79,"open_issues_count":0,"forks_count":19,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-04-30T04:18:05.257Z","etag":null,"topics":["implicit-feedback","poisson-factorization"],"latest_commit_sha":null,"homepage":"http://hpfrec.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-2-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/david-cortes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-28T20:35:59.000Z","updated_at":"2025-01-06T19:53:50.000Z","dependencies_parsed_at":"2024-06-21T17:51:52.333Z","dependency_job_id":"096feb38-9afe-40c8-8a11-33ecd7433cf0","html_url":"https://github.com/david-cortes/hpfrec","commit_stats":{"total_commits":108,"total_committers":4,"mean_commits":27.0,"dds":"0.13888888888888884","last_synced_commit":"4d96bf3c5e2b7e71882c2564dc5679961d2652a3"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/david-cortes%2Fhpfrec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/david-cortes%2Fhpfrec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/david-cortes%2Fhpfrec/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/david-cortes%2Fhpfrec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/david-cortes","download_url":"https://codeload.github.com/david-cortes/hpfrec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252903179,"owners_count":21822391,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["implicit-feedback","poisson-factorization"],"created_at":"2024-11-09T21:29:27.385Z","updated_at":"2025-05-07T15:23:14.939Z","avatar_url":"https://github.com/david-cortes.png","language":"Python","readme":"# Hierarchical Poisson Factorization\n\nThis is a Python package for hierarchical Poisson factorization, a form of probabilistic matrix factorization used for recommender systems with implicit count data, based on the paper _Scalable Recommendation with Hierarchical Poisson Factorization (P. Gopalan, 2015)_.\n\nAlthough the package was created with recommender systems in mind, it can also be used for other domains, e.g. as a faster alternative to LDA (Latent Dirichlet Allocation), where users become documents and items become words.\n\nSupports parallelization, full-batch variational inference, mini-batch stochastic variational inference (alternating between epochs sampling batches of users and epochs sampling batches of items), and different stopping criteria for the coordinate-ascent procedure. 
The main computations are written in fast Cython code.\n\nAs a point of reference, fitting the model through full-batch updates to the MillionSong TasteProfile dataset (48M records from 1M users on 380K items) took around 45 minutes on a Google Cloud server with a Skylake CPU when using 24 cores.\n\nFor a similar package that also uses item/user side information, see [ctpfrec](https://github.com/david-cortes/ctpfrec).\n\nFor a non-Bayesian version which can produce sparse factors, see [poismf](https://github.com/david-cortes/poismf).\n\n** *\n\n*Note: this package can also be used from within [LensKit](https://github.com/lenskit/lkpy), which adds functionalities such as cross-validation and calculation of recommendation quality metrics.*\n\n## Model description\n\nThe model consists of producing a non-negative low-rank matrix factorization of counts data (such as the number of times each user played each song in some internet service) `Y ~= UV'`, produced by a generative model as follows:\n```\nksi_u ~ Gamma(a_prime, a_prime/b_prime)\nTheta_uk ~ Gamma(a, ksi_u)\n\neta_i ~ Gamma(c_prime, c_prime/d_prime)\nBeta_ik ~ Gamma(c, eta_i)\n\nY_ui ~ Poisson(Theta_u' Beta_i)\n```\nThe parameters are fit using mean-field approximation (a form of Bayesian variational inference) with coordinate ascent (updating each parameter separately until convergence).\n\n## Installation\n\n**Note:** requires a C compiler configured for Python. See [this guide](https://github.com/david-cortes/installing-optimized-libraries) for instructions.\n\nThe package is available on PyPI and can be installed with:\n\n```\npip install hpfrec\n```\n\nOr if that fails:\n```\npip install --no-use-pep517 hpfrec\n```\n\n** *\n**Note for macOS users:** the Python version of this package might compile **without** multi-threading capabilities. 
In order to enable multi-threading support, first install OpenMP:\n```\nbrew install libomp\n```\nAnd then reinstall this package: `pip install --upgrade --no-deps --force-reinstall hpfrec`.\n\n** *\n**IMPORTANT:** the setup script will try to add the compilation flag `-march=native`. This instructs the compiler to tune the package for the CPU on which it is being installed (e.g. by using AVX instructions if available), but the result might not be usable on other computers. If building a binary wheel of this package or putting it into a Docker image which will be used on different machines, this can be overridden either by (a) defining an environment variable `DONT_SET_MARCH=1`, or by (b) manually supplying compilation `CFLAGS` as an environment variable with something related to architecture. For maximum compatibility (but slowest speed), it's possible to do something like this:\n\n```\nexport DONT_SET_MARCH=1\npip install hpfrec\n```\n\nor, for forcing a maximum-compatibility x86-64 binary:\n```\nexport CFLAGS=\"-march=x86-64\"\npip install hpfrec\n```\n** *\n\n## Sample usage\n\n```python\nimport pandas as pd, numpy as np\nfrom hpfrec import HPF\n\n## Generating sample counts data\nnusers = 10**2\nnitems = 10**2\nnobs   = 10**4\n\nnp.random.seed(1)\ncounts_df = pd.DataFrame({\n\t'UserId' : np.random.randint(nusers, size=nobs),\n\t'ItemId' : np.random.randint(nitems, size=nobs),\n\t'Count' :  (np.random.gamma(1,1, size=nobs) + 1).astype('int32')\n\t})\ncounts_df = counts_df.loc[~counts_df[['UserId', 'ItemId']].duplicated()].reset_index(drop=True)\n\n## Initializing the model object\nrecommender = HPF()\n\n## For stochastic variational inference, need to select batch size (number of users)\nrecommender = HPF(users_per_batch = 20)\n\n## Full function call\nrecommender = HPF(\n\tk=30, a=0.3, a_prime=0.3, b_prime=1.0,\n\tc=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,\n\tstop_crit='train-llk', check_every=10, stop_thr=1e-3,\n\tusers_per_batch=None, items_per_batch=None, 
step_size=lambda x: 1/np.sqrt(x+2),\n\tmaxiter=100, use_float=True, reindex=True, verbose=True,\n\trandom_seed=None, allow_inconsistent_math=False, full_llk=False,\n\talloc_full_phi=False, keep_data=True, save_folder=None,\n\tproduce_dicts=True, keep_all_objs=True, sum_exp_trick=False\n)\n\n## Fitting the model to the data\nrecommender.fit(counts_df)\n\n## Fitting the model while monitoring a validation set\nrecommender = HPF(stop_crit='val-llk')\nrecommender.fit(counts_df, val_set=counts_df.sample(10**2))\n## Note: a real validation set should NEVER be a subset of the training set\n\n## Fitting the model to data in batches passed by the user\nrecommender = HPF(reindex=False, keep_data=False)\nusers_batch1 = np.unique(np.random.randint(10**2, size=20))\nusers_batch2 = np.unique(np.random.randint(10**2, size=20))\nusers_batch3 = np.unique(np.random.randint(10**2, size=20))\nrecommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch1)], nusers=10**2, nitems=10**2)\nrecommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch2)])\nrecommender.partial_fit(counts_df.loc[counts_df.UserId.isin(users_batch3)])\n\n## Making predictions\n# recommender.topN(user=10, n=10, exclude_seen=True) ## not available when using 'partial_fit'\nrecommender.topN(user=10, n=10, exclude_seen=False, items_pool=np.array([1,2,3,4]))\nrecommender.predict(user=10, item=11)\nrecommender.predict(user=[10,10,10], item=[1,2,3])\nrecommender.predict(user=[10,11,12], item=[4,5,6])\n\n## Evaluating Poisson likelihood\nrecommender.eval_llk(counts_df, full_llk=True)\n\n## Determining latent factors for a new user, given her item interactions\nnobs_new = 20\nnp.random.seed(2)\ncounts_df_new = pd.DataFrame({\n\t'ItemId' : np.random.choice(np.arange(nitems), size=nobs_new, replace=False),\n\t'Count' : np.random.gamma(1,1, size=nobs_new).astype('int32')\n\t})\ncounts_df_new = counts_df_new.loc[counts_df_new.Count \u003e 
0].reset_index(drop=True)\nrecommender.predict_factors(counts_df_new)\n\n## Adding a user without refitting the whole model\nrecommender.add_user(user_id=nusers+1, counts_df=counts_df_new)\n\n## Updating data for an existing user without refitting the whole model\nchosen_user = counts_df.UserId.values[10]\nrecommender.add_user(user_id=chosen_user, counts_df=counts_df_new, update_existing=True)\n```\n\nIf passing `reindex=True`, all user and item IDs that you pass to `.fit` will be reindexed internally (they need to be hashable types like `str`, `int` or `tuple`), and you can use these same IDs to make predictions later. The IDs returned by `predict` and `topN` are likewise the IDs that were passed to `.fit`.\n\nFor a more detailed example, see the IPython notebook [recommending songs with EchoNest MillionSong dataset](http://nbviewer.jupyter.org/github/david-cortes/hpfrec/blob/master/example/hpfrec_echonest.ipynb) illustrating its usage with the EchoNest TasteProfile dataset.\n\n## Documentation\n\nDocumentation is available at readthedocs: [http://hpfrec.readthedocs.io](http://hpfrec.readthedocs.io/en/latest/)\n\nIt is also internally documented through docstrings (e.g. you can try `help(hpfrec.HPF)`, `help(hpfrec.HPF.fit)`, etc.).\n\n## Serializing (pickling) the model\n\nDon't use `pickle` to save an `HPF` object, as it will fail due to problems with lambda functions. Use `dill` instead, which has the same syntax as `pickle`:\n\n```python\nimport dill\nfrom hpfrec import HPF\n\nh = HPF()\ndill.dump(h, open(\"HPF_obj.dill\", \"wb\"))\nh = dill.load(open(\"HPF_obj.dill\", \"rb\"))\n```\n\n## Speeding up optimization procedure\n\nFor faster fitting and predictions, use SciPy and NumPy libraries compiled against MKL or OpenBLAS. 
These come by default with MKL in Anaconda installations.\n\nThe constructor for HPF allows some parameters to make it run faster (if you know what you're doing): these are `allow_inconsistent_math=True`, `full_llk=False`, `stop_crit='diff-norm'`, `reindex=False`, `verbose=False`. See the documentation for more details.\n\nUsing stochastic variational inference, which fits the data in smaller batches containing all the user-item interactions only for subsets of users, might converge in fewer iterations (epochs), but the results tend to be slightly worse.\n\n## References\n* [1] Gopalan, Prem, Jake M. Hofman, and David M. Blei. \"Scalable Recommendation with Hierarchical Poisson Factorization.\" UAI. 2015.\n* [2] Gopalan, Prem, Jake M. Hofman, and David M. Blei. \"Scalable recommendation with poisson factorization.\" arXiv preprint arXiv:1311.1704 (2013).\n* [3] Hoffman, Matthew D., et al. \"Stochastic variational inference.\" The Journal of Machine Learning Research 14.1 (2013): 1303-1347.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavid-cortes%2Fhpfrec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavid-cortes%2Fhpfrec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavid-cortes%2Fhpfrec/lists"}