{"id":13748097,"url":"https://github.com/ycjuan/libffm","last_synced_at":"2025-03-25T17:33:24.789Z","repository":{"id":33359564,"uuid":"37004390","full_name":"ycjuan/libffm","owner":"ycjuan","description":"A Library for Field-aware Factorization Machines","archived":true,"fork":false,"pushed_at":"2024-08-16T04:32:57.000Z","size":51,"stargazers_count":1597,"open_issues_count":23,"forks_count":460,"subscribers_count":71,"default_branch":"master","last_synced_at":"2024-10-29T18:02:49.987Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ycjuan.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-07T04:20:13.000Z","updated_at":"2024-10-23T15:01:48.000Z","dependencies_parsed_at":"2022-07-16T07:47:00.115Z","dependency_job_id":null,"html_url":"https://github.com/ycjuan/libffm","commit_stats":null,"previous_names":[],"tags_count":13,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ycjuan%2Flibffm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ycjuan%2Flibffm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ycjuan%2Flibffm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ycjuan%2Flibffm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ycjuan","download_url":"https://codeload.github.com/ycjuan/libffm/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245511066,"owners_count":20627309,"icon_url":"ht
tps://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-03T07:00:34.290Z","updated_at":"2025-03-25T17:33:24.558Z","avatar_url":"https://github.com/ycjuan.png","language":"C++","funding_links":[],"categories":["Recommender Systems","推荐系统"],"sub_categories":[],"readme":"Table of Contents\r\n=================\r\n\r\n- What is LIBFFM\r\n- Overfitting and Early Stopping\r\n- Installation\r\n- Data Format\r\n- Command Line Usage\r\n- Examples\r\n- OpenMP and SSE\r\n- Building Windows Binaries\r\n- FAQ\r\n\r\n\r\nWhat is LIBFFM\r\n==============\r\n\r\nLIBFFM is a library for field-aware factorization machines (FFM).\r\n\r\nThe field-aware factorization machine is an effective model for CTR prediction. It has been used to win top-3 positions\r\nin the following competitions:\r\n\r\n    * Criteo: https://www.kaggle.com/c/criteo-display-ad-challenge\r\n\r\n    * Avazu: https://www.kaggle.com/c/avazu-ctr-prediction\r\n\r\n    * Outbrain: https://www.kaggle.com/c/outbrain-click-prediction\r\n\r\n    * RecSys 2015: http://dl.acm.org/citation.cfm?id=2813511\u0026dl=ACM\u0026coll=DL\u0026CFID=941880276\u0026CFTOKEN=60022934\r\n\r\nYou can find more information about FFM in the following papers and slides:\r\n\r\n    * http://www.csie.ntu.edu.tw/~r01922136/slides/ffm.pdf\r\n\r\n    * http://www.csie.ntu.edu.tw/~cjlin/papers/ffm.pdf\r\n\r\n    * https://arxiv.org/abs/1701.04099\r\n\r\n\r\nOverfitting and Early Stopping\r\n==============================\r\n\r\nFFM is prone to overfitting, and the solution we have so far is early stopping. 
See how FFM behaves on a certain data\r\nset:\r\n\r\n    \u003e ffm-train -p va.ffm -l 0.00002 tr.ffm\r\n    iter   tr_logloss   va_logloss\r\n       1      0.49738      0.48776\r\n       2      0.47383      0.47995\r\n       3      0.46366      0.47480\r\n       4      0.45561      0.47231\r\n       5      0.44810      0.47034\r\n       6      0.44037      0.47003\r\n       7      0.43239      0.46952\r\n       8      0.42362      0.46999\r\n       9      0.41394      0.47088\r\n      10      0.40326      0.47228\r\n      11      0.39156      0.47435\r\n      12      0.37886      0.47683\r\n      13      0.36522      0.47975\r\n      14      0.35079      0.48321\r\n      15      0.33578      0.48703\r\n\r\n\r\nWe see that the best validation loss is achieved at the 7th iteration. If we keep training, then overfitting begins. It is worth\r\nnoting that increasing the regularization parameter does not help:\r\n\r\n    \u003e ffm-train -p va.ffm -l 0.0002 -t 50 -s 12 tr.ffm\r\n    iter   tr_logloss   va_logloss\r\n       1      0.50532      0.49905\r\n       2      0.48782      0.49242\r\n       3      0.48136      0.48748\r\n                 ...\r\n      29      0.42183      0.47014\r\n                 ...\r\n      48      0.37071      0.47333\r\n      49      0.36767      0.47374\r\n      50      0.36472      0.47404\r\n\r\n\r\nTo avoid overfitting, we recommend always providing a validation set with option `-p'. You can use option `--auto-stop' to\r\nstop at the iteration that reaches the best validation loss:\r\n\r\n    \u003e ffm-train -p va.ffm -l 0.00002 --auto-stop tr.ffm\r\n    iter   tr_logloss   va_logloss\r\n       1      0.49738      0.48776\r\n       2      0.47383      0.47995\r\n       3      0.46366      0.47480\r\n       4      0.45561      0.47231\r\n       5      0.44810      0.47034\r\n       6      0.44037      0.47003\r\n       7      0.43239      0.46952\r\n       8      0.42362      0.46999\r\n    Auto-stop. 
Use model at 7th iteration.\r\n\r\n\r\nInstallation\r\n============\r\n\r\nRequirement: LIBFFM requires a C++11-compatible compiler. We also use OpenMP to provide multi-threading. If OpenMP is not\r\navailable on your platform, please refer to section `OpenMP and SSE'.\r\n\r\n- Unix-like systems:\r\n  Type `make' in the command line.\r\n\r\n- Windows:\r\n  See `Building Windows Binaries' to compile.\r\n\r\n\r\n\r\nData Format\r\n===========\r\n\r\nThe data format of LIBFFM is:\r\n\r\n\u003clabel\u003e \u003cfield1\u003e:\u003cfeature1\u003e:\u003cvalue1\u003e \u003cfield2\u003e:\u003cfeature2\u003e:\u003cvalue2\u003e ...\r\n.\r\n.\r\n.\r\n\r\n`field' and `feature' should be non-negative integers. See the example `bigdata.tr.txt'.\r\n\r\nIt is important to understand the difference between `field' and `feature'. For example, if we have raw data like this:\r\n\r\nClick  Advertiser  Publisher\r\n=====  ==========  =========\r\n    0        Nike        CNN\r\n    1        ESPN        BBC\r\n\r\nHere, we have\r\n\r\n    * 2 fields: Advertiser and Publisher\r\n\r\n    * 4 features: Advertiser-Nike, Advertiser-ESPN, Publisher-CNN, Publisher-BBC\r\n\r\nUsually you will need to build two dictionaries, one for fields and one for features, like this:\r\n    \r\n    DictField[Advertiser] -\u003e 0\r\n    DictField[Publisher]  -\u003e 1\r\n    \r\n    DictFeature[Advertiser-Nike] -\u003e 0\r\n    DictFeature[Publisher-CNN]   -\u003e 1\r\n    DictFeature[Advertiser-ESPN] -\u003e 2\r\n    DictFeature[Publisher-BBC]   -\u003e 3\r\n\r\nThen, you can generate FFM format data:\r\n\r\n    0 0:0:1 1:1:1\r\n    1 0:2:1 1:3:1\r\n\r\nNote that because these features are categorical, the values here are all ones.\r\n\r\n\r\nCommand Line Usage\r\n==================\r\n\r\n-   `ffm-train'\r\n\r\n    usage: ffm-train [options] training_set_file [model_file]\r\n\r\n    options:\r\n    -l \u003clambda\u003e: set regularization parameter (default 0.00002)\r\n    -k \u003cfactor\u003e: set number 
of latent factors (default 4)\r\n    -t \u003citeration\u003e: set number of iterations (default 15)\r\n    -r \u003ceta\u003e: set learning rate (default 0.2)\r\n    -s \u003cnr_threads\u003e: set number of threads (default 1)\r\n    -p \u003cpath\u003e: set path to the validation set\r\n    --quiet: quiet mode (no output)\r\n    --no-norm: disable instance-wise normalization\r\n    --auto-stop: stop at the iteration that achieves the best validation loss (must be used with -p)\r\n\r\n    By default we do instance-wise normalization. That is, we normalize the 2-norm of each instance to 1. You can use\r\n    `--no-norm' to disable this function.\r\n    \r\n    A binary file `training_set_file.bin' will be generated to store the data in binary format.\r\n\r\n    Because FFM usually needs early stopping for better test performance, we provide an option `--auto-stop' to stop at\r\n    the iteration that achieves the best validation loss. Note that you need to provide a validation set with `-p' when\r\n    you use this option.\r\n\r\n\r\n-   `ffm-predict'\r\n\r\n    usage: ffm-predict test_file model_file output_file\r\n\r\n\r\n\r\nExamples\r\n========\r\n\r\nDownload a toy data set from:\r\n\r\n    zip: https://drive.google.com/open?id=1HZX7zSQJy26hY4_PxSlOWz4x7O-tbQjt\r\n\r\n    tar.gz: https://drive.google.com/open?id=12-EczjiYGyJRQLH5ARy1MXRFbCvkgfPx\r\n\r\nThis data set is a 1% subsample of the data from Criteo's challenge.\r\n\r\n\u003e tar -xzf libffm_toy.tar.gz\r\n\r\nor\r\n\r\n\u003e unzip libffm_toy.zip\r\n\r\n\r\n\u003e ./ffm-train -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model\r\n\r\ntrain a model using the default parameters\r\n\r\n\r\n\u003e ./ffm-predict libffm_toy/criteo.va.r100.gbdt0.ffm model output\r\n\r\ndo prediction\r\n\r\n\r\n\u003e ./ffm-train -l 0.0001 -k 15 -t 30 -r 0.05 -s 4 --auto-stop -p libffm_toy/criteo.va.r100.gbdt0.ffm libffm_toy/criteo.tr.r100.gbdt0.ffm model\r\n\r\ntrain a model using the following parameters:\r\n\r\n 
   regularization cost = 0.0001\r\n    latent factors = 15\r\n    iterations = 30\r\n    learning rate = 0.05\r\n    threads = 4\r\n    let it auto-stop\r\n\r\n\r\nOpenMP and SSE\r\n==============\r\n\r\nWe use OpenMP to do parallelization. If OpenMP is not available on your\r\nplatform, then please comment out the following lines in Makefile.\r\n\r\n    DFLAG += -DUSEOMP\r\n    CXXFLAGS += -fopenmp\r\n\r\nNote: Please run `make clean all' if these flags are changed.\r\n\r\nWe use SSE instructions to perform fast computation. If you do not want to use them, comment out the following line:\r\n\r\n    DFLAG += -DUSESSE\r\n\r\nThen, run `make clean all'\r\n\r\n\r\n\r\nBuilding Windows Binaries\r\n=========================\r\n\r\nThe Windows part is maintained by a different maintainer, so it may not always support the latest version.\r\n\r\nThe latest version it supports is: v1.21\r\n\r\nTo build the binaries via the command-line tools of Visual C++, use the following steps:\r\n\r\n1. Open a DOS command box (or Developer Command Prompt for Visual Studio) and go to the LIBFFM directory. If the environment\r\nvariables of VC++ have not been set, type\r\n\r\n\"C:\\Program Files (x86)\\Microsoft Visual Studio 12.0\\VC\\bin\\amd64\\vcvars64.bat\"\r\n\r\nYou may have to modify the above command according to which version of VC++ you use or\r\nwhere it is installed.\r\n\r\n2. Type\r\n\r\nnmake -f Makefile.win clean all\r\n\r\n\r\nFAQ\r\n===\r\n\r\nQ: Why do I get the same model size when k = 1 and k = 4?\r\n\r\nA: This is because we use SSE instructions. In order to use SSE, the memory needs to be aligned. So even if you set k =\r\n   1, we still fill in dummy zeros for k = 2 to 4.\r\n\r\n\r\nQ: Why is the logloss slightly different on the same data when I run the program two or more times with multi-threading?\r\n\r\nA: When there is more than one thread, the program becomes non-deterministic. 
To make it deterministic, use only one thread.\r\n\r\n\r\nContributors\r\n============\r\n\r\nYuchin Juan, Wei-Sheng Chin, and Yong Zhuang\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fycjuan%2Flibffm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fycjuan%2Flibffm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fycjuan%2Flibffm/lists"}