Semi-supervised learning frameworks for Python
===============

This project contains Python implementations for semi-supervised learning, made compatible with scikit-learn, including

- **Contrastive Pessimistic Likelihood Estimation (CPLE)** (based on - but not equivalent to - [Loog, 2015](http://arxiv.org/abs/1503.00269)), a "safe" framework applicable to any classifier which can yield prediction probabilities ("safe" here means that the model trained on both labelled and unlabelled data should not be worse than a model trained only on the labelled data)

- Self learning (self training), a naive semi-supervised learning framework applicable to any classifier: it iteratively labels the unlabelled instances using a trained classifier, then re-trains the classifier on the resulting dataset (see e.g. 
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf )

- Semi-Supervised Support Vector Machine (S3VM) - a simple scikit-learn compatible wrapper for the QN-S3VM code developed by Fabian Gieseke, Antti Airola, Tapio Pahikkala, and Oliver Kramer (see http://www.fabiangieseke.de/index.php/code/qns3vm )

The first method is a novel extension of [Loog, 2015](http://arxiv.org/abs/1503.00269) to any discriminative classifier (the differences to the original CPLE are explained below). The last two methods are included only for comparison.

The advantages of the CPLE framework compared to other semi-supervised learning approaches include:

- it is a **generally applicable framework (works with any scikit-learn classifier which supports per-sample weights)**

- it needs little memory (as opposed to e.g. Label Spreading, which needs O(n^2) memory), and

- it makes no additional assumptions except for the ones made by the choice of classifier

The main disadvantage is high computational complexity. 
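The self-training loop described above can be sketched in a few lines. This is a minimal illustration of the general idea, not this project's `SelfLearningModel`; the toy dataset, the confidence threshold, and the choice of logistic regression are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# toy binary problem: 200 points, only the first 20 labelled
# (-1 marks an unlabelled point, as in this project)
X, y_true = make_classification(n_samples=200, random_state=0)
y = np.full(200, -1)
y[:20] = y_true[:20]

clf = LogisticRegression()
for _ in range(10):  # a few self-training rounds
    labelled = y != -1
    clf.fit(X[labelled], y[labelled])
    unlabelled = np.where(~labelled)[0]
    if len(unlabelled) == 0:
        break
    proba = clf.predict_proba(X[unlabelled])
    confident = proba.max(axis=1) > 0.9  # arbitrary confidence threshold
    if not confident.any():
        break
    # adopt the classifier's own confident predictions as labels
    y[unlabelled[confident]] = clf.classes_[proba[confident].argmax(axis=1)]

print("accuracy:", clf.score(X, y_true))
```

Because the classifier trains on its own predictions, errors can reinforce themselves - which is exactly the failure mode the pessimistic CPLE framework is designed to avoid.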
Note: **this is an early stage research project, and work in progress** (it is by no means efficient or well tested)!

If you need faster results, try the Self Learning framework (a naive approach, but much faster):

```python
from sklearn.svm import SVC
from frameworks.SelfLearning import SelfLearningModel

any_scikitlearn_classifier = SVC()
ssmodel = SelfLearningModel(any_scikitlearn_classifier)
ssmodel.fit(X, y)
```

Usage
===============

The project requires [scikit-learn](http://scikit-learn.org/stable/install.html), [matplotlib](http://matplotlib.org/users/installing.html) and [NLopt](http://ab-initio.mit.edu/wiki/index.php/NLopt_Installation) to run.

Usage example:

```python
import random
import numpy as np
import sklearn.svm
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import SGDClassifier

from frameworks.CPLELearning import CPLELearningModel
from frameworks.SelfLearning import SelfLearningModel

# load "Lung cancer" dataset from mldata.org
cancer = fetch_mldata("Lung cancer (Ontario)")
X = cancer.target.T
ytrue = np.copy(cancer.data).flatten()
ytrue[ytrue > 0] = 1

# label a few points
labeled_N = 4
ys = np.array([-1] * len(ytrue))  # -1 denotes an unlabeled point
random_labeled_points = random.sample(list(np.where(ytrue == 0)[0]), labeled_N // 2) + \
                        random.sample(list(np.where(ytrue == 1)[0]), labeled_N // 2)
ys[random_labeled_points] = ytrue[random_labeled_points]

# supervised score
basemodel = SGDClassifier(loss='log', penalty='l1')  # scikit-learn logistic regression
basemodel.fit(X[random_labeled_points, :], ys[random_labeled_points])
print("supervised log.reg. score", basemodel.score(X, ytrue))

# fast (but naive, unsafe) self-learning framework
ssmodel = SelfLearningModel(basemodel)
ssmodel.fit(X, ys)
print("self-learning log.reg. score", ssmodel.score(X, ytrue))

# semi-supervised score (base model has to be able to take weighted samples)
ssmodel = CPLELearningModel(basemodel)
ssmodel.fit(X, ys)
print("CPLE semi-supervised log.reg. score", ssmodel.score(X, ytrue))

# semi-supervised score, RBF SVM model
ssmodel = CPLELearningModel(sklearn.svm.SVC(kernel="rbf", probability=True), predict_from_probabilities=True)
ssmodel.fit(X, ys)
print("CPLE semi-supervised RBF SVM score", ssmodel.score(X, ytrue))

# supervised log.reg. score 0.410256410256
# self-learning log.reg. score 0.461538461538
# semi-supervised log.reg. score 0.615384615385
# semi-supervised RBF SVM score 0.769230769231
```


Examples
===============

Two-class classification examples with 56 unlabelled (small circles in the plots) and 4 labelled (large circles in the plots) data points. Plot titles show classification accuracies (the percentage of data points correctly classified by the model).

In the second example, **the state-of-the-art S3VM performs worse than the purely supervised SVM**, while the CPLE SVM (by means of the pessimistic assumption) provides increased accuracy.

Quadratic Discriminant Analysis (from left to right: supervised QDA, self-learning QDA, pessimistic CPLE QDA)
![Comparison of supervised QDA with CPLE QDA](qdaexample.png)

Support Vector Machine (from left to right: supervised SVM, S3VM [(Gieseke et al., 2012)](http://www.sciencedirect.com/science/article/pii/S0925231213003706), pessimistic CPLE SVM)
![Comparison of supervised SVM, S3VM, and CPLE SVM](svmexample1.png)

Support Vector Machine (from left to right: supervised SVM, S3VM [(Gieseke et al., 2012)](http://www.sciencedirect.com/science/article/pii/S0925231213003706), pessimistic CPLE SVM)
![Comparison of supervised SVM, S3VM, and CPLE SVM](svmexample2.png)

Motivation
===============

Current semi-supervised learning approaches require strong assumptions, and perform badly if those assumptions are violated (e.g. the low-density assumption or the clustering assumption). In some cases, they can perform worse than a supervised classifier trained only on the labeled examples. 
Furthermore, the vast majority require O(N^2) memory.

[(Loog, 2015)](http://arxiv.org/abs/1503.00269) has suggested an elegant framework (Contrastive Pessimistic Likelihood Estimation, CPLE) which **only uses assumptions intrinsic to the chosen classifier**. It thus allows choosing likelihood-based classifiers which fit the domain / data distribution at hand, and can work even if some of the assumptions mentioned above are violated. The idea is to pessimistically assign soft labels to the unlabelled data, such that the improvement over the supervised version is minimal (i.e. to assume the worst case for the unknown labels).

The parameters in CPLE can be estimated according to:

![CPLE Equation](eq1.png)

The original CPLE framework is only applicable to likelihood-based classifiers, and (Loog, 2015) only provides solutions for Linear Discriminant Analysis and the Nearest Mean Classifier.

The CPLE implementation in this project
===============

Building on this idea, this project contains a general semi-supervised learning framework which allows plugging in **any classifier** which 1) supports instance weighting and 2) can generate probability estimates (such probability estimates can also be provided by [Platt scaling](https://en.wikipedia.org/wiki/Platt_scaling) for classifiers which don't support them; an experimental feature is also included to make the approach work with classifiers not supporting instance weighting).

In order to make the approach work with any classifier, the discriminative likelihood (DL) is used instead of the generative likelihood, which is the first major difference to (Loog, 2015). The second difference is that only the unlabelled data is included in the first term of the minimization objective (point 2 below), which leads to pessimistic minimization of the DL over the unlabelled data, but maximization of the DL over the labelled data. 
(Note that the DL is equivalent to the negative log loss for binary classifiers with probabilistic predictions - see below.)

![CPLE Equation](alg1.png)

The resulting semi-supervised learning framework is highly computationally expensive, but has the advantages of being generally applicable, needing little memory, and making no additional assumptions except for the ones made by the choice of classifier.
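The equivalence between the discriminative likelihood and the negative log loss noted above can be checked numerically. In this quick sketch (the toy labels and probabilities are arbitrary), the DL is the sum of log-probabilities of the true labels, while scikit-learn's `log_loss` is the *mean* negative log-probability, so DL = -n * log loss:

```python
import numpy as np
from sklearn.metrics import log_loss

# arbitrary toy predictions for a binary problem
y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])  # predicted P(y=1 | x)

# discriminative (log-)likelihood: sum of log-probabilities of the true labels
dl = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(np.isclose(dl, -len(y) * log_loss(y, p)))  # True
```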