{"id":16392375,"url":"https://github.com/dpressel/sgdtk","last_synced_at":"2025-03-23T04:31:45.596Z","repository":{"id":23003849,"uuid":"26354511","full_name":"dpressel/sgdtk","owner":"dpressel","description":"A Java library for Stochastic Gradient Descent (SGD)","archived":false,"fork":false,"pushed_at":"2021-11-01T01:31:54.000Z","size":223,"stargazers_count":21,"open_issues_count":2,"forks_count":14,"subscribers_count":7,"default_branch":"master","last_synced_at":"2025-03-18T17:56:54.767Z","etag":null,"topics":["crf","java","logistic-regression","sgd","svm"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpressel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-11-08T08:28:04.000Z","updated_at":"2024-12-03T09:48:10.000Z","dependencies_parsed_at":"2022-08-21T18:10:17.606Z","dependency_job_id":null,"html_url":"https://github.com/dpressel/sgdtk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fsgdtk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fsgdtk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fsgdtk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpressel%2Fsgdtk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpressel","download_url":"https://codeload.github.com/dpressel/sgdtk/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","re
positories_count":245056889,"owners_count":20553855,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crf","java","logistic-regression","sgd","svm"],"created_at":"2024-10-11T04:49:43.281Z","updated_at":"2025-03-23T04:31:45.165Z","avatar_url":"https://github.com/dpressel.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"sgdtk\n=====\n\n# A library for Stochastic Gradient Descent\n\n## Design\n\nThe initial goal was a simple, modular implementation of Leon Bottou's SGD experiments as a Java library that can be \nextended and used from within application code.  It contains more stuff in some areas and less in others than the SGD \nexperiments, but the results do match the original's results where they overlap.  This code is designed with a\nrelatively simple API to enable embedding into applications and facilitate some reuse or extension while providing a\nclear concise implementation of SGD.  Attempts are made to encourage the JVM to optimize things wherever possible.\n\nThe design here is notionally split into two types of learning problems, unstructured classifiers (e.g., hinge-loss,\nlog-loss etc) and structured classifiers (currently CRFs only as implemented in the SGD experiments).  \nI have tried to keep the interfaces similar for both types of problems.  
It should be possible to add additional\nlearners by extending the Learner/SequentialLearner and the Model/SequentialModel.\n\nThe code supports fast out-of-core processing, inspired by VW, where a thread loads the data from file, \nadds it to a ring buffer, and a processor trains on the data.  For multiple passes, the data is reincarnated from a \ncache file (again, like VW) and loaded back onto the ring buffer from the cache.\n\nThere is support for OVA (one-vs-all) multi-class classification, which is implemented on top of the base routines.\nThe interface follows the same patterns as binary.  In the case of multi-class classification, the labels will \nnot be -1 or 1, but an integer value from 1 ... numClasses stored in the y value of the feature vector.  \nThe scores can be retrieved using the Model.score() function, which gives an array where each index into the \narray represents the class integer value (-1 to make it zero based).\n\nThe library was developed and tested in IntelliJ using Java 8, but can be built, installed and run from Maven or Gradle and \nshould work on lower Java versions.  The only dependencies in the library currently are JCommander for easy command\nline parsing, slf4j/logback for logging, and LMAX Disruptor for fast contention-free ring buffers.\n\nYou can find more background on this project here:\n\nhttps://rawgit.com/dpressel/Meetups/master/nlp-meetup-2016-02-25/presentation.html\n\n## Loss Functions\n\nThe usual suspects here: hinge, log, squared-hinge, squared (L2) (which is absent from SGD)\n\n## Performance\n\nA significant amount of time has gone into profiling the code and optimizing performance.  \nThe primary bottleneck for performance on large datasets using a good SGD linear classifier implementation\ntends to be the IO portion (not the computation). Note that this is true in Leon Bottou's SGD code, which reads \nall of the data into memory upfront. 
Due to the IO bottleneck, I tried to ensure that reading the input file is as\nfast as possible and, as in VW, overlapping the IO and the computation via a shared ring buffer allows simultaneous\nreading/loading and processing.\n\nRegarding the computational aspect, this employs many of the tricks from Leon Bottou's original implementation which makes\nit significantly faster than naive implementations (though perhaps more complex).\n\nI considered switching the basic linear algebra routines over to use jblas, but due to the overhead of JNI transfer,\nthe native operations are actually slower and the jblas package's JavaBlas class (which performs the typical BLAS\noperations in Java) is equivalent to what is performed here, so for simplicity, all operations are performed within the library.\n\n## Simple example (binary SVM)\n\n```{java}\nModelFactory modelFactory = new LinearModelFactory();\nLearner learner = new SGDLearner(new HingeLoss(), lambda, eta, modelFactory);\nint featureVectorSz = reader.getLargestVectorSeen();\nModel model = learner.create(featureVectorSz);\n\ndouble totalTrainingElapsed = 0.;\n\nfor (int i = 0; i \u003c params.epochs; ++i)\n{\n    Collections.shuffle(trainingSet);\n    System.out.println(\"~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\");\n    System.out.println(\"EPOCH: \" + (i + 1));\n    Metrics metrics = new Metrics();\n    double t0 = System.currentTimeMillis();\n    learner.trainEpoch(model, trainingSet);\n    double elapsedThisEpoch = (System.currentTimeMillis() - t0) /1000.;\n    System.out.println(\"Epoch training time \" + elapsedThisEpoch + \"s\");\n    totalTrainingElapsed += elapsedThisEpoch;\n\n    learner.eval(model, trainingSet, metrics);\n    showMetrics(metrics, \"Training Set Eval Metrics\");\n    metrics.clear();\n\n    if (evalSet != null)\n    {\n        learner.eval(model, evalSet, metrics);\n        showMetrics(metrics, \"Test Set Eval Metrics\");\n    }\n} \n\nSystem.out.println(\"Total 
training time \" + totalTrainingElapsed + \"s\");\nmodel.save(new FileOutputStream(\"svm.model\"));\n\n```\n\n## Example showing overlapped IO\n\n```{java}\nModelFactory modelFactory = new LinearModelFactory();\nLearner learner = new SGDLearner(lossFunction, lambda, eta, modelFactory);\nOverlappedTrainingRunner asyncTrainer = new OverlappedTrainingRunner(learner);\nasyncTrainer.setEpochs(params.epochs);\nasyncTrainer.setBufferSz(params.bufferSize);\nasyncTrainer.setLearnerUserData(featureVectorSz);\n\nSVMLightFileFeatureProvider evalReader = new SVMLightFileFeatureProvider();\nList\u003cFeatureVector\u003e evalSet = evalReader.load(new File(params.eval));\n\nasyncTrainer.addListener(new TrainingEventListener()\n{\n    @Override\n    public void onEpochEnd(Learner learner, Model model, double sec)\n    {\n        if (evalSet != null)\n        {\n            Metrics metrics = new Metrics();\n            learner.eval(model, evalSet, metrics);\n            showMetrics(metrics, \"Test Set Eval Metrics\");\n        }\n    }\n});\n\nasyncTrainer.start();\n            \nSVMLightFileFeatureProvider fileReader = new SVMLightFileFeatureProvider();\nfileReader.open(trainFile);\n\nFeatureVector fv;\n\nwhile ((fv = fileReader.next()) != null)\n{\n    asyncTrainer.add(fv);\n}\n\nModel model = asyncTrainer.finish();\ndouble elapsed = (System.currentTimeMillis() - t0) / 1000.;\nSystem.out.println(\"Overlapped training completed in \" + elapsed + \"s\");\nmodel.save(new FileOutputStream(\"svm.model\"));\n\n```\n\n## Other examples\n\nThere are some complete command line programs contained in the 'exec' area that can be used for different types of simple tasks, but this is mainly intended as a library that you can use to integrate SGD into your own applications.  I used this library to implement the NBSVM algorithm using SGD and making use of overlapped IO (https://github.com/dpressel/nbsvm-xl).  
I also wrote a simple Torch 'nn'-like neural net package in Java which depends on this library (https://github.com/dpressel/n3rd).\n\n## Building\n\n* `./gradlew build`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpressel%2Fsgdtk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpressel%2Fsgdtk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpressel%2Fsgdtk/lists"}