{"id":33253385,"url":"https://github.com/anttttti/Wordbatch","last_synced_at":"2025-11-21T18:02:56.911Z","repository":{"id":62589373,"uuid":"66829087","full_name":"anttttti/Wordbatch","owner":"anttttti","description":"Python library for distributed AI processing pipelines, using swappable scheduler backends.","archived":false,"fork":false,"pushed_at":"2021-12-29T20:30:47.000Z","size":1447,"stargazers_count":416,"open_issues_count":13,"forks_count":60,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-09-22T16:28:38.005Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/anttttti.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-08-29T09:14:58.000Z","updated_at":"2025-09-19T19:35:26.000Z","dependencies_parsed_at":"2022-11-03T20:27:24.377Z","dependency_job_id":null,"html_url":"https://github.com/anttttti/Wordbatch","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/anttttti/Wordbatch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anttttti%2FWordbatch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anttttti%2FWordbatch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anttttti%2FWordbatch/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anttttti%2FWordbatch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/anttttti","download_url":"https://codeload.github.com/anttttti/Wordbatch/tar.gz/refs/heads/maste
r","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/anttttti%2FWordbatch/sbom","scorecard":{"id":199912,"data":{"date":"2025-08-11","repo":{"name":"github.com/anttttti/Wordbatch","commit":"82c1cbec04f1848b4efe58f4a05d2c650721114c"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.2,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are 
merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: containerImage not pinned by hash: Dockerfile:1","Info:   0 out of   1 containerImage dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Security-Policy","score":0,"reason":"security policy file not 
detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: GNU General Public License v2.0: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":-1,"reason":"internal error: error during branchesHandler.setup: internal error: githubv4.Query: Resource not accessible by integration","details":null,"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-16T22:38:09.704Z","repository_id":62589373,"created_at":"2025-08-16T22:38:09.705Z","updated_at":"2025-08-16T22:38:09.705Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285663970,"owners_count":27210638,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-21T02:00:06.175Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-11-17T01:00:33.920Z","updated_at":"2025-11-21T18:02:56.903Z","avatar_url":"https://github.com/anttttti.png","language":"Python","readme":"===============\nWordbatch 1.4.9\n===============\n\nOverview\n========\n\nPython library for distributed AI processing pipelines, using swappable scheduler backends.\n\nWordbatch parallelizes task pipelines as minibatches processed by a chosen scheduler backend. This allows\nthe user to develop AI programs on a local workstation or laptop, and scale the same\nsolution on a cluster or the cloud, simply by changing the pipeline backend to a distributed scheduler such as Spark,\nDask or Ray. A backend can be chosen based on performance characteristics on a particular task, and swapped for\ndifferent situations. 
For example, an AI model can be trained using a distributed backend, and then debugged or\ndeployed using a single serial process.\n\nThe library is organized around the orchestrator class Batcher, and Sklearn-compatible components,\nsplit into Pipelines, Transformers, Extractors and Models. These extend the Scikit-learn API with a\nfit_partial() method, which enables transformers and models to be used in a streaming fashion.\nThe current set of components has been developed mostly for text processing tasks, but components for other domains\ncan be developed based on the available classes.\n\nRequirements\n============\nLinux / Windows / macOS. Python 3.6 / 3.7\n\nInstallation\n============\npip install wordbatch\n\nmacOS: compile using GCC-7 (https://github.com/anttttti/Wordbatch/issues/1)\n\nLinux: make sure GCC and its required libraries are installed before installing Wordbatch\n| sudo apt install gcc\n| sudo apt-get update\n| sudo apt-get install --reinstall build-essential\n\nGetting started\n===============\n\n| from wordbatch.models import FTRL\n| from wordbatch.extractors import WordBag\n| from wordbatch.pipelines import WordBatch\n| from wordbatch.batcher import Batcher\n|\n| wb= WordBatch(extractor=WordBag(hash_ngrams=0, norm= 'l2', tf= 'binary', idf= 50.0),\n|               batcher=Batcher(backend=\"multiprocessing\"))\n|\n| clf= FTRL(alpha=1.0, beta=1.0, L1=0.00001, L2=1.0, D=2 ** 25, iters=1)\n|\n| train_texts= [\"Cut down a tree with a herring? It can't be done.\",\n|              \"Don't say that word.\",\n|              \"How can we not say the word if you don't tell us what it is?\"]\n| train_labels= [1, 0, 1]\n| test_texts= [\"Wait! I said it! I said it! Ooh! 
I said it again!\"]\n|\n| clf.fit(wb.fit_transform(train_texts), train_labels)\n| print(clf.predict(wb.transform(test_texts)))\n|\n| import ray\n| ray.init()\n| wb.batcher.backend= \"ray\"\n| wb.batcher.backend_handle= ray\n|\n| clf.fit(wb.fit_transform(train_texts), train_labels)\n| print(clf.predict(wb.transform(test_texts)))\n\n\nComponents\n==========\n\nBatcher\n-------\nBatcher orchestrates MapReduce processing of tasks using a backend by splitting input data into separately processed\nminibatches. Currently three local backends (serial, multiprocessing, Loky) and three distributed backends (Spark, Dask,\nRay) are supported. Some distributed backends will process the tasks concurrently as a graph of lazily evaluated\nfutures, with Batcher dynamically sending the graph for the backend to process. All three supported distributed\nbackends allow real-time monitoring of the processing pipeline using the backend's own GUI.\n\n\nPipelines\n---------\nPipelines are classes that send functions, methods and classes to Batcher for processing. Unlike other components in\nWordbatch, pipelines contain a reference to Batcher, and are never referenced themselves in the calls sent to Batcher.\nThis prevents trying to serialize and send the backend handle itself. The simplest pipeline is Apply,\nwhich processes a function or method over the input data row-by-row. WordBatch is a full complex pipeline for text\nprocessing, with optional steps such as text normalization, spelling correction, stemming, feature extraction, and\nLZ4-caching of results.\n\n\nTransformers\n------------\nTransformers are transformer classes extending the Scikit-learn API by accepting a Batcher instance as an argument\nof the fit and transform methods. Transformers won't store Batcher references, allowing the transformer objects to be sent\nto distributed workers. 
This allows transformers to do MapReduce operations as part of their methods, for example\ngathering a dictionary of words from data when fitting a Dictionary. The current transformers are\ntext-specific classes, such as Dictionary, Tokenizer and TextNormalizer.\n\n\nExtractors\n----------\nExtractors are transformer classes which don't directly call Batcher. Since they can't call Batcher directly,\nthey are mostly immutable, and their transform() method calls are distributed using a pipeline. The current set of\nextractors is Cython-optimized and, aside from PandasHash, intended for text feature extraction. These are:\n\n- WordHash is a wrapper for the Scikit-learn HashingVectorizer, extended with an option for LZ4-caching\n- WordBag is a flexible alternative to WordHash, with options such as IDF and per n-gram order weighting\n- WordSeq provides sequences of word integers, as used by deep learning language models\n- WordVec embeds words into word vector representations\n- PandasHash extracts hashed features from a Pandas DataFrame, similar to VowpalWabbit's feature extraction\n\n\nModels\n------\nModels are predictive models such as classifiers. Similar to extractors, they don't directly call Batcher, but are\nScikit-learn compatible and distributed using a pipeline if needed. Currently four\nOpenMP-multithreaded L1\u0026L2-regularized online learning models are provided, for single-label regression and\nclassification:\n\n- FTRL: Linear model Proximal-FTRL that has become the most popular algorithm for online learning of linear models in Kaggle competitions. The Cython-optimized implementation should be the fastest available version of FTRL.\n- FM_FTRL: Factorization Machines. Linear effects estimated with FTRL and factor effects estimated with adaptive SGD. Prediction and estimation multithreaded across factors.\n- NN_Relu_H1: Neural Network with 1 hidden layer and Rectified Linear Unit activations, estimated with adaptive SGD. 
Prediction and estimation multithreaded across hidden layer.\n- NN_Relu_H2: Neural Network with 2 hidden layers and Rectified Linear Unit activations, estimated with adaptive SGD. Prediction multithreaded across 2nd hidden layer, estimation across 1st hidden layer outputs.\n\nThe adaptive SGD optimizer works like Adagrad, but pools the adaptive learning rates across hidden nodes using the same\nfeature. This makes learning more robust and requires less memory. FM_FTRL uses AVX2-optimization, so that processors\nsupporting AVX2 will run the factorization model up to four times faster.\n\nExample scripts\n===============\n\nThe directory /scripts/ contains scripts for demonstrating and testing basic uses of the toolkit. To run the scripts\none should first install the dependencies: Keras, NLTK, TextBlob, Pandas, Ray, Dask Distributed and PySpark.\nThe scripts also use the TripAdvisor dataset (http://times.cs.uiuc.edu/~wang296/Data/), and the\nprecomputed word embeddings glove.twitter.27B.100d and glove.6B.50d (http://nlp.stanford.edu/projects/glove/). Test\ndata from Crowdflower Open data \u0026 Kaggle is provided in the /data directory.\n\nAirline Classification Example\n------------------------------\nclassify_airline_sentiment.py shows training and combining predictions with four classifier scripts that use the\nWordbatch extractors and models: wordhash_regressor.py, wordbag_regressor.py, wordseq_regressor.py and\nwordvec_regressor.py. The header part of the script can be modified to choose the backend. By default Ray is used and\npassed to the other scripts.\n\nBackends Benchmark Example\n--------------------------\nbackends_benchmark.py shows how to benchmark different backends on two simple pipeline tasks:\nusing ApplyBatch with Scikit-learn HashingVectorizer, and running WordBatch Pipeline with most of its possible\nprocessing steps. 
Dask and Spark are commented out by default, as these need command-line configuration.\nAll three distributed backends can be configured to run across a distributed cluster, as done in the\ncommented-out code.\n\n\nContributors\n============\nAntti Puurula\n\nAnders Topper\n\nCheng-Tsung Liu\n","funding_links":[],"categories":["Feature Extraction"],"sub_categories":["Text/NLP"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanttttti%2FWordbatch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fanttttti%2FWordbatch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fanttttti%2FWordbatch/lists"}