{"id":15882680,"url":"https://github.com/pr38/dask_tfidf","last_synced_at":"2026-05-02T23:34:09.075Z","repository":{"id":163693204,"uuid":"639135388","full_name":"pr38/dask_tfidf","owner":"pr38","description":"A Dask native implementation of 'Term Frequency Inverse Document Frequency' for dask-ml and scikit-learn","archived":false,"fork":false,"pushed_at":"2023-06-12T20:41:26.000Z","size":8,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-15T17:47:54.330Z","etag":null,"topics":["dask","dask-ml","distributed-computing","machine-learning","python","scikit-learn"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pr38.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-10T20:48:10.000Z","updated_at":"2024-03-14T00:10:06.000Z","dependencies_parsed_at":"2024-03-15T13:55:52.411Z","dependency_job_id":null,"html_url":"https://github.com/pr38/dask_tfidf","commit_stats":{"total_commits":6,"total_committers":2,"mean_commits":3.0,"dds":0.5,"last_synced_commit":"f0e6af943c11276a899a71fa8539bb3ae3b2929c"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr38%2Fdask_tfidf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr38%2Fdask_tfidf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr38%2Fdask_tfidf/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pr38%2Fdask_tfidf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pr38","download_url":"https://codeload.github.com/pr38/dask_tfidf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246763863,"owners_count":20829798,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dask","dask-ml","distributed-computing","machine-learning","python","scikit-learn"],"created_at":"2024-10-06T04:06:23.330Z","updated_at":"2025-10-19T02:51:27.651Z","avatar_url":"https://github.com/pr38.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dask_tfidf\nA Dask native implementation of 'Term Frequency Inverse Document Frequency' for dask-ml and scikit-learn\n\nInstall\n-------\n\u003epip install dask-tfidf\n\nThis project provides a single class, DaskTfidfTransformer, which is more or less a dask equivalent of sklearn's TfidfTransformer.\nIt expects a dask array of token counts, like the kind that dask_ml's CountVectorizer class creates.\nDaskTfidfTransformer has the same parameters/hyperparameters as sklearn's TfidfTransformer, namely 'norm', 'use_idf', 'smooth_idf' and 'sublinear_tf'.\nDaskTfidfTransformer's output should be nearly identical to TfidfTransformer's; there will be some very slight floating point differences (see tests). 
I believe these differences are due to my use of the sparse library's COO implementation and dask arrays, as opposed to sklearn's use of scipy's COO and numpy arrays.\n\nI have also included a 'persist_idf_array' parameter, which persists the IDF array for faster transformation after fitting. As with all dask-ml workloads, I recommend persisting the input array before any computation (if you have the memory for it). I also recommend running \"compute_chunk_sizes\" on your dask arrays before running this class.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpr38%2Fdask_tfidf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpr38%2Fdask_tfidf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpr38%2Fdask_tfidf/lists"}