Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/pr38/dask_tfidf
A Dask native implementation of 'Term Frequency Inverse Document Frequency' for dask-ml and scikit-learn
dask dask-ml distributed-computing machine-learning python scikit-learn
- Host: GitHub
- URL: https://github.com/pr38/dask_tfidf
- Owner: pr38
- License: bsd-3-clause
- Created: 2023-05-10T20:48:10.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-06-12T20:41:26.000Z (over 1 year ago)
- Last Synced: 2024-11-15T10:49:33.206Z (about 1 month ago)
- Topics: dask, dask-ml, distributed-computing, machine-learning, python, scikit-learn
- Language: Python
- Homepage:
- Size: 7.81 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# dask_tfidf
A Dask native implementation of 'Term Frequency Inverse Document Frequency' for dask-ml and scikit-learn.

Install
-------
>pip install dask-tfidf

This project provides a DaskTfidfTransformer class, which is more or less a Dask equivalent of sklearn's TfidfTransformer.
It expects a Dask array of token counts, like the kind that dask_ml's CountVectorizer class creates.
DaskTfidfTransformer has the same parameters/hyperparameters as sklearn's TfidfTransformer, namely 'norm', 'use_idf', 'smooth_idf' and 'sublinear_tf'.
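A minimal usage sketch under a few assumptions: the import path `from dask_tfidf import DaskTfidfTransformer` is not shown in this README excerpt and is assumed here, as is the usual sklearn-style fit/fit_transform API.

```python
# Minimal sketch: build token counts with dask_ml's CountVectorizer, then
# apply DaskTfidfTransformer. The dask_tfidf import path is an assumption.
import dask.bag as db
from dask_ml.feature_extraction.text import CountVectorizer
from dask_tfidf import DaskTfidfTransformer  # assumed import path

# dask_ml's CountVectorizer takes a Dask bag of raw documents.
docs = db.from_sequence(
    ["the quick brown fox", "the lazy dog", "the quick dog"],
    npartitions=2,
)

# Dask array of token counts (sparse chunks), the input this class expects.
counts = CountVectorizer().fit_transform(docs)

# Same hyperparameters as sklearn's TfidfTransformer.
tfidf = DaskTfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False)
weighted = tfidf.fit_transform(counts)

print(weighted.compute())
```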
DaskTfidfTransformer's output should be nearly identical to TfidfTransformer's; there will be some very slight floating point differences (see tests). I believe these differences are due to my use of the sparse library's COO implementation and Dask arrays, as opposed to sklearn's use of scipy's COO and numpy arrays.

I have also included a 'persist_idf_array' parameter, which persists the IDF array for faster transformation after fitting. As with all dask-ml workloads, I recommend persisting the input array before any computation (if you have the memory for it). I also recommend running "compute_chunk_sizes" on your dask arrays before using this class.
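A sketch of those recommendations, under the same assumptions as the previous example (sklearn-style transformer interface, and 'persist_idf_array' accepted as a constructor argument):

```python
# Sketch of the recommendations above (same assumptions as the previous example).
counts = counts.persist()               # keep the count matrix in (distributed) memory
counts = counts.compute_chunk_sizes()   # resolve unknown chunk sizes before fitting

# persist_idf_array keeps the fitted IDF array around for faster transforms.
tfidf = DaskTfidfTransformer(persist_idf_array=True)
weighted = tfidf.fit(counts).transform(counts)
```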