{"id":37071577,"url":"https://github.com/jangedoo/jange","last_synced_at":"2026-01-14T08:23:56.379Z","repository":{"id":46950714,"uuid":"278116917","full_name":"jangedoo/jange","owner":"jangedoo","description":"Easy NLP in Python","archived":false,"fork":false,"pushed_at":"2021-09-21T18:01:45.000Z","size":2158,"stargazers_count":18,"open_issues_count":5,"forks_count":4,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-22T22:09:55.884Z","etag":null,"topics":["clustering","nlp","nlp-library","python3","text","text-classification","text-preprocessing","topic-modeling","visualization"],"latest_commit_sha":null,"homepage":"https://jange.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jangedoo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-07-08T14:49:35.000Z","updated_at":"2025-07-11T00:48:39.000Z","dependencies_parsed_at":"2022-09-02T18:53:30.090Z","dependency_job_id":null,"html_url":"https://github.com/jangedoo/jange","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jangedoo/jange","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jangedoo%2Fjange","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jangedoo%2Fjange/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jangedoo%2Fjange/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jangedoo%2Fjange/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jangedoo","download_url":"https://code
load.github.com/jangedoo/jange/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jangedoo%2Fjange/sbom","scorecard":{"id":505052,"data":{"date":"2025-08-11","repo":{"name":"github.com/jangedoo/jange","commit":"7f6ee5c341f417cae9e60318fb00716b39b02c00"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":1.7,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated 
executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: 
project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":0,"reason":"44 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: PYSEC-2021-421 / GHSA-h4m5-qpfp-3mpv","Warn: Project is vulnerable to: PYSEC-2024-48 / GHSA-fj7x-q9j7-g6q6","Warn: Project is vulnerable to: PYSEC-2022-42986 / GHSA-43fp-rhv2-5gv8","Warn: Project is vulnerable to: PYSEC-2023-135 / GHSA-xqr8-7jwr-rhp7","Warn: Project is vulnerable to: PYSEC-2024-60 / GHSA-jjg7-2v4v-x38h","Warn: Project is vulnerable to: GHSA-29gw-9793-fvw7","Warn: Project is vulnerable to: PYSEC-2022-12 / GHSA-pq7m-3gw7-gq5x","Warn: Project is vulnerable to: GHSA-cpwx-vrp4-4pq7","Warn: Project is vulnerable to: PYSEC-2021-66 / GHSA-g3rq-g295-4j3m","Warn: Project is vulnerable to: GHSA-h5c8-rqwp-cp95","Warn: Project is vulnerable to: GHSA-h75v-3vvj-5mfj","Warn: Project is vulnerable to: GHSA-q2x7-8rv6-6q7h","Warn: Project is vulnerable to: PYSEC-2022-288 / GHSA-6hrg-qmvc-2xh8","Warn: Project is vulnerable to: 
GHSA-33p9-3p43-82vq","Warn: Project is vulnerable to: PYSEC-2022-42974 / GHSA-m678-f26j-3hrp","Warn: Project is vulnerable to: GHSA-6p56-wp2h-9hxr","Warn: Project is vulnerable to: GHSA-fpfv-jqm9-f5jm","Warn: Project is vulnerable to: PYSEC-2021-856","Warn: Project is vulnerable to: PYSEC-2020-92 / GHSA-hj5v-574p-mj7c","Warn: Project is vulnerable to: PYSEC-2022-42969","Warn: Project is vulnerable to: PYSEC-2021-140 / GHSA-9w8r-397f-prfh","Warn: Project is vulnerable to: PYSEC-2023-117 / GHSA-mrwq-x4v8-fh7p","Warn: Project is vulnerable to: PYSEC-2021-141 / GHSA-pq64-v7f5-gqh8","Warn: Project is vulnerable to: PYSEC-2021-112 / GHSA-hwfp-hg2m-9vr2","Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7","Warn: Project is vulnerable to: GHSA-9wx4-h78v-vm56","Warn: Project is vulnerable to: PYSEC-2023-74 / GHSA-j8r2-6x86-q33q","Warn: Project is vulnerable to: PYSEC-2024-110 / GHSA-jw8x-6495-233v","Warn: Project is vulnerable to: PYSEC-2020-108","Warn: Project is vulnerable to: PYSEC-2023-102","Warn: Project is vulnerable to: PYSEC-2023-114","Warn: Project is vulnerable to: GHSA-753j-mpmx-qq6g","Warn: Project is vulnerable to: GHSA-7cx3-6m66-7c5m","Warn: Project is vulnerable to: GHSA-8w49-h785-mj3c","Warn: Project is vulnerable to: PYSEC-2023-75 / GHSA-hj3f-6gcp-jg8j","Warn: Project is vulnerable to: GHSA-qppv-j76h-2rpx","Warn: Project is vulnerable to: GHSA-w235-7p84-xx57","Warn: Project is vulnerable to: GHSA-g7vv-2v7x-gj9p","Warn: Project is vulnerable to: GHSA-34jh-p97f-mpxf","Warn: Project is vulnerable to: PYSEC-2023-212 / GHSA-g4mx-q9vg-27p4","Warn: Project is vulnerable to: GHSA-pq67-6m6q-mj2v","Warn: Project is vulnerable to: PYSEC-2021-108 / GHSA-q2q7-5pp4-w6pg","Warn: Project is vulnerable to: PYSEC-2023-192 / GHSA-v845-jxx5-vc9f","Warn: Project is vulnerable to: GHSA-jfmj-5v4g-7637"],"documentation":{"short":"Determines if the project has open, known unfixed 
vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 5 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T23:05:45.772Z","repository_id":46950714,"created_at":"2025-08-19T23:05:45.772Z","updated_at":"2025-08-19T23:05:45.772Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28413867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:16:59.381Z","status":"ssl_error","status_checked_at":"2026-01-14T08:13:45.490Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["clustering","nlp","nlp-library","python3","text","text-classification","text-preprocessing","topic-modeling","visualization"],"created_at":"2026-01-14T08:23:55.856Z","updated_at":"2026-01-14T08:23:56.357Z","avatar_url":"https://github.com/jangedoo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# jange\n[![Build 
Status](https://travis-ci.org/jangedoo/jange.svg?branch=master)](https://travis-ci.org/jangedoo/jange)\n------\njange is an easy-to-use NLP library for Python. It provides a high-level API for common NLP applications. It is built on popular libraries such as `pandas`, `scikit-learn`, `spacy`, and `plotly`.\n\nTo get started, install using pip.\n```\npip install jange\n```\n\n# Overview\nIn a typical NLP application, we clean the data, extract features, and apply an ML algorithm to those features. In other words, we apply a series of transformations to the raw input to get the results we want. **jange** organizes these transformations as a series of operations applied to the input. The high-level API for these transformations is easy to read and reason about. Take a look at the clustering example below. Even without any explanation, you should be able to read and understand what is happening without referring to an hour-long tutorial or wrapping your head around multi-dimensional array slicing and dicing. Let's not forget the pain of migrating code from prototyping to production. 
**jange** tries to simplify the transition from the experimental phase to production use.\n\n\n```python\n# %% Load data\nimport pickle\n\nfrom jange import ops, stream, vis\n\nds = stream.from_csv(\n    \"https://raw.githubusercontent.com/jangedoo/jange/master/dataset/bbc.csv\",\n    columns=\"news\",\n    context_column=\"type\",\n)\n\n# %% Extract clusters\nresult_collector = {}\nclusters_ds = ds.apply(\n    ops.text.clean.pos_filter(\"NOUN\", keep_matching_tokens=True),\n    ops.text.encode.tfidf(max_features=5000, name=\"tfidf\"),\n    ops.cluster.minibatch_kmeans(n_clusters=5),\n    result_collector=result_collector,\n)\n\n# %% Get features extracted by tfidf\nfeatures_ds = result_collector[clusters_ds.applied_ops.find_by_name(\"tfidf\")]\n\n# %% Visualization\nreduced_features = features_ds.apply(ops.dim.tsne(n_dim=2))\nvis.cluster.visualize(reduced_features, clusters_ds)\n\n# visualization looks good, let's export the operations\nwith ops.utils.disable_training(clusters_ds.applied_ops) as cluster_ops:\n    with open(\"cluster_ops.pkl\", \"wb\") as f:\n        pickle.dump(cluster_ops, f)\n\n# in_another_file.py\n# load the saved operations and apply them to a new stream to retrieve the clusters\n# (input_ds is a new DataStream holding the data to cluster)\nwith open(\"cluster_ops.pkl\", \"rb\") as f:\n    cluster_ops = pickle.load(f)\n\nclusters_ds = input_ds.apply(cluster_ops)\n```\n![Cluster](https://sanjayasubedi.com.np/assets/images/nlp/clustering/cluster_jange.png)\n\n\n\nLooks convincing?\n\n# What can jange do for me?\njange is designed for rapid prototyping and **deployment**. It supports\n\n- Data cleaning: remove stop words, emails, links, numbers, filter tokens based on POS or any filter operation using spacy's token matcher patterns. 
It provides a high-level API to spacy's token-based `Matcher`.\n- Text encoding: High level API for encoding texts as one-hot, count or tf-idf features using scikit-learn models\n- Embedding: Document embedding based on spacy's language model that captures the semantics of the text\n- Clustering: High level API for several clustering algorithms in the scikit-learn library\n- Topic modeling: High level API for commonly used topic modeling algorithms (NMF, LDA)\n- Nearest neighbors: High level API for finding similar pairs or groups of similar items\n- Classification: High level API to train spacy's model or many of scikit-learn's classifiers\n- Dimension reduction: High level API for algorithms used to reduce the dimension of the feature space. Useful for visualization (tsne, pca) or compression\n- Extraction: High level API to extract sentences or summaries from texts\n\nAnd many more, including visualization, operation persistence and quick apps.\n\n# Basic Concept\n## DataStream\nA DataStream is a holder of your data. The data can be lazily loaded or held in memory. A DataStream is nothing more than a list of items together with some optional context. For example,\n```python\nfrom jange import stream\n# create a stream from any Python object, including lists, numpy arrays, generators etc.\nds = stream.DataStream(items=[\"Product 1\", \"Product 2\"], context=[\"pid1\", \"pid2\"])\n# a few helper functions\nstream.from_csv(\"path/to/csv\")\nstream.from_df(df)  # df is an existing pandas DataFrame\n```\n`ds` is a data stream that holds your data along with some context, in this case the database IDs of the products. The context holds metadata about the items you pass. If you don't pass a context, jange creates context values for each item internally. A DataStream also records the operations that have been applied to it in the `applied_ops` attribute. 
For example, if you apply a few cleaning operations, one tf-idf operation, and a topic modeling operation to an input stream, the final `DataStream` containing the topic modeling output knows about every operation applied since the beginning. You can reuse them on a new raw input stream, and exactly the same operations will be applied to the new input.\n\n\n## Operations\nTransformations of the input data are performed by `Operation` objects in jange. Each operation takes in a DataStream and produces a DataStream. When several operations are applied to a DataStream, each one executes and passes its result to the next. The example below shows how you can apply different operations. Of course, the output of an operation must be compatible with the input that the next operation expects.\n\nOperations in **jange** are available under the `ops` subpackage. They are organized into modules by scope. For example, operations that work on text input are under `ops.text`. Cleaning operations are under `ops.text.clean`, and operations for encoding texts into vectors or embeddings are under `ops.text.encode` and `ops.text.embedding`.\n\nFor clustering, topic modeling, classification etc. 
they can be found under `ops.cluster`, `ops.topic`, `ops.classify` etc.\n```python\nimport pickle\n\nfrom jange import ops, stream\n\ninput_ds = stream.from_csv(\n    \"https://raw.githubusercontent.com/jangedoo/jange/master/dataset/bbc.csv\",\n    columns=\"news\",\n    context_column=\"type\",\n)\nclusters_ds = input_ds.apply(\n    ops.text.clean.pos_filter(\"NOUN\", keep_matching_tokens=True),\n    ops.text.encode.tfidf(max_features=5000, name=\"tfidf\"),\n    ops.cluster.minibatch_kmeans(n_clusters=5)\n)\n\n# once we are happy with the results, save the operations to disk\nwith ops.utils.disable_training(clusters_ds.applied_ops) as cluster_ops:\n    with open(\"cluster_ops.pkl\", \"wb\") as f:\n        pickle.dump(cluster_ops, f)\n\n# in_another_file.py\n# load the saved operations and apply them to a new stream to retrieve the clusters\nwith open(\"cluster_ops.pkl\", \"rb\") as f:\n    cluster_ops = pickle.load(f)\n\nclusters_ds = input_ds.apply(cluster_ops) # WOW! this easy for production? 👍\n```\n\n### How does it work?\n`Operation` has a very simple interface with one method, `run(ds: DataStream) -\u003e DataStream`. When an operation processes an input DataStream and produces an output, it adds itself and the `applied_ops` of the input DataStream to the `applied_ops` of the output DataStream. From the code above, if we print out `clusters_ds.applied_ops`, we'll see a list of 3 operations, so we know exactly which operations were applied to produce this output. Each operation also makes sure the context is passed to the output appropriately. This is important when an operation discards one or more items from the input. If you rely solely on array indexing, the mapping from output to original input index is no longer valid once items have been removed, and you can no longer tell which output maps to which original input. Context maintains the mapping to the original data.\n\nWhat about operations that need to be trained? These operations use `TrainableMixin`, which has an attribute `should_train`. 
By default it is `True`, so when you run a trainable operation it trains the underlying model (sklearn's or spacy's) and then makes predictions. In production you don't want to retrain, so the helper `ops.utils.disable_training` sets `should_train` to `False` for all trainable operations. As shown in the example above, you can save these operations to disk, and the next time you load them you can run them without training any \"trainable\" operation again.\n\nCare has been taken to make sure that operations can be pickled without saving unnecessary data. For example, instead of pickling spacy's language model, the operation saves only the path to the model, and when the operation is unpickled, spacy's model is loaded from that path. Model loading is also cached, so repeated attempts to load the same spacy model use the cached version instead of loading from disk.\n\nSince the API is so simple, you can easily extend it to fit your requirements.\n\nCheck out the [API Reference](https://jange.readthedocs.io/en/latest/api/index.html) for more details.\n\n# Installation\n```\npip install jange\n```\n\n## From source\nThis project uses poetry to manage dependencies. If you don't already have poetry installed, go to https://python-poetry.org/docs/#installation for instructions on how to install it for your OS.\n\nOnce poetry is installed, run `poetry install` from the root directory of this project. It will create a virtual environment for this project and install the necessary dependencies (including dev dependencies).\n\n\n# Contributions 👩‍💻\nThis library is in a very early stage. Your perspective on how things could be done or improved would be greatly appreciated. Since this is an early-stage project, you'll most probably encounter some bugs and issues. 
Please let us know by opening an issue, or, if you know Python, contribute a fix yourself!","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjangedoo%2Fjange","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjangedoo%2Fjange","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjangedoo%2Fjange/lists"}