{"id":16739595,"url":"https://github.com/abhilash1910/clustertransformer","last_synced_at":"2025-07-31T22:13:10.136Z","repository":{"id":49656017,"uuid":"348426402","full_name":"abhilash1910/ClusterTransformer","owner":"abhilash1910","description":"Topic clustering library built on Transformer embeddings and cosine similarity metrics.Compatible with all BERT base transformers from huggingface.","archived":false,"fork":false,"pushed_at":"2021-06-11T10:30:41.000Z","size":30,"stargazers_count":43,"open_issues_count":0,"forks_count":15,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-18T05:34:50.444Z","etag":null,"topics":["albert","bert-embeddings","clustering","distilbert","embeddings","pytorch","pytorch-implementation","roberta-model","transformer"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/abhilash1910.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.TXT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-16T16:58:55.000Z","updated_at":"2024-12-27T04:40:18.000Z","dependencies_parsed_at":"2022-09-17T13:00:22.565Z","dependency_job_id":null,"html_url":"https://github.com/abhilash1910/ClusterTransformer","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhilash1910%2FClusterTransformer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhilash1910%2FClusterTransformer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/abhilash1910%2FClusterTransformer/releases","manifests_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/abhilash1910%2FClusterTransformer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/abhilash1910","download_url":"https://codeload.github.com/abhilash1910/ClusterTransformer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244880204,"owners_count":20525505,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["albert","bert-embeddings","clustering","distilbert","embeddings","pytorch","pytorch-implementation","roberta-model","transformer"],"created_at":"2024-10-13T00:52:21.485Z","updated_at":"2025-03-21T22:31:32.820Z","avatar_url":"https://github.com/abhilash1910.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ClusterTransformer\n\n\n## A Topic Clustering Library made with Transformer Embeddings :robot:\n\n\nThis is a topic clustering library built on transformer embeddings and the cosine similarity between them. 
Topics are clustered either with k-means or agglomeratively, depending on the use case, and the embeddings are obtained by running the text through any of the Transformer models available on [HuggingFace](https://huggingface.co/transformers/pretrained_models.html). The library is available on [PyPI](https://pypi.org/project/ClusterTransformer/).\n\n\n\n## Dependencies\n\n\u003ca href=\"https://pytorch.org/\"\u003ePyTorch\u003c/a\u003e\n\n\n\u003ca href=\"https://huggingface.co/transformers/\"\u003eTransformers\u003c/a\u003e\n\n\n\n\n\n## Usability\n\nInstall the library with pip:\n\n```python\npip install ClusterTransformer==0.1\n```\n\nTo use it inside a Jupyter Notebook or Python IDE:\n\n```python\nimport ClusterTransformer.ClusterTransformer as ct\n```\n\nThe 'ClusterTransformer_test.py' file contains an example of using the library in this context.\n\n\n### Usability Overview\n\nThe steps to operate this library are as follows:\n\nInitialise the class: ClusterTransformer()\n\nProvide the input list of sentences: in this case, the Quora similar questions dataframe has been used for experimental purposes.\n\nDeclare hyperparameters:\n\n- batch_size: Batch size for running model inference.\n- max_seq_length: Maximum sequence length for the transformer; longer inputs are truncated.\n- convert_to_numpy: If enabled, returns the embeddings as a numpy array; otherwise keeps them as a torch.Tensor.\n- normalize_embeddings: If set to True, enables normalization of the embeddings.\n- neighborhood_min_size: Used by the neighborhood_detection method; determines the minimum number of entries in each cluster.\n- cutoff_threshold: Used by the neighborhood_detection method; determines the cutoff cosine similarity score used to cluster the embeddings.\n- kmeans_max_iter: Hyperparameter for the kmeans_detection method: the maximum number of iterations allowed for convergence.\n- kmeans_random_state: Hyperparameter for the kmeans_detection method: the random initial state.\n- 
kmeans_no_cluster: Hyperparameter for the kmeans_detection method: the number of clusters.\n- model_name: Transformer model name; any transformer from the Huggingface pretrained library.\n\nCall the methods:\n\n- ClusterTransformer.model_inference: Creates the embeddings by running inference through any Transformer model (BERT, Albert, Roberta, Distilbert etc.). Returns a torch.Tensor containing the embeddings.\n- ClusterTransformer.neighborhood_detection: Performs agglomerative clustering on the embeddings created by the model_inference method. Returns a dictionary.\n- ClusterTransformer.kmeans_detection: Performs k-means clustering on the embeddings created by the model_inference method. Returns a dictionary.\n- ClusterTransformer.convert_to_df: Converts the dictionary returned by the neighborhood_detection/kmeans_detection methods into a dataframe.\n- ClusterTransformer.plot_cluster: Used for simple plotting of the clusters for each text topic.\n\n\n### Code Sample\n\nThe code below covers all the steps required to create the clusters. 
The 'compute_topics' function performs the following steps:\n\n- Instantiate the ClusterTransformer object\n- Specify the transformer name from the pretrained transformers\n- Specify the hyperparameters\n- Get the embeddings from the 'model_inference' method\n- For agglomerative neighborhood detection, use the 'neighborhood_detection' method\n- For kmeans detection, use the 'kmeans_detection' method\n- To convert the dictionary to a dataframe, use the 'convert_to_df' method\n- For optional plotting of the clusters w.r.t. corpus samples, use the 'plot_cluster' method\n\n```python\n%%time\nimport ClusterTransformer.ClusterTransformer as cluster_transformer\n\ndef compute_topics(transformer_name):\n    \n    #Instantiate the object\n    ct=cluster_transformer.ClusterTransformer()\n    #Transformer model for inference\n    model_name=transformer_name\n    \n    #Hyperparameters\n    #Hyperparameters for model inference\n    batch_size=500\n    max_seq_length=64\n    convert_to_numpy=False\n    normalize_embeddings=False\n    \n    #Hyperparameters for Agglomerative clustering\n    neighborhood_min_size=3\n    cutoff_threshold=0.95\n    #Hyperparameters for K means clustering\n    kmeans_max_iter=100\n    kmeans_random_state=42\n    kmeans_no_clusters=8\n    \n    #Sub input data list (merged_set is the full list of input sentences, loaded beforehand)\n    sub_merged_sent=merged_set[:200]\n    #Transformer embeddings from the chosen model\n    embeddings=ct.model_inference(sub_merged_sent,batch_size,model_name,max_seq_length,normalize_embeddings,convert_to_numpy)\n    #Hierarchical agglomerative detection\n    output_dict=ct.neighborhood_detection(sub_merged_sent,embeddings,cutoff_threshold,neighborhood_min_size)\n    #Kmeans detection\n    output_kmeans_dict=ct.kmeans_detection(sub_merged_sent,embeddings,kmeans_no_clusters,kmeans_max_iter,kmeans_random_state)\n    #Agglomerative clusters as a dataframe\n    neighborhood_detection_df=ct.convert_to_df(output_dict)\n    #KMeans clusters as a dataframe\n    kmeans_df=ct.convert_to_df(output_kmeans_dict)\n    return 
neighborhood_detection_df,kmeans_df \n```\n\nCalling the driver code:\n\n```python\n%%time\nimport matplotlib.pyplot as plt\nimport numpy as np\nn_df,k_df=compute_topics('bert-large-uncased')\n#Count the texts assigned to each cluster\nkg_df=k_df.groupby('Cluster').agg({'Text':'count'}).reset_index()\nng_df=n_df.groupby('Cluster').agg({'Text':'count'}).reset_index()\n\n#Plotting\nfig,(ax1,ax2)=plt.subplots(1,2,figsize=(15,5))\n#Random marker sizes for the scatter plots\nrng = np.random.RandomState(0)\ns=1000*rng.rand(len(kg_df['Text']))\ns1=1000*rng.rand(len(ng_df['Text']))\nax1.scatter(kg_df['Cluster'],kg_df['Text'],s=s,c=kg_df['Cluster'],alpha=0.3)\nax1.set_title('Kmeans clustering')\nax1.set_xlabel('No of clusters')\nax1.set_ylabel('No of topics')\nax2.scatter(ng_df['Cluster'],ng_df['Text'],s=s1,c=ng_df['Cluster'],alpha=0.3)\nax2.set_title('Agglomerative clustering')\nax2.set_xlabel('No of clusters')\nax2.set_ylabel('No of topics')\nplt.show()\n```\n\n\n## Samples\n\n\n[Colab-Demo](https://colab.research.google.com/drive/18HAoATFfuXGAGzPcOhWgZa0a9B6yOpKK?usp=sharing)\n\n\n[Colab-Demo](https://colab.research.google.com/drive/1sLhuHiUqAUHgsbovA6-kiTaLfwy8QzSn?usp=sharing)\n\n\n[Kaggle Notebook](https://www.kaggle.com/abhilash1910/clustertransformer-topic-modelling-in-transformers/)\n\n\n[Quantum Stat Repository](https://index.quantumstat.com/#clustertransformer)\n\n\n### Images\n\n\u003cimg src=\"https://i.imgur.com/Fjm01Ca.png\"\u003e\n\n\nCluster images (created with Facebook BART)\n\n\n\u003cimg src=\"https://i.imgur.com/y9Oc5XW.png\"\u003e\n\n\n## Contributing\n\nPull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhilash1910%2Fclustertransformer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhilash1910%2Fclustertransformer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhilash1910%2Fclustertransformer/lists"}