{"id":21170202,"url":"https://github.com/dcarpintero/taxonomy-completion","last_synced_at":"2026-02-01T03:01:17.582Z","repository":{"id":248871736,"uuid":"823639371","full_name":"dcarpintero/taxonomy-completion","owner":"dcarpintero","description":"Taxonomy Completion with Embedding Quantization and an LLM-based Pipeline: A Case Study in Computational Linguistics","archived":false,"fork":false,"pushed_at":"2024-07-22T10:10:06.000Z","size":11259,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-13T19:02:42.879Z","etag":null,"topics":["alibi","anthropic-claude","clustering","dimensionality-reduction","embeddings","hdbscan","huggingface","huggingface-transformers","langchain","mistral-7b","natural-language-processing","prompt-engineering","pydantic","python","quantization","scalar-quantization","sentence-transformers","taxonomy-construction","topic-modeling","umap"],"latest_commit_sha":null,"homepage":"https://huggingface.co/blog/dcarpintero/taxonomy-completion","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcarpintero.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-03T12:21:10.000Z","updated_at":"2024-08-28T14:17:16.000Z","dependencies_parsed_at":"2025-01-21T10:43:18.520Z","dependency_job_id":"18aba331-6f7c-46cd-a5a6-5fb5905b99fb","html_url":"https://github.com/dcarpintero/taxonomy-completion","commit_stats":null,"previous_names":["dcarpintero/taxonomy-completion"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dcarpintero/taxonomy-completion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftaxonomy-completion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftaxonomy-completion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftaxonomy-completion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftaxonomy-completion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcarpintero","download_url":"https://codeload.github.com/dcarpintero/taxonomy-completion/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcarpintero%2Ftaxonomy-completion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28965436,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026
-02-01T02:14:24.993Z","status":"ssl_error","status_checked_at":"2026-02-01T02:13:55.706Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alibi","anthropic-claude","clustering","dimensionality-reduction","embeddings","hdbscan","huggingface","huggingface-transformers","langchain","mistral-7b","natural-language-processing","prompt-engineering","pydantic","python","quantization","scalar-quantization","sentence-transformers","taxonomy-construction","topic-modeling","umap"],"created_at":"2024-11-20T15:57:08.128Z","updated_at":"2026-02-01T03:01:17.567Z","avatar_url":"https://github.com/dcarpintero.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Taxonomy Completion with Embedding Quantization and an LLM-based Pipeline: A Case Study in Computational Linguistics\n\n[![GitHub license](https://img.shields.io/github/license/dcarpintero/taxonomy-completion)](https://github.com/dcarpintero/taxonomy-completion/blob/main/LICENSE)\n[![GitHub contributors](https://img.shields.io/github/contributors/dcarpintero/taxonomy-completion.svg)](https://GitHub.com/dcarpintero/taxonomy-completion/graphs/contributors/)\n[![GitHub issues](https://img.shields.io/github/issues/dcarpintero/taxonomy-completion.svg)](https://GitHub.com/dcarpintero/taxonomy-completion/issues/)\n[![GitHub pull-requests](https://img.shields.io/github/issues-pr/dcarpintero/taxonomy-completion.svg)](https://GitHub.com/dcarpintero/taxonomy-completion/pulls/)\n[![PRs 
Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dcarpintero/taxonomy-completion/blob/main/nb.taxonomy-completion-with-embedding-quantization-and-llms.ipynb)\n\n[![GitHub watchers](https://img.shields.io/github/watchers/dcarpintero/taxonomy-completion.svg?style=social\u0026label=Watch)](https://GitHub.com/dcarpintero/taxonomy-completion/watchers/)\n[![GitHub forks](https://img.shields.io/github/forks/dcarpintero/taxonomy-completion.svg?style=social\u0026label=Fork)](https://GitHub.com/dcarpintero/taxonomy-completion/network/)\n[![GitHub stars](https://img.shields.io/github/stars/dcarpintero/taxonomy-completion.svg?style=social\u0026label=Star)](https://GitHub.com/dcarpintero/taxonomy-completion/stargazers/)\n\n## Introduction\n\nThe ever-growing volume of research publications necessitates efficient methods for structuring academic knowledge. This task typically involves developing a supervised underlying scheme of classes and allocating publications to the most relevant class. In this article, we implement an end-to-end automated solution using embedding quantization and a Large Language Model (LLM) pipeline.  
Our case study starts with a dataset of [25,000 arXiv publications](https://huggingface.co/datasets/dcarpintero/arxiv.cs.CL.25k) from Computational Linguistics (cs.CL), published before July 2024, which we organize under a novel scheme of classes.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/Kp5GNW9dUKYQZtAKSAm3q.png\"\u003e\n\u003c/p\u003e\n\n## Methodology\n\nOur approach centers on three key tasks: (i) unsupervised clustering of the arXiv dataset into related collections, (ii) discovering the latent thematic structures within each cluster, and (iii) creating a candidate taxonomy scheme based on said thematic structures.\n\nAt its core, the clustering task requires identifying a sufficient number of similar examples within an *unlabeled* dataset.\nThis is a natural task for embeddings, as they capture semantic relationships in a corpus and can be provided as input features to a clustering algorithm for establishing similarity links among examples. We begin by transforming the (*title*:*abstract*) pairs of our dataset into an embedding representation using [Jina-Embeddings-v2](https://arxiv.org/abs/2310.19923), a BERT-ALiBi based attention model, and then apply scalar quantization using both [Sentence Transformers](https://www.sbert.net/) and a custom implementation.\n\nFor clustering, we run [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html) in a reduced dimensional space, comparing the results of the `eom` and `leaf` cluster selection methods. 
Additionally, we examine whether using `(u)int8` embeddings quantization instead of `float32` representations affects this process.\n\nTo uncover latent topics within each cluster of arXiv publications, we combine [LangChain](https://www.langchain.com/) and [Pydantic](https://docs.pydantic.dev/) with [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) (and [GPT-4o](https://platform.openai.com/docs/models/gpt-4o), included for comparison) into an LLM-pipeline. The output is then incorporated into a refined prompt template that guides [Claude Sonnet 3.5](https://docs.anthropic.com/en/docs/welcome) in generating a hierarchical taxonomy.\n\nThe results hint at 35 emerging research topics, wherein each topic comprises at least `100` publications. These are organized within 7 parent classes in the field of Computational Linguistics (cs.CL). This approach may serve as a baseline for automatically generating hierarchical candidate schemes in high-level [arXiv categories](https://arxiv.org/category_taxonomy) and efficiently completing taxonomies, addressing the challenge posed by the increasing volume of academic literature.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/fbCiM9DfjvDFThUQYngIO.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eTaxonomy Completion of Academic Literature with Embedding Quantization and an LLM-Pipeline\u003c/p\u003e\n\n## 1. Embedding Transformation\n\nEmbeddings are numerical representations of real-world objects like text, images, and audio that encapsulate semantic information of the data they represent. 
They are used by AI models to understand complex knowledge domains in downstream applications such as clustering, information retrieval, and semantic understanding tasks, among others.\n\n#### Supporting Large Sequences\n\nWe will map (*title*:*abstract*) pairs from arXiv publications to a 768-dimensional space using [Jina-Embeddings-v2](https://arxiv.org/abs/2310.19923) [1], an open-source text embedding model capable of accommodating up to 8192 tokens. This provides a sufficiently large sequence length for titles, abstracts, and other document sections that might be relevant. To overcome the conventional 512-token limit present in other models, Jina-Embeddings-v2 incorporates bidirectional [ALiBi](https://arxiv.org/abs/2108.12409) [2] into the BERT framework. ALiBi (Attention with Linear Biases) enables input length extrapolation (i.e., sequences exceeding 2048 tokens) by encoding positional information directly within the self-attention layer, instead of introducing positional embeddings. In practice, it biases query-key attention scores with a penalty that is proportional to their distance, favoring stronger mutual attention between proximate tokens.\n\n#### Encoding with Sentence Transformers\n\nThe first step to using the [Jina-Embeddings-v2](https://huggingface.co/jinaai/jina-embeddings-v2-base-en) model is to load it through [Sentence Transformers](https://www.SBERT.net), a framework for accessing state-of-the-art models that is available at the [Hugging Face Hub](https://huggingface.co/models?library=sentence-transformers\u0026sort=downloads):\n\n```python\nfrom sentence_transformers import SentenceTransformer\nmodel = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)\n```\n\nWe now encode (*title*:*abstract*) pairs of our [dataset](https://huggingface.co/datasets/dcarpintero/arxiv.cs.CL.25k) using `batch_size = 64`. 
This allows for parallel computation on hardware accelerators like GPUs (albeit at the cost of requiring more memory):\n\n```python\nfrom datasets import load_dataset\nds = load_dataset(\"dcarpintero/arxiv.cs.CL.25k\", split=\"train\")\n\ncorpus = [title + ':' + abstract for title, abstract in zip(ds['title'], ds['abstract'])]\nf32_embeddings = model.encode(corpus,\n                              batch_size=64,\n                              show_progress_bar=True)\n```\n\n#### Computing Semantic Similarity\n\nThe semantic similarity between two texts can now be computed as the inner product of their embeddings. In the following heat map, each entry [x, y] is colored based on said embeddings product for exemplary '*title*' sentences [x] and [y].\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/4djmELIe2LkZ8_Tofc91Q.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eSemantic Similarity in \u003cem\u003ecs.CL arXiv-titles\u003c/em\u003e using Embeddings\u003c/p\u003e\n\n## 2. Embedding Quantization for Memory Saving\n\nScaling up embeddings can be challenging. Currently, state-of-the-art models represent each embedding dimension as a `float32` value, which requires 4 bytes of memory. 
Given that [Jina-Embeddings-v2](https://arxiv.org/abs/2310.19923) maps text to a 768-dimensional space, the memory requirements for our dataset would be around 73 MB, without indexes and other metadata related to the publication records:\n\n```python\n25,000 embeddings * 768 dimensions/embedding * 4 bytes/dimension = 76,800,000 bytes\n76,800,000 bytes / (1024^2) ≈ 73.24 MB\n```\n\nHowever, working with a larger dataset might significantly increase the memory requirements and associated costs:\n\n| Embedding\u003cbr\u003eDimension | Embedding\u003cbr\u003eModel            | 2.5M\u003cbr\u003eArXiv Abstracts      | 60.9M\u003cbr\u003eWikipedia Pages | 100M\u003cbr\u003eEmbeddings |\n|------------------------|-------------------------------|------------------------------|-----------------------|------------------------------|\n| 384                    | all-MiniLM-L12-v2             | 3.57 GB                      | 85.26 GB              | 142.88 GB                    |\n| 768                    | all-mpnet-base-v2             | 7.15 GB                      | 170.52 GB             | 285.76 GB                    |\n| 768                    | jina-embeddings-v2            | 7.15 GB                      | 170.52 GB             | 285.76 GB                    |\n| 1536                   | openai-text-embedding-3-small | 14.31 GB                     | 341.04 GB             | 571.53 GB                    |\n| 3072                   | openai-text-embedding-3-large | 28.61 GB                     | 682.08 GB             | 1.143 TB                   |\n\nA technique used to achieve memory saving is *Quantization*. The intuition behind this approach is that we can discretize floating-point values by mapping their range [`f_min`, `f_max`] into a smaller range of fixed-point numbers [`q_min`, `q_max`], and linearly distributing all values between these ranges. 
In practice, this typically reduces the precision of a 32-bit floating-point to lower bit widths like 8-bits (scalar quantization) or 1-bit values (binary quantization).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/8PF8uD8wgk12Uuejddhnw.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eScalar Embedding Quantization - from \u003cem\u003efloat32\u003c/em\u003e to \u003cem\u003e(u)int8\u003c/em\u003e\u003c/p\u003e\n\nBy plotting the frequency distribution of the *Jina-generated* embeddings, we observe that the values are indeed concentrated around a relatively narrow range [-2.0, +2.0]. This means we can effectively map `float32` values to 256 `(u)int8` buckets without significant loss of information:\n\n```python\nimport matplotlib.pyplot as plt\n\nplt.hist(f32_embeddings.flatten(), bins=250, edgecolor='C0')\nplt.xlabel('float-32 jina-embeddings-v2')\nplt.title('distribution')\nplt.show()\n```\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/Cx578eTvr8z3cj7yX7Nn5.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eOriginal \u003cem\u003efloat32 jina-embeddings-v2\u003c/em\u003e distribution\u003c/p\u003e\n\nWe can calculate the exact `[min, max]` values of the distribution:\n\n```python\n\u003e\u003e\u003e np.min(f32_embeddings), np.max(f32_embeddings)\n(-2.0162134, 2.074683)\n```\n\nThe first step to implementing scalar quantization is to define a calibration set of embeddings. A typical starting point is a subset of 10k embeddings, which in our case would cover nearly 99.98% of the original `float32` embedding values. 
The use of calibration is intended to obtain representative `f_min` and `f_max` values along each dimension to reduce the computational overhead and potential issues caused by outliers that might appear in larger datasets.\n\n```python\ndef calibration_accuracy(embeddings: np.ndarray, k: int = 10000) -\u003e float:\n  calibration_embeddings = embeddings[:k]\n  f_min = np.min(calibration_embeddings, axis=0)\n  f_max = np.max(calibration_embeddings, axis=0)\n\n  # Calculate percentage in range for each dimension\n  size = embeddings.shape[0]\n  avg = []\n  for i in range(embeddings.shape[1]):\n      in_range = np.sum((embeddings[:, i] \u003e= f_min[i]) \u0026 (embeddings[:, i] \u003c= f_max[i]))\n      dim_percentage = (in_range / size) * 100\n      avg.append(dim_percentage)\n\n  return np.mean(avg)\n\nacc = calibration_accuracy(f32_embeddings, k=10000)\nprint(f\"Average percentage of embeddings within [f_min, f_max] calibration: {acc:.5f}%\")\n\u003e\u003e\u003e Average percentage of embeddings within [f_min, f_max] calibration: 99.98636%\n```\n\nThe second and third steps of scalar quantization — *computing scales and zero point*, and *encoding* — can be easily applied with [Sentence Transformers](https://www.sbert.net/), resulting in a 4x memory saving compared to the original `float32` representation. Moreover, we will also benefit from faster arithmetic operations since matrix multiplication can be performed more quickly with integer arithmetic. 
\n\n```python\nfrom sentence_transformers.quantization import quantize_embeddings\n\n# quantization is applied in a post-processing step\nint8_embeddings = quantize_embeddings(\n    np.array(f32_embeddings),\n    precision=\"int8\",\n    calibration_embeddings=np.array(f32_embeddings[:10000]),\n)\n```\n\n```python\nf32_embeddings.dtype, f32_embeddings.shape, f32_embeddings.nbytes\n\u003e\u003e\u003e (dtype('float32'), (25107, 768), 77128704) # 73.5 MB\n\nint8_embeddings.dtype, int8_embeddings.shape, int8_embeddings.nbytes\n\u003e\u003e\u003e (dtype('int8'), (25107, 768), 19282176)    # 18.3 MB\n\n# calculate compression\n(f32_embeddings.nbytes - int8_embeddings.nbytes) / f32_embeddings.nbytes * 100\n\u003e\u003e\u003e 75.0\n```\n\nFor completeness, we implement a scalar quantization method to illustrate those three steps:\n\n```python\ndef scalar_quantize_embeddings(embeddings: np.ndarray,\n                               calibration_embeddings: np.ndarray) -\u003e np.ndarray:\n\n    # Step 1: Calculate [f_min, f_max] per dimension from the calibration set \n    f_min = np.min(calibration_embeddings, axis=0)\n    f_max = np.max(calibration_embeddings, axis=0)\n\n    # Step 2: Map [f_min, f_max] to [q_min, q_max] =\u003e (scaling factors, zero point)\n    q_min = 0\n    q_max = 255\n    scales = (f_max - f_min) / (q_max - q_min)\n    zero_point = 0 # uint8 quantization maps inherently min_values to zero\n\n    # Step 3: encode (scale, round)\n    quantized_embeddings = ((embeddings - f_min) / scales).astype(np.uint8)\n\n    return quantized_embeddings\n```\n\n```python\ncalibration_embeddings = f32_embeddings[:10000]\nbeta_uint8_embeddings = scalar_quantize_embeddings(f32_embeddings, calibration_embeddings)\n```\n\n```python\nbeta_uint8_embeddings[5000][64:128].reshape(8, 8)\n\narray([[187, 111,  96, 128, 116, 129, 130, 122],\n       [132, 153,  72, 136,  94, 120, 112,  93],\n       [143, 121, 137, 143, 195, 159,  90,  93],\n       [178, 189, 143,  99,  99, 151,  93, 
102],\n       [179, 104, 146, 150, 176,  94, 148, 118],\n       [161, 138,  90, 122,  93, 146, 140, 129],\n       [121, 115, 153, 118, 107,  45,  70, 171],\n       [207,  53,  67, 115, 223, 105, 124, 158]], dtype=uint8)\n```\n\n\nWe will continue with the version of the embeddings that have been quantized using Sentence Transformers (our custom implementation is also included in the results analysis):\n\n```python\n# `f32_embeddings` =\u003e if you prefer not to use quantization\n# `beta_uint8_embeddings` =\u003e to check our custom implementation\nembeddings = int8_embeddings \n```\n\n## 3. Projecting Embeddings for Dimensionality Reduction\n\nIn this section, we perform a two-stage projection of the (*title*:*abstract*) embeddings from their original high-dimensional space (768) to lower dimensions, namely:\n- `5 dimensions` for reducing computational complexity during clustering, and \n- `2 dimensions` for enabling visual representation in `(x, y)` coordinates.\n\nFor both projections, we employ [UMAP](https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction#Uniform_manifold_approximation_and_projection) [3], a popular dimensionality reduction technique known for its effectiveness in preserving both the local and global data structures. 
In practice, this makes it a preferred choice for handling complex datasets with high-dimensional embeddings:\n\n```python\nimport umap\n\nembedding_5d = umap.UMAP(n_neighbors=100, # consider 100 nearest neighbors for each point\n                         n_components=5,  # reduce embedding space from 768 to 5 dimensions\n                         min_dist=0.1,    # maintain local and global balance\n                         metric='cosine').fit_transform(embeddings)\n\nembedding_2d = umap.UMAP(n_neighbors=100,\n                         n_components=2,\n                         min_dist=0.1,\n                         metric='cosine').fit_transform(embeddings)\n```\n\nNote that when we apply HDBSCAN clustering in the next step, the clusters found will be influenced by how UMAP preserved the local structures. A smaller `n_neighbors` value means UMAP will focus more on local structures, whereas a larger value allows capturing more global representations, which might be beneficial for understanding overall patterns in the data.\n\n## 4. Semantic Clustering\n\nThe reduced (*title*:*abstract*) embeddings can now be used as input features of a clustering algorithm, enabling the identification of related categories based on embedding distances.\n\nWe have opted for [HDBSCAN](https://en.wikipedia.org/wiki/HDBSCAN) (Hierarchical Density-Based Spatial Clustering of Applications with Noise) [4], an advanced clustering algorithm that extends DBSCAN by adapting to varying density clusters. Unlike K-Means which requires pre-specifying the number of clusters, HDBSCAN has only one important hyperparameter, `n`, which establishes the minimum number of examples to include in a cluster. \n\nHDBSCAN works by first transforming the data space according to the density of the data points, making denser regions (areas where data points are close together in high numbers) more attractive for cluster formation. 
The algorithm then builds a hierarchy of clusters based on the minimum cluster size established by the hyperparameter `n`. This allows it to distinguish between noise (sparse areas) and dense regions (potential clusters). Finally, HDBSCAN condenses this hierarchy to derive the most persistent clusters, identifying clusters of different densities and shapes. As a density-based method, it can also detect outliers.\n\n```python\nimport hdbscan\n\nhdbs = hdbscan.HDBSCAN(min_cluster_size=100,            # conservative clusters' size\n                       metric='euclidean',              # points distance metric\n                       cluster_selection_method='leaf') # favour fine grained clustering\nclusters = hdbs.fit_predict(embedding_5d)               # apply HDBSCAN on reduced UMAP\n```\n\nThe `cluster_selection_method` determines how HDBSCAN selects flat clusters from the tree hierarchy. In our case, using `eom` (Excess of Mass) cluster selection method in combination with embedding quantization tended to create a few larger, less specific clusters. These clusters would have required a further *reclustering process* to extract meaningful latent topics. Instead, by switching to the `leaf` selection method, we guided the algorithm to select leaf nodes from the cluster hierarchy, which produced a more fine-grained clustering compared to the Excess of Mass method:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/20_VarYLBZxlND0vtDlLy.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eHDBSCAN \u003cem\u003eeom\u003c/em\u003e \u0026 \u003cem\u003eleaf\u003c/em\u003e clustering method comparison using \u003cem\u003eint8-embedding-quantization\u003c/em\u003e\u003c/p\u003e\n\n## 5. 
Uncovering Latent Topics with an LLM-Pipeline\n\nHaving performed the clustering step, we now illustrate how to infer the latent topic of each cluster by combining an LLM such as [Mistral-7B-Instruct](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) [5] with [Pydantic](https://docs.pydantic.dev/) and [LangChain](https://www.langchain.com/) to create an LLM pipeline that generates output in a composable structured format.\n\n### 5.1 Pydantic Model\n\n[Pydantic Models](https://docs.pydantic.dev/latest/concepts/models/) are classes that derive from `pydantic.BaseModel`, defining fields as type-annotated attributes. They are similar to `Python` dataclasses. However, they have been designed with subtle but significant differences that optimize various operations such as validation, serialization, and `JSON` schema generation. Our `Topic` class defines a single field named `label`. This lets us parse the LLM output into a structured object, rather than a free-form text block, facilitating easier processing and analysis.\n\n```python\nfrom pydantic import BaseModel, Field\n\nclass Topic(BaseModel):\n    \"\"\"\n    Pydantic Model to generate a structured Topic Model\n    \"\"\"\n    label: str = Field(..., description=\"Identified topic\")\n```\n\n### 5.2 LangChain Prompt Template\n\n[LangChain Prompt Templates](https://python.langchain.com/v0.2/docs/concepts/#prompt-templates) are pre-defined recipes for translating user input and parameters into instructions for a language model. We define here the prompt for our intended task:\n\n```python\nfrom langchain_core.prompts import PromptTemplate\n\ntopic_prompt = \"\"\"\n  You are a helpful research assistant. Your task is to analyze a set of research paper\n  titles related to Natural Language Processing, and determine the overarching topic. \n            \n  INSTRUCTIONS:\n\n  1. Based on the titles provided, identify the most relevant topic:\n    - Ensure the topic is concise and clear.\n            \n  2. 
Format Response:\n    - Ensure the title response is in JSON as in the 'OUTPUT FORMAT' section below.\n    - No follow up questions are needed.\n\n  OUTPUT FORMAT:\n\n  {{\"label\": \"Topic Name\"}}\n\n  TITLES:\n  {titles}\n  \"\"\"\n```\n\n### 5.3 Inference Chain using LangChain Expression Language\n\nLet's now compose a topic modeling pipeline using [LangChain Expression Language (LCEL)](https://python.langchain.com/docs/expression_language/) to render our prompt template into LLM input, and parse the inference output as `JSON`:\n\n```python\nimport os\nfrom typing import List\n\nfrom langchain_huggingface import HuggingFaceEndpoint\nfrom langchain_core.output_parsers import PydanticOutputParser\n\ndef TopicModeling(titles: List[str]) -\u003e Topic:\n    \"\"\"\n    Infer the common topic of the given titles w/ LangChain, Pydantic, Mistral-7B\n    \"\"\"\n    repo_id = \"mistralai/Mistral-7B-Instruct-v0.3\"\n    llm = HuggingFaceEndpoint(\n        repo_id=repo_id,\n        temperature=0.2,\n        huggingfacehub_api_token=os.environ[\"HUGGINGFACEHUB_API_TOKEN\"]\n    )\n    prompt = PromptTemplate.from_template(topic_prompt)\n    parser = PydanticOutputParser(pydantic_object=Topic)\n\n    # LCEL: render the prompt, run the LLM, parse the output into a Topic\n    topic_chain = prompt | llm | parser\n    return topic_chain.invoke({\"titles\": titles})\n```\n\nTo enable the model to infer the topic of each cluster, we include a subset of 25 paper titles from each cluster as part of the LLM input:\n\n```python\ntopics = []\nfor i, cluster in df.groupby('cluster'):\n    titles = cluster['title'].sample(25).tolist()\n    topic = TopicModeling(titles)\n    topics.append(topic.label)\n```\n\nLet's assign each arXiv publication to the topic of its corresponding cluster:\n\n```python\n# groupby iterates cluster labels in sorted order, so sorting the unique labels\n# keeps them aligned with `topics` (also if the HDBSCAN noise label -1 is present)\ncluster_ids = sorted(df['cluster'].unique())\n\ntopic_map = dict(zip(cluster_ids, topics))\ndf['topic'] = df['cluster'].map(topic_map)\n```\n\n## 6. 
Generating a Taxonomy\n\nTo create a hierarchical taxonomy, we craft a prompt to guide [Claude Sonnet 3.5](https://docs.anthropic.com/en/docs/welcome) in organizing the identified research topics corresponding to each cluster into a hierarchical scheme:\n\n```python\nfrom langchain_core.prompts import PromptTemplate\n\ntaxonomy_prompt = \"\"\"\n    Create a comprehensive and well-structured taxonomy\n    for the ArXiv cs.CL (Computational Linguistics) category.\n    This taxonomy should organize subtopics in a logical manner.\n\n    INSTRUCTIONS:\n\n    1. Review and Refine Subtopics:\n      - Examine the provided list of subtopics in computational linguistics.\n      - Ensure each subtopic is clearly defined and distinct from others.\n\n    2. Create Definitions:\n      - For each subtopic, provide a concise definition (1-2 sentences).\n\n    3. Develop a Hierarchical Structure:\n      - Group related subtopics into broader categories.\n      - Create a multi-level hierarchy, with top-level categories and nested subcategories.\n      - Ensure that the structure is logical and intuitive for researchers in the field.\n\n    4. Validate and Refine:\n      - Review the entire taxonomy for consistency, completeness, and clarity.\n\n    OUTPUT FORMAT:\n\n    - Present the final taxonomy in a clear, hierarchical format, with:\n\n      . Main categories\n        .. Subcategories\n          ... Individual topics with their definitions\n\n    SUBTOPICS:\n    {taxonomy_subtopics}\n    \"\"\"\n```\n\n## 7. 
Results\n\n### 7.1 Clustering Analysis\n\nLet's create an interactive scatter plot:\n\n```python\nimport altair as alt\n\nchart = alt.Chart(df).mark_circle(size=5).encode(\n    x='x',\n    y='y',\n    color='topic:N',\n    tooltip=['title', 'topic']\n).interactive().properties(\n    title='Clustering and Topic Modeling | 25k arXiv cs.CL publications',\n    width=600,\n    height=400,\n)\nchart.display()\n```\n\nAnd compare the clustering results using `float32` embedding representations and `int8` [Sentence Transformers](https://www.sbert.net/) quantization:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/e8nQw98dKSmLaNAKfx7T4.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eHDBSCAN leaf-clustering using \u003cem\u003efloat32\u003c/em\u003e \u0026 \u003cem\u003equantized-int8\u003c/em\u003e embeddings (sentence-transformers-quantization)\u003c/p\u003e\n\nWe now perform the same comparison with our custom quantization implementation:\n\n\u003cp align=\"center\"\u003e\n  \u003cimg style=\"margin: 0 auto; display: block;\" src=\"https://cdn-uploads.huggingface.co/production/uploads/64a13b68b14ab77f9e3eb061/smL046VV2i4N1ulIRmykw.png\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eHDBSCAN leaf-clustering using \u003cem\u003efloat32\u003c/em\u003e \u0026 \u003cem\u003equantized-uint8\u003c/em\u003e embeddings (custom-quantization-implementation)\u003c/p\u003e\n\nThe clustering results using `float32` and `(u)int8` quantized embeddings show a similar general layout of well-defined clusters, indicating that (i) the HDBSCAN clustering algorithm was effective in both cases, and (ii) the core relationships in the data were maintained after quantization (using Sentence Transformers and our custom implementation).\n\nNotably, embedding quantization resulted in both cases in a slightly more granular clustering (35 
clusters versus 31) that appears to be semantically coherent. Our tentative hypothesis for this difference is that scalar quantization might *paradoxically* guide the HDBSCAN clustering algorithm to separate points that were previously grouped together.\n\nThis could be due to (i) noise (quantization can create small *noisy* variations in the data, which might have a *regularization*-like effect and lead to more sensitive clustering decisions), or (ii) the difference in numerical precision and the resulting alteration of distance calculations (this could amplify certain differences between points that were less pronounced in the `float32` representation). Further investigation would be necessary to fully understand the implications of quantization on clustering.\n\n### 7.2 Taxonomy Scheme\n\nThe entire scheme is available at [cs.CL.taxonomy](https://github.com/dcarpintero/taxonomy-completion/blob/main/arxiv/cs.CL.scheme.md). This approach may serve as a baseline for automatically identifying candidate schemes of classes in high-level [arXiv categories](https://arxiv.org/category_taxonomy). An excerpt:\n\n```\n. Foundations of Language Models\n  .. Model Architectures and Mechanisms\n    ... Transformer Models and Attention Mechanisms\n    ... Large Language Models (LLMs)\n  .. Model Optimization and Efficiency\n    ... Compression and Quantization\n    ... Parameter-Efficient Fine-Tuning\n    ... Knowledge Distillation\n  .. Learning Paradigms\n    ... In-Context Learning\n    ... Instruction Tuning\n\n. AI Ethics, Safety, and Societal Impact\n  .. Ethical Considerations\n    ... Bias and Fairness in Models\n    ... Alignment and Preference Optimization\n  .. Safety and Security\n    ... Hallucination in LLMs\n    ... Adversarial Attacks and Robustness\n    ... Detection of AI-Generated Text\n  .. Social Impact\n    ... Hate Speech and Offensive Language Detection\n    ... 
Fake News Detection\n\n[...]\n```\n\n## Citation\n\n```\n@article{carpintero2024,\n  author = {Diego Carpintero},\n  title = {Taxonomy Completion with Embedding Quantization and an LLM-Pipeline: A Case Study in Computational Linguistics},\n  journal = {Hugging Face Blog},\n  year = {2024},\n  note = {https://huggingface.co/blog/dcarpintero/taxonomy-completion},\n}\n```\n\n## References\n\n- [1] Günther, et al. 2024. *Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents*. [arXiv:2310.19923](https://arxiv.org/abs/2310.19923).\n- [2] Press, et al. 2021. *Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation*. [arXiv:2108.12409](https://arxiv.org/abs/2108.12409).\n- [3] McInnes, et al. 2018. *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction*. [arXiv:1802.03426](https://arxiv.org/abs/1802.03426).\n- [4] Campello, et al. 2013. *Density-Based Clustering Based on Hierarchical Density Estimates*. Advances in Knowledge Discovery and Data Mining. Vol. 7819. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 160–172. [doi:10.1007/978-3-642-37456-2_14](https://link.springer.com/chapter/10.1007/978-3-642-37456-2_14).\n- [5] Jiang, et al. 2023. *Mistral 7B*. [arXiv:2310.06825](https://arxiv.org/abs/2310.06825).\n- [6] Shakir, et al. 2024. *Binary and Scalar Embedding Quantization for Significantly Faster \u0026 Cheaper Retrieval*. [hf:shakir-embedding-quantization](https://huggingface.co/blog/embedding-quantization).\n- [7] Liu, et al. 2024. *Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents*. 
[arXiv:2405.10467](https://arxiv.org/abs/2405.10467).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Ftaxonomy-completion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcarpintero%2Ftaxonomy-completion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcarpintero%2Ftaxonomy-completion/lists"}