{"id":17646558,"url":"https://github.com/databricks/lilac","last_synced_at":"2025-03-10T18:31:28.715Z","repository":{"id":185412238,"uuid":"618149681","full_name":"databricks/lilac","owner":"databricks","description":"Curate better data for LLMs","archived":false,"fork":false,"pushed_at":"2024-03-19T12:41:30.000Z","size":38834,"stargazers_count":1016,"open_issues_count":88,"forks_count":97,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-03-07T04:47:19.399Z","etag":null,"topics":["artificial-intelligence","data-analysis","dataset-analysis","unstructured-data"],"latest_commit_sha":null,"homepage":"http://lilacml.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databricks.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-23T21:19:10.000Z","updated_at":"2025-03-05T05:36:34.000Z","dependencies_parsed_at":"2023-09-28T22:07:41.443Z","dependency_job_id":"496771a2-c442-4a22-b394-dd2abc61df19","html_url":"https://github.com/databricks/lilac","commit_stats":null,"previous_names":["lilacai/lilac","databricks/lilac"],"tags_count":52,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Flilac","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Flilac/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Flilac/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databricks%2Flilac/manifests","owner_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/databricks","download_url":"https://codeload.github.com/databricks/lilac/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242902609,"owners_count":20204130,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","data-analysis","dataset-analysis","unstructured-data"],"created_at":"2024-10-23T11:01:59.084Z","updated_at":"2025-03-10T18:31:28.707Z","avatar_url":"https://github.com/databricks.png","language":"Python","readme":"\u003ch1 align=\"center\"\u003eLilac\u003c/h1\u003e\n\u003ch3 align=\"center\" style=\"font-size: 20px; margin-bottom: 4px\"\u003eBetter data, better AI\u003c/h3\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca style=\"padding: 4px;\"  href=\"https://lilacai-lilac.hf.space/\"\u003e\n    \u003cspan style=\"margin-right: 4px; font-size: 12px\"\u003e🔗\u003c/span\u003e \u003cspan style=\"font-size: 14px\"\u003eTry the Lilac web demo!\u003c/span\u003e\n  \u003c/a\u003e\n  \u003cbr/\u003e\u003cbr/\u003e\n  \u003ca href=\"https://lilacml.com/\"\u003e\n        \u003cimg alt=\"Site\" src=\"https://img.shields.io/badge/Site-lilacml.com-ed2dd0?link=https%3A%2F%2Flilacml.com\"/\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://discord.gg/jNzw9mC8pp\"\u003e\n        \u003cimg alt=\"Discord\" src=\"https://img.shields.io/discord/1135996772280451153?label=Join%20Discord\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/lilacai/lilac/blob/main/LICENSE\"\u003e\n          \u003cimg alt=\"License Apache 2.0\" 
src=\"https://img.shields.io/badge/License-Apache 2.0-blue.svg?style=flat\u0026color=ed2dd0\" height=\"20\" width=\"auto\"\u003e\n    \u003c/a\u003e\n    \u003cbr/\u003e\n    \u003ca href=\"https://github.com/lilacai/lilac\"\u003e\n      \u003cimg src=\"https://img.shields.io/github/stars/lilacai/lilac?style=social\" /\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://twitter.com/lilac_ai\"\u003e\n      \u003cimg src=\"https://img.shields.io/twitter/follow/lilac_ai\" alt=\"Follow on Twitter\" /\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\nLilac is a tool for exploration, curation and quality control of datasets for training, fine-tuning\nand monitoring LLMs.\n\nLilac is used by companies like [Cohere](https://cohere.com/) and\n[Databricks](https://www.databricks.com/) to visualize, quantify and improve the quality of\npre-training and fine-tuning data.\n\nLilac runs **on-device** using open-source LLMs with a UI and Python API.\n\n## 🆒 New\n\n- [Lilac Garden](https://www.lilacml.com/#garden) is our hosted platform for blazing fast\n  dataset-level computations. 
[Sign up](https://forms.gle/Gz9cpeKJccNar5Lq8) to join the pilot.\n- Cluster \u0026 title millions of documents with the power of LLMs.\n  [Explore and search](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca\u0026query=%7B%7D\u0026viewPivot=true\u0026pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D)\n  over 36,000 clusters of 4.3M documents in OpenOrca.\n\n## Why use Lilac?\n\n- Explore your data interactively with LLM-powered search, filter, clustering and annotation.\n- Curate AI data, applying best practices like removing duplicates, PII and obscure content to\n  reduce dataset size and lower training cost and time.\n- Inspect and collaborate with your team on a single, centralized dataset to improve data quality.\n- Understand how data changes over time.\n\nLilac can offload expensive computations to [Lilac Garden](https://www.lilacml.com/#garden), our\nhosted platform for blazing fast dataset-level computations.\n\n\u003cimg alt=\"image\" src=\"docs/_static/dataset/dataset_cluster_view.png\"\u003e\n\n\u003e See our [3min walkthrough video](https://www.youtube.com/watch?v=RrcvVC3VYzQ)\n\n## 🔥 Getting started\n\n### 💻 Install\n\n```sh\npip install lilac[all]\n```\n\nIf you prefer no local installation, you can duplicate our\n[Spaces demo](https://lilacai-lilac.hf.space/) by following documentation\n[here](https://docs.lilacml.com/deployment/huggingface_spaces.html).\n\nFor more detailed instructions, see our\n[installation guide](https://docs.lilacml.com/getting_started/installation.html).\n\n### 🌐 Start a webserver\n\nStart a Lilac webserver with our `lilac` CLI:\n\n```sh\nlilac start ~/my_project\n```\n\nOr start the Lilac webserver from Python:\n\n```py\nimport lilac as ll\n\nll.start_server(project_dir='~/my_project')\n```\n\nThis will start a webserver at http://localhost:5432/ where you can now load datasets and\nexplore them.\n\n### Lilac 
Garden\n\nLilac Garden is our hosted platform for running dataset-level computations. We utilize powerful GPUs\nto accelerate expensive signals like Clustering, Embedding, and PII.\n[Sign up](https://forms.gle/Gz9cpeKJccNar5Lq8) to join the pilot.\n\n- Cluster and title **a million** data points in **20 mins**\n- Embed your dataset at **half a billion** tokens per min\n- Run your own signal\n\n### 📊 Load data\n\nDatasets can be loaded directly from HuggingFace, Parquet, CSV, JSON,\n[LangSmith from LangChain](https://www.langchain.com/langsmith), SQLite,\n[LlamaHub](https://llamahub.ai/), Pandas, and more. More documentation\n[here](https://docs.lilacml.com/datasets/dataset_load.html).\n\n```python\nimport lilac as ll\n\nll.set_project_dir('~/my_project')\ndataset = ll.from_huggingface('imdb')\n```\n\nIf you prefer, you can load datasets directly from the UI without writing any Python:\n\n\u003cimg width=\"600\" alt=\"image\" src=\"https://github.com/lilacai/lilac/assets/1100749/d5d385ce-f11c-47e6-9c00-ea29983e24f0\"\u003e\n\n### 🔎 Explore\n\n\u003c!-- prettier-ignore --\u003e\n\u003e [!NOTE]\n\u003e 🔗 Explore [OpenOrca](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca) and\n\u003e [its clusters](https://lilacai-lilac.hf.space/datasets#lilac/OpenOrca\u0026query=%7B%7D\u0026viewPivot=true\u0026pivot=%7B%22outerPath%22%3A%5B%22question__cluster%22%2C%22category_title%22%5D%2C%22innerPath%22%3A%5B%22question__cluster%22%2C%22cluster_title%22%5D%7D)\n\u003e before installing!\n\nOnce we've loaded a dataset, we can explore it from the UI and get a sense for what's in the data.\nMore documentation [here](https://docs.lilacml.com/datasets/dataset_explore.html).\n\n\u003cimg alt=\"image\" src=\"docs/_static/dataset/dataset_explore.png\"\u003e\n\n### ✨ Clustering\n\nCluster any text column to get automated dataset insights:\n\n```python\ndataset = ll.get_dataset('local', 'imdb')\ndataset.cluster('text') # add `use_garden=True` to offload to Lilac 
Garden\n```\n\n\u003c!-- prettier-ignore --\u003e\n\u003e [!TIP]\n\u003e Clustering on device can be slow or impractical, especially on machines without a powerful GPU or\n\u003e large memory. Offloading the compute to [Lilac Garden](https://www.lilacml.com/#garden), our\n\u003e hosted data processing platform, can speed up clustering by more than 100x.\n\n\u003cimg alt=\"image\" src=\"docs/_static/dataset/dataset_cluster_view.png\"\u003e\n\n### ⚡ Annotate with Signals (PII, Text Statistics, Language Detection, Neardup, etc.)\n\nAnnotating data with signals will produce another column in your data.\n\n```python\ndataset = ll.get_dataset('local', 'imdb')\ndataset.compute_signal(ll.LangDetectionSignal(), 'text') # Detect language of each doc.\n\n# [PII] Find emails, phone numbers, IP addresses, and secrets.\ndataset.compute_signal(ll.PIISignal(), 'text')\n\n# [Text Statistics] Compute readability scores, number of chars, TTR, non-ASCII chars, etc.\ndataset.compute_signal(ll.TextStatisticsSignal(), 'text')\n\n# [Near Duplicates] Computes clusters based on MinHash LSH.\ndataset.compute_signal(ll.NearDuplicateSignal(), 'text')\n\n# Print the resulting manifest, with the new field added.\nprint(dataset.manifest())\n```\n\nWe can also compute signals from the UI:\n\n\u003cimg width=\"400\" alt=\"image\" src=\"docs/_static/dataset/dataset_compute_signal_modal.png\"\u003e\n\n### 🔎 Search\n\nSemantic and conceptual search require computing an embedding first:\n\n```python\ndataset.compute_embedding('gte-small', path='text')\n```\n\n#### Semantic search\n\nIn the UI, we can search by semantic similarity or by classic keyword search to find chunks of\ndocuments similar to a query:\n\n\u003cimg width=\"600\" alt=\"image\" src=\"https://github.com/lilacai/lilac/assets/1100749/4adb603e-8dca-43a3-a492-fd862e194a5a\"\u003e\n\n\u003cimg width=\"600\" alt=\"image\" src=\"https://github.com/lilacai/lilac/assets/1100749/fdee2127-250b-4e06-9ff9-b1023c03b72f\"\u003e\n\nWe can run the same search in 
Python:\n\n```python\nrows = dataset.select_rows(\n  columns=['text', 'label'],\n  searches=[\n    ll.SemanticSearch(\n      path='text',\n      embedding='gte-small')\n  ],\n  limit=1)\n\nprint(list(rows))\n```\n\n#### Conceptual search\n\nConceptual search is a much more controllable and powerful version of semantic search, where\n\"concepts\" can be taught to Lilac by providing positive and negative examples of that concept.\n\nLilac provides a set of built-in concepts, but you can create your own for very specific use cases.\n\n\u003cimg width=\"600\" alt=\"image\" src=\"https://github.com/lilacai/lilac/assets/1100749/9941024b-7c24-4d87-ae46-925f8da435e1\"\u003e\n\nWe can create a concept in Python with a few examples, and search by it:\n\n```python\ndb = ll.DiskConceptDB()\ndb.create(namespace='local', name='spam')\n# Add examples of spam and not-spam.\ndb.edit('local', 'spam', ll.concepts.ConceptUpdate(\n  insert=[\n    ll.concepts.ExampleIn(label=False, text='This is normal text.'),\n    ll.concepts.ExampleIn(label=True, text='asdgasdgkasd;lkgajsdl'),\n    ll.concepts.ExampleIn(label=True, text='11757578jfdjja')\n  ]\n))\n\n# Search by the spam concept.\nrows = dataset.select_rows(\n  columns=['text', 'label'],\n  searches=[\n    ll.ConceptSearch(\n      path='text',\n      concept_namespace='local',\n      concept_name='spam',\n      embedding='gte-small')\n  ],\n  limit=1)\n\nprint(list(rows))\n```\n\n### 🏷️ Labeling\n\nLilac allows you to label individual points, or slices of data:\n\n\u003cimg width=\"600\" alt=\"image\" src=\"docs/_static/dataset/dataset_add_label_tag.png\"\u003e\n\nWe can also label all data given a filter. In this case, adding the label \"short\" to all text with a\nsmall number of characters. 
This field was produced by the automatic `text_statistics` signal.\n\n\u003cimg width=\"600\" alt=\"image\" src=\"docs/_static/dataset/dataset_add_label_all_short.png\"\u003e\n\nWe can do the same in Python:\n\n```python\ndataset.add_labels(\n  'short',\n  filters=[\n    (('text', 'text_statistics', 'num_characters'), 'less', 1000)\n  ]\n)\n```\n\nLabels can be exported for downstream tasks. Detailed documentation\n[here](https://docs.lilacml.com/datasets/dataset_labels.html).\n\n## 💬 Contact\n\nFor bugs and feature requests, please\n[file an issue on GitHub](https://github.com/lilacai/lilac/issues).\n\nFor general questions, please [visit our Discord](https://discord.com/invite/jNzw9mC8pp).\n","funding_links":[],"categories":["Python","artificial-intelligence","data-analysis"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Flilac","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabricks%2Flilac","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabricks%2Flilac/lists"}