Projects in Awesome Lists tagged with unstructured-data
A curated list of projects in awesome lists tagged with unstructured-data.
https://github.com/iterative/dvc
🦉 Data Versioning and ML Experiments
ai data-science data-version-control developer-tools machine-learning reproducibility unstructured-data
Last synced: 12 May 2025
https://github.com/voxel51/fiftyone
Refine high-quality datasets and visual AI models
active-learning artificial-intelligence computer-vision data-centric-ai data-cleaning data-curation data-quality data-science deep-learning developer-tools image-classification machine-learning object-detection python unstructured-data vector-search visualization
Last synced: 12 May 2025
https://github.com/neo4j-labs/llm-graph-builder
Neo4j graph construction from unstructured data using LLMs
data-import genai graph graph-rag graph-search graphdb graphrag knowledge-graph langchain neo4j rag unstructured-data vectordb
Last synced: 13 May 2025
https://github.com/towhee-io/towhee
Towhee is a framework that is dedicated to making neural data processing pipelines simple and fast.
computer-vision convolutional-networks embedding-vectors embeddings feature-extraction feature-vector image-processing image-retrieval llm machine-learning milvus pipeline towhee transformer unstructured-data video-processing vision-transformer vit
Last synced: 13 May 2025
https://github.com/Zipstack/unstract
No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
etl-pipeline llm-platform unstructured-data
Last synced: 07 Apr 2025
https://github.com/instill-ai/instill-core
🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications
ai api cli developer-tools etl generative-ai golang gpt hacktoberfest llm low-code no-code open-source pipeline python stable-diffusion typescript unstructured-data
Last synced: 14 May 2025
https://github.com/milvus-io/bootcamp
Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.
audio-search benchmark-testing deep-learning hacktoberfest image-classification image-recognition image-search milvus nlp python question-answering unstructured-data
Last synced: 14 May 2025
https://github.com/nomic-ai/nomic
Interact, analyze and structure massive text, image, embedding, audio and video datasets
clustering duplicate-detection embeddings python text topic-modeling unstructured-data
Last synced: 13 May 2025
https://github.com/lotus-data/lotus
Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing that's as simple as writing Pandas code
ai-data-processing data llm llm-data-processing llm-document-processing pandas python semantic-operators semantic-search unstructured-data
Last synced: 19 Oct 2025
https://github.com/renumics/spotlight
Interactively explore unstructured datasets from your dataframe.
audio computer-vision data-centric-ai data-curation data-visualization exploratory-data-analysis hacktoberfest images machine-learning meshes timeseries unstructured-data video
Last synced: 14 May 2025
https://github.com/databricks/lilac
Curate better data for LLMs
artificial-intelligence data-analysis dataset-analysis unstructured-data
Last synced: 10 Mar 2025
https://github.com/JSv4/OpenContracts
Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!
agent agentic-ai etl etl-pipeline llm prompt-engineering unstructured-data vector-database
Last synced: 08 May 2025
https://github.com/nuclia/nucliadb
NucliaDB, The AI Search database for RAG
ai-powered-search database language-model machine-learning mlops nuclia python rust search search-engine search-engines semantic semantic-search-engine text-classification unstructured-data vector-search vector-search-engine vectors
Last synced: 14 May 2025
https://github.com/EulerSearch/embedding_studio
Embedding Studio is a framework that allows you to transform your vector database into a feature-rich search engine.
embeddings embeddings-similarity fine-tuning llm-inference query-parser search-algorithm search-engine search-query-parser semantic-similarity unstructured-data unstructured-search vector-database
Last synced: 06 Aug 2025
https://github.com/shcherbak-ai/contextgem
ContextGem: Effortless LLM extraction from documents
ai contract-analysis data-extraction document-intelligence docx docx2md docx2txt generative-ai legaltech llm llm-extraction llm-framework llm-pipeline llms nlp prompt-engineering text-analysis unstructured-data
Last synced: 13 May 2025
https://github.com/harishdeivanayagam/rowfill
Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers
document document-extraction document-parsing image-ocr langgraph llama llm nextjs ocr ocr-javascript ollama openai pdf pdfs unstructured unstructured-data vision vision-api
Last synced: 13 Apr 2025
https://github.com/RelevanceAI/relevanceai
Home of the AI workforce - Multi-agent system, AI agents & tools
clustering computer-vision embeddings natural-language-processing nlp python search search-engine unstructured-data vector-database vector-search
Last synced: 26 Aug 2025
https://github.com/velocitybolt/open-extract
Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.
agent-tools ai autogen context-aware context-aware-structured-outputs crewai etl etl-automation etl-framework langchain langgraph llm openai python rag structured-outputs unstructured-data
Last synced: 03 Apr 2025
https://github.com/jostmey/dkm
Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features
dkm genomics machine-learning nonconforming-data repertoire statistical-classifiers tcell-receptors unstructured-data
Last synced: 18 Mar 2025
https://github.com/graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
claude content-extraction content-ingestion data-collection llm-tools mcp-server model-context-protocol search-api unstructured-data web-crawler web-scraping
Last synced: 12 Oct 2025
https://github.com/BartJongejan/Bracmat
Programming language for symbolic computation with an unusual combination of pattern matching features: tree patterns, associative patterns, and expressions embedded in patterns.
bignumbers computer-algebra differentiation epoc expression-evaluator gcc high-level-language html json language-technology natural-language-processing pattern-matching programming-language rosettacode semi-structured-data structured-data symbolic-computation tree-structure unstructured-data xml
Last synced: 10 May 2025
https://github.com/scrapegraphai/scrapontologies
Python library for extracting entities, relationships, and schemas from documents
ai automatic-ontologies db-schema documents documentsearch extract-schema knowledge-graph library llm ontologies structured-data unstructured-data
Last synced: 03 Sep 2025
https://github.com/osllmai/indox
Indox is an advanced search and retrieval technique that efficiently extracts data from diverse document types, including PDFs and HTML, using online or offline large language models such as OpenAI, Hugging Face, etc.
ai document index llm ml rag structured-data unstructured-data
Last synced: 10 Apr 2025
https://github.com/tuanacelik/unstructuredio-haystack
💙 Unstructured Data Connectors for Haystack 2.0
haystack llm nlp python unstructured-data
Last synced: 06 May 2025
https://github.com/nicbet/infozilla
The infoZilla unstructured software engineering data mining tool. It can find and extract source code regions, patches, stack traces, enumerations and itemizations from discussion threads.
bugreport bugzilla data-mining data-science tools unstructured-data
Last synced: 13 Oct 2025
https://github.com/sachinkalsi/html_tag_annotator
A machine learning tool for creating training datasets quickly and easily using a smart Chrome extension
annotations chrome-extension generate-training-data harvest html-tag-annotation html-text-annotator machine-learning scraper text-annotation train-dataset unstructured-data
Last synced: 28 Oct 2025
https://github.com/aclai-lab/soledata.jl
Manage logical datasets!
machine-learning multimodal-data unstructured-data
Last synced: 05 Apr 2025
https://github.com/mkearney/wibble
Web Data Frames
data-frames html r r-package rstats tbl tibble unstructured-data web-data web-scraping wrangling xml
Last synced: 12 Apr 2025
https://github.com/chaitjo/knowledge-graphs
Building Knowledge Graphs from Unstructured Text
knowledge-graph networkx neuralcoref spacy unstructured-data wikipedia
Last synced: 17 Aug 2025
https://github.com/kennethleungty/langextract-gemma-structured-extraction
Using LangExtract and Gemma 3 for structured information extraction from unstructured text in insurance policies
artificial-intelligence data-science deep-learning gemini gemma gemma3-4b google langextract large-language-models llm llms machine-learning openai structured-data unstructured-data
Last synced: 03 Sep 2025
https://github.com/rririanto/unstructured-demo-streamlit
Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via streamlit
ai data data-extraction gpt unstructured unstructured-data
Last synced: 09 Apr 2025
https://github.com/moindalvs/resume_screening_and_parser
Business objective: The document classification solution should significantly reduce manual effort in HRM and achieve a higher level of accuracy and automation with minimal human intervention. Sample data set: resumes and financial documents.
data-science doc2txt doc2vec docx-converter docx-to-pdf docx2txt pdf-document-processor pdf2txt streamlit text text-analysis text-classification text-mining text-processing unstructured-data
Last synced: 23 Apr 2025
https://github.com/clarifai/clarifai-python-datautils
Extract, transform, and load unstructured data into Clarifai's AI platform
dataanalysis dataengineering ingestion ingestion-pipeline unstructured-data unstructured-data-analysis unstructured-image unstructured-text
Last synced: 18 Oct 2025
https://github.com/ntdls/katzebase
ACID compliant document-based database engine with SQL language, APIs and Management UI.
database json nosql rdbms unstructured-data
Last synced: 14 Apr 2025
https://github.com/hupe1980/go-textractor
📄 Amazon Textract response parser written in Go.
amazon aws golang parser textract unstructured-data
Last synced: 16 Apr 2025
https://github.com/kaloslazo/pyfusedb
Database system that combines structured data retrieval through inverted indexes with unstructured data (images, audio) search using multidimensional vector embeddings, all within a unified platform.
database inverted-index multidimensional python structures unstructured-data vector-embeddings
Last synced: 09 Aug 2025
https://github.com/frocode/realtime_streaming_unstructured-data
Real-time streaming and processing of unstructured data (Spark, Airflow)
airflow cicd data-devops data-engineering data-platform iac-terraform iot-platform jenkins platform-deployment spark-streaming streaming-data unstructured-data
Last synced: 26 Jul 2025
https://github.com/b-cubed-eu/comp-unstructured-data
Scripts to explore the conditions that determine the reliability of models, trends and status by comparing aggregated cubes with structured monitoring schemes
data-cubes data-quality r rstats structured-data unstructured-data
Last synced: 01 Apr 2025
https://github.com/yeisonmontoya1815/special-topics-in-data-analytics
In my PDD Data Analytics studies at Douglas College, the Special Topics course stands out as a crucial component. This specialized module delves into advanced aspects of data analysis beyond the core curriculum, offering a deep exploration of intricate domains. Through this focused study, I aim to enhance my proficiency in handling complex datasets
analytics data-science jupyter-notebook python structured-data unstructured-data
Last synced: 07 Aug 2025
https://github.com/alexandrelamarre/fission
Data analytics & Structured streaming optimized for the Edge
data-analysis data-engineering rust structured-data unstructured-data
Last synced: 28 Feb 2025
https://github.com/rosette-api-community/rosette-for-docs
Google Docs add-on that lets users extract entities, translate names, and research entities on Wikipedia from within their multilingual document.
entities entity-extraction extract-entities language machine-learning name-translation natural-language-processing nlp text-analytics unstructured-data
Last synced: 28 Feb 2025
https://github.com/thehousummer233/wikipedia-ai-agent
Wikipedia AI agent research assistant. LangChain's LangGraph's ReAct agent architecture, LLMs (OpenAI, Anthropic, Google), Wikipedia API, RAG with FAISS vector db, semantic chunking, GraphRAG, Streamlit frontend, terminal and web interfaces
claude deep-learning gemini large-language-model llama3-8b lm-studio nextjs notion-api openai python redis unstructured-data vector-database yfinance
Last synced: 03 Sep 2025
https://github.com/faisalman/re-parse-js
Compose structured data from unstructured text using regex-based pattern matching
parsing-text pattern-matching unstructured-data
Last synced: 18 Mar 2025
https://github.com/shivabajelan/uploading_file_to_azure_blob_using_python
In this repository, I show how to automate uploading unstructured data, such as PDF or PNG files, to Azure Blob using Python.
azure blob-storage cloud python storage-account unstructured-data upload-file
Last synced: 11 Jul 2025
https://github.com/mazzasaverio/terra-text-processor
A Terraform setup for processing unstructured data on GCP with MongoDB Atlas and Confluent Kafka, featuring serverless, event-driven architecture and Cloud Run integrations.
event-driven gcp iaas kafka mongodb-atlas terraform unstructured-data
Last synced: 11 Sep 2025
https://github.com/tinaland101/uk-food-directory-project
The core of this project is analyzing data from the UK Food Standards Agency, which includes food hygiene ratings of various establishments across the UK. Results are selected based on these ratings to surface popular food choices.
mongodb nosql-database pymongo-database unstructured-data
Last synced: 03 Mar 2025
https://github.com/esteininger/file-processor
A Python library that uses AI to convert unstructured files (like PDFs, HTML, etc.) into structured data.
Last synced: 21 Jul 2025
https://github.com/francois-lenne/elt-mp4-quiberon
The goal of this project is to retrieve video from the municipality of Quiberon and detect whether or not a person appears in it
bigquery cicd data-engineering docker elt google-cloud-functions google-cloud-platform google-cloud-run google-cloud-storage pipeline python sql unstructured-data
Last synced: 14 Jun 2025
https://github.com/spoortimorabad/personally-identifiable-information-pii-
Detecting Personal Information and Masking Method
faker llms numpy pandas pre-trained-model python streamlit structured-data supabase torch unstructured-data
Last synced: 05 Nov 2025
https://github.com/airdac/mud
Subject repository with NLP Python apps. UPC - Master's Degree in Data Science - Mining Unstructured Data - Spring 2024
natural-language-processing nlp python unstructured-data upc
Last synced: 04 Mar 2025
https://github.com/b-cubed-eu/rsa-unstructured-data-comp
Scripts that compare aggregated cubes with structured monitoring schemes in South Africa
data-cubes data-quality r structured-data unstructured-data
Last synced: 02 Jul 2025
https://github.com/pintamonas4575/gestbd-project-maadm-upm
Project for the "Gestión de sistemas de datos masivos" (Massive Data Systems Management) course in a master's program at UPM.
elasticsearch f1 formula1 linked-data postgresql structured-data unstructured-data
Last synced: 08 Apr 2025
https://github.com/teragrep/rsm_01
Teragrep record schema mapper library for Java
data data-mining data-science datascience java-library liblognorm log-analysis log-management schema-mapper structured-data structured-logging teragrep unstructured-data
Last synced: 22 Apr 2025
https://github.com/teragrep/blf_01
Tokenizer for Teragrep
java teragrep tokenization tokenizer unstructured-data
Last synced: 22 Apr 2025
https://github.com/davidmoserai/azuredocumentintelligencechunker
A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.
agent agents azure azure-ai-document-intelligence azure-ai-search chunking document-chunking langchain layout-parser layout-parsing llm production-grade python rag react react-pdf-viewer retrieval-augmented-generation unstructured-data
Last synced: 23 Apr 2025
https://github.com/wasay8/automatedgarbageimageclassifier
Implementation of CNN models (ResNet-34 and ResNet-50) to classify garbage images into 6 major categories for sustainable development and disposal.
computer-vision deep-learning deep-neural-networks feature-engineering image-processing unstructured-data
Last synced: 26 Mar 2025
https://github.com/katelynfaulkner/rsa-unstructured-data-comp
Scripts that compare aggregated cubes with structured monitoring schemes in South Africa
data-cubes data-quality r structured-data unstructured-data
Last synced: 02 Mar 2025
https://github.com/instill-ai/controller-vdp
🎮 controller-vdp manages components in Instill VDP
api-first data-connector go golang grpc hacktoberfest integration low-code rest structured-data unstructured-data
Last synced: 05 Oct 2025
https://github.com/teragrep/dpf_03
Teragrep Tokenizer for Apache Spark
apache-spark bloom-filter bloomfilter spark teragrep tokenization tokenizer unstructured-data
Last synced: 09 Oct 2025