An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with unstructured-data

A curated list of projects in awesome lists tagged with unstructured-data.

https://github.com/Zipstack/unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents

etl-pipeline llm-platform unstructured-data

Last synced: 07 Apr 2025

https://github.com/instill-ai/instill-core

🔮 Instill Core is a full-stack AI infrastructure tool for data, model and pipeline orchestration, designed to streamline every aspect of building versatile AI-first applications

ai api cli developer-tools etl generative-ai golang gpt hacktoberfest llm low-code no-code open-source pipeline python stable-diffusion typescript unstructured-data

Last synced: 14 May 2025

https://github.com/milvus-io/bootcamp

Dealing with all unstructured data, such as reverse image search, audio search, molecular search, video analysis, question and answer systems, NLP, etc.

audio-search benchmark-testing deep-learning hacktoberfest image-classification image-recognition image-search milvus nlp python question-answering unstructured-data

Last synced: 14 May 2025

https://github.com/nomic-ai/nomic

Interact, analyze and structure massive text, image, embedding, audio and video datasets

clustering duplicate-detection embeddings python text topic-modeling unstructured-data

Last synced: 13 May 2025

https://github.com/lotus-data/lotus

Use LOTUS to process all of your datasets with LLMs and embeddings. Enjoy up to 1000x speedups with fast, accurate query processing that's as simple as writing Pandas code

ai-data-processing data llm llm-data-processing llm-document-processing pandas python semantic-operators semantic-search unstructured-data

Last synced: 19 Oct 2025

https://github.com/JSv4/OpenContracts

Enterprise-grade and API-first LLM workspace for unstructured documents, including data extraction, redaction, rights management, prompt playground, and more!

agent agentic-ai etl etl-pipeline llm prompt-engineering unstructured-data vector-database

Last synced: 08 May 2025

https://github.com/harishdeivanayagam/rowfill

Open-source unstructured data (PDFs, images, audio files) processing platform built for knowledge workers

document document-extraction document-parsing image-ocr langgraph llama llm nextjs ocr ocr-javascript ollama openai pdf pdfs unstructured unstructured-data vision vision-api

Last synced: 13 Apr 2025

https://github.com/velocitybolt/open-extract

Structured Data Extractor for AI Agents. Search your documents or the web for specific data and get it back in JSON or Markdown in a single tool call.

agent-tools ai autogen context-aware context-aware-structured-outputs crewai etl etl-automation etl-framework langchain langgraph llm openai python rag structured-outputs unstructured-data

Last synced: 03 Apr 2025

https://github.com/jostmey/dkm

Dynamic Kernel Matching (DKM) for Classifying Data with Non-conforming Features

dkm genomics machine-learning nonconforming-data repertoire statistical-classifiers tcell-receptors unstructured-data

Last synced: 18 Mar 2025

https://github.com/BartJongejan/Bracmat

Programming language for symbolic computation with an unusual combination of pattern matching features: tree patterns, associative patterns and expressions embedded in patterns.

bignumbers computer-algebra differentiation epoc expression-evaluator gcc high-level-language html json language-technology natural-language-processing pattern-matching programming-language rosettacode semi-structured-data structured-data symbolic-computation tree-structure unstructured-data xml

Last synced: 10 May 2025

https://github.com/osllmai/indox

Indox is an advanced search and retrieval technique that efficiently extracts data from diverse document types, including PDFs and HTML, using online or offline large language models such as OpenAI, Hugging Face, etc.

ai document index llm ml rag structured-data unstructured-data

Last synced: 10 Apr 2025

https://github.com/tuanacelik/unstructuredio-haystack

💙 Unstructured Data Connectors for Haystack 2.0

haystack llm nlp python unstructured-data

Last synced: 06 May 2025

https://github.com/nicbet/infozilla

The infoZilla unstructured software engineering data mining tool. It can find and extract source code regions, patches, stack traces, enumerations and itemizations from discussion threads.

bugreport bugzilla data-mining data-science tools unstructured-data

Last synced: 13 Oct 2025

https://github.com/sachinkalsi/html_tag_annotator

A machine learning tool for creating training datasets quickly and easily using a smart Chrome extension

annotations chrome-extension generate-training-data harvest html-tag-annotation html-text-annotator machine-learning scraper text-annotation train-dataset unstructured-data

Last synced: 28 Oct 2025

https://github.com/chaitjo/knowledge-graphs

Building Knowledge Graphs from Unstructured Text

knowledge-graph networkx neuralcoref spacy unstructured-data wikipedia

Last synced: 17 Aug 2025

https://github.com/rririanto/unstructured-demo-streamlit

Extract your docs (CSV, PDF, JSON, HTML, DOCS, Sheets and more) for your own GPT and LLM projects using Unstructured.io via Streamlit

ai data data-extraction gpt unstructured unstructured-data

Last synced: 09 Apr 2025

https://github.com/moindalvs/resume_screening_and_parser

Business objective: the document classification solution should significantly reduce manual effort in HRM and achieve a higher level of accuracy and automation with minimal human intervention. Sample data set details: resumes and financial documents.

data-science doc2txt doc2vec docx-converter docx-to-pdf docx2txt pdf-document-processor pdf2txt streamlit text text-analysis text-classification text-mining text-processing unstructured-data

Last synced: 23 Apr 2025

https://github.com/ntdls/katzebase

ACID compliant document-based database engine with SQL language, APIs and Management UI.

database json nosql rdbms unstructured-data

Last synced: 14 Apr 2025

https://github.com/hupe1980/go-textractor

📄 Amazon Textract response parser written in Go.

amazon aws golang parser textract unstructured-data

Last synced: 16 Apr 2025

https://github.com/kaloslazo/pyfusedb

Database system that combines structured data retrieval through inverted indexes with unstructured data (images, audio) search using multidimensional vector embeddings, all within a unified platform.

database inverted-index multidimensional python structures unstructured-data vector-embeddings

Last synced: 09 Aug 2025

https://github.com/b-cubed-eu/comp-unstructured-data

Scripts to explore the conditions that determine the reliability of models, trends and status by comparing aggregated cubes with structured monitoring schemes

data-cubes data-quality r rstats structured-data unstructured-data

Last synced: 01 Apr 2025

https://github.com/yeisonmontoya1815/special-topics-in-data-analytics

In my PDD Data Analytics studies at Douglas College, the Special Topics course stands out as a crucial component. This specialized module delves into advanced aspects of data analysis beyond the core curriculum, offering a deep exploration of intricate domains. Through this focused study, I aim to enhance my proficiency in handling complex datasets

analytics data-science jupyter-notebook python structured-data unstructured-data

Last synced: 07 Aug 2025

https://github.com/alexandrelamarre/fission

Data analytics & Structured streaming optimized for the Edge

data-analysis data-engineering rust structured-data unstructured-data

Last synced: 28 Feb 2025

https://github.com/rosette-api-community/rosette-for-docs

Google Docs add-on that lets users extract entities, translate names, and research entities on Wikipedia from within their multilingual documents.

entities entity-extraction extract-entities language machine-learning name-translation natural-language-processing nlp text-analytics unstructured-data

Last synced: 28 Feb 2025

https://github.com/thehousummer233/wikipedia-ai-agent

Wikipedia AI agent research assistant. LangChain's LangGraph's ReAct agent architecture, LLMs (OpenAI, Anthropic, Google), Wikipedia API, RAG with FAISS vector db, semantic chunking, GraphRAG, Streamlit frontend, terminal and web interfaces

claude deep-learning gemini large-language-model llama3-8b lm-studio nextjs notion-api openai python redis unstructured-data vector-database yfinance

Last synced: 03 Sep 2025

https://github.com/faisalman/re-parse-js

Compose structured data from unstructured text using regex-based pattern matching

parsing-text pattern-matching unstructured-data

Last synced: 18 Mar 2025

https://github.com/shivabajelan/uploading_file_to_azure_blob_using_python

In this repository, I show how to automate uploading unstructured data such as PDF or PNG files to Azure Blob using Python.

azure blob-storage cloud python storage-account unstructured-data upload-file

Last synced: 11 Jul 2025

https://github.com/mazzasaverio/terra-text-processor

A Terraform setup for processing unstructured data on GCP with MongoDB Atlas and Confluent Kafka, featuring serverless, event-driven architecture and Cloud Run integrations.

event-driven gcp iaas kafka mongodb-atlas terraform unstructured-data

Last synced: 11 Sep 2025

https://github.com/tinaland101/uk-food-directory-project

The core of this project is an analysis of data from the UK Food Standards Agency. The data includes food hygiene ratings of various establishments across the UK. Based on the ratings in the data, results are selected to suggest popular food choices.

mongodb nosql-database pymongo-database unstructured-data

Last synced: 03 Mar 2025

https://github.com/esteininger/file-processor

A Python library that uses AI to convert unstructured files (like PDFs, HTML, etc.) into structured data.

fastapi nlp unstructured-data

Last synced: 21 Jul 2025

https://github.com/francois-lenne/elt-mp4-quiberon

The goal of this project is to retrieve the video of the municipality of Quiberon and detect whether or not a person appears in it

bigquery cicd data-engineering docker elt google-cloud-functions google-cloud-platform google-cloud-run google-cloud-storage pipeline python sql unstructured-data

Last synced: 14 Jun 2025

https://github.com/perebaj/parser

Parse unstructured text using the GPT-3 API

golang llm unstructured-data

Last synced: 02 Nov 2025

https://github.com/airdac/mud

Subject repository with NLP Python apps. UPC - Master's Degree in Data Science - Mining Unstructured Data - Spring 2024

natural-language-processing nlp python unstructured-data upc

Last synced: 04 Mar 2025

https://github.com/b-cubed-eu/rsa-unstructured-data-comp

Scripts that compare aggregated cubes with structured monitoring schemes in South Africa

data-cubes data-quality r structured-data unstructured-data

Last synced: 02 Jul 2025

https://github.com/pintamonas4575/gestbd-project-maadm-upm

Project for the "Massive Data Systems Management" course of the UPM master's degree.

elasticsearch f1 formula1 linked-data postgresql structured-data unstructured-data

Last synced: 08 Apr 2025

https://github.com/davidmoserai/azuredocumentintelligencechunker

A lightweight Python library for metadata-rich document chunking in Retrieval-Augmented Generation (RAG) workflows. It leverages Azure AI Document Intelligence to enhance chunking by retaining hierarchical structure, page numbers, and bounding boxes for seamless integration with PDF viewers.

agent agents azure azure-ai-document-intelligence azure-ai-search chunking document-chunking langchain layout-parser layout-parsing llm production-grade python rag react react-pdf-viewer retrieval-augmented-generation unstructured-data

Last synced: 23 Apr 2025

https://github.com/wasay8/automatedgarbageimageclassifier

Implementation of CNN models (ResNet-34 and ResNet-50) to classify garbage images into 6 major categories, in support of sustainable development and waste disposal.

computer-vision deep-learning deep-neural-networks feature-engineering image-processing unstructured-data

Last synced: 26 Mar 2025

https://github.com/katelynfaulkner/rsa-unstructured-data-comp

Scripts that compare aggregated cubes with structured monitoring schemes in South Africa

data-cubes data-quality r structured-data unstructured-data

Last synced: 02 Mar 2025