{"id":25668385,"url":"https://github.com/celiason/museum-news","last_synced_at":"2026-05-16T00:04:06.695Z","repository":{"id":263873246,"uuid":"890601317","full_name":"celiason/museum-news","owner":"celiason","description":"webapp to find out historic details about the museum","archived":false,"fork":false,"pushed_at":"2024-11-25T23:32:20.000Z","size":6688,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-26T00:27:10.194Z","etag":null,"topics":["ai","chatbot","embedding-models","llms","rag"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/celiason.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-18T21:21:38.000Z","updated_at":"2024-11-25T23:32:23.000Z","dependencies_parsed_at":"2024-11-26T00:27:30.570Z","dependency_job_id":null,"html_url":"https://github.com/celiason/museum-news","commit_stats":null,"previous_names":["celiason/museum_news","celiason/museum-news"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celiason%2Fmuseum-news","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celiason%2Fmuseum-news/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celiason%2Fmuseum-news/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/celiason%2Fmuseum-news/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/celiason","download_url":"https://codeload.github.com/celiason/museum-news/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240461622,"owners_count":19805115,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","chatbot","embedding-models","llms","rag"],"created_at":"2025-02-24T10:32:12.647Z","updated_at":"2026-05-16T00:04:06.630Z","avatar_url":"https://github.com/celiason.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Voices from the Field\n\n![](assets/field-museum-bw.png)\n\n## Introduction\n\nI’ve always been passionate about research and museums. To me, museums are like treasure troves, holding the keys to mysteries and questions about the natural world and the people who shaped it.\n\nAt the Field Museum, I’m constantly inspired by the stories behind our oldest specimens and the historic events that have shaped the museum, from its founding during the 1893 World’s Columbian Exposition to the remarkable discoveries that followed.\n\nBut uncovering these stories isn’t easy. They’re often hidden in old documents, preserved as PDF files or buried deep in archives—fascinating, but not always accessible.\n\nThat’s why I created this project: an AI chatbot that brings the museum’s rich history to life. It’s a way to make the “old” accessible and engaging, letting anyone explore the mysteries of the museum in a fun and interactive way.\n\nI’d love for you to give it a try—and who knows? Maybe I’ll see you at the museum! 
\n\n## Scraping \n\nThe documents were easily accessible from the [Biodiversity Heritage Library](https://www.biodiversitylibrary.org). Here is a sample page from a news publication from January 1935:\n\n![](assets/cassowary_clip.png)\n\n## Problems (and solutions) with ingesting text\n\nThe problem with old PDF files is that they were often formatted in ways we don't use today, and sometimes the text is not searchable. I tried several ways of extracting text from PDFs.\n\n### Take 1\n\nFirst, I used the popular `pypdf2` package in Python. This was really fast, but the problem was that the text often ran together. Here's an example:\n\n\u003e NEWTAXIDERMY METHOD APPLIED TOCASSOWARY PRESERVES LIFECOLORS\\nByKarl P.Schmidt\\nAssistant Curator ofReptiles\\nAnewspecimenofthelarge flightless bird\\ncalled thecassowary wasrecently placed...\n\n\u003c!-- \u003ccode style=\"color: darkorange\"\u003etext\u003c/code\u003e --\u003e\n\n### Take 2\n\nNext up, I tried the `pymupdf4llm` package, which was designed to extract text from PDFs specifically for use with large language models (LLMs), like the chatbot I was designing. Unfortunately, many of the PDF files had multiple columns, and instead of ingesting the text in the correct reading order, the algorithm would read horizontally across the page. Needless to say, this often caused confusion. Here is an example of what I mean (notice the bolded text):\n\n\u003e NEW TAXIDERMY METHOD APPLIED TO CASSOWARY PRESERVES LIFE COLORS\\n         By Karl P. Schmidt __River, was skinned and preserved. The__ Museum\'s taxidermy staff, into an exhibit...\n\n\u003c!-- ![](figs/text_pymupdf4llm.png) --\u003e\n\nAt first I thought, this will be fine, the LLM will figure out what I mean and correctly interpret the texts. But I was wrong. Even AI needs a little help from humans!\n\n### Take 3\n\nFinally, I tried an alternative, `pymupdf`. To my surprise, it worked really well. 
There wasn't the problem of missing spaces between words that `pypdf2` had, and the columns were correctly traversed, unlike the specialized `pymupdf4llm` package. Here is an example of some extracted text:\n\n\u003e NEW TAXIDERMY METHOD APPLIED TO CASSOWARY PRESERVES\\nLIFE COLORS\\nBy Karl P. Schmidt\\nAssistant Curator of Reptiles\\nA new specimen of the large flightless bird\\ncalled the cassowary was recently...\n\nNow that we have text, on to the next steps: chunking and embedding!\n\n## Text chunking\n\nExtracting a PDF returns a long string of text that can really slow down an LLM. To get around this, I used chunking, which breaks the text into chunks of a predefined size. Chunks that are too big (e.g., whole PDFs) cause problems, but chunks that are too small can also be tricky. Imagine if we just broke up the whole PDF into single characters. For example, \"Museums are great\" could become: `text = [\"M\", \"u\", \"s\", \"e\", \"u\", \"m\", \"s\", \" \", \"a\", \"r\", \"e\", ...]`, and so on. The problem here is that the AI would have no way to match a question like \"Is the museum great?\" to just a bunch of letters. Retrieval-augmented generation (RAG) works by matching chunks of text to the meaning interpreted from what a person types in. \n\nSo what is the optimum chunk size then? One way to approach this would be to try a bunch of different chunk sizes and see how each affects the way the AI works. I may end up trying this next, but for now I was excited to come across a paper on arXiv by Xiaohua Wang and colleagues ([https://arxiv.org/abs/2407.01219](https://arxiv.org/abs/2407.01219)). 
The paper explores best practices for RAG, including chunk sizes, and I ended up going with their settings (512 characters in a chunk, 20 overlapping characters to capture meaning).\n\n## Embedding\n\nI tried several embedding models available on Hugging Face and implemented in the `sentence-transformers` package (all-MiniLM-L6-v2, bge-small-en-v1.5, gte-base-en-v1.5). They all gave roughly similar results in terms of retrieving relevant chunks of text, so I decided to stick with a relatively small embedding model (all-MiniLM-L6-v2, 384 dimensions) to save space.\n\n## Connecting an LLM\n\nI considered several possible LLMs (GPT-4o, GPT-4o mini, Llama 3, Claude 3.5). Since I wanted to make this free for people to use, I steered away from commercial APIs and instead focused on free LLMs available on Hugging Face. I landed on the Falcon-7B-Instruct model ([https://huggingface.co/tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)). This model was good for chat-based applications, was trained on a relatively large data set, and had good performance according to the Hugging Face leaderboards.\n\n## Try it out!\n\nThe website is now live as a Streamlit app! Feel free to try it out [here](https://voices-from-the-field.streamlit.app)!\n\n\n\n## Future steps\n\n1. __Connect a specimen database__: We might be interested in questions like: How many birds are there? What was the first bird ever collected? Who collected the most specimens? What famous collectors were working in the 1920s?\n\n2. __Compare chunking methods, LLMs__: Other models might perform better.\n\n3. __Add metadata__: Right now, the LLM has trouble understanding context about the documents, such as what year and month they came from. 
I am thinking of trying regular expressions (or pulling metadata from BHL).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceliason%2Fmuseum-news","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceliason%2Fmuseum-news","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceliason%2Fmuseum-news/lists"}