{"id":13686800,"url":"https://github.com/AllenInstitute/openai_tools","last_synced_at":"2025-05-01T09:32:39.520Z","repository":{"id":154108059,"uuid":"627136376","full_name":"AllenInstitute/openai_tools","owner":"AllenInstitute","description":"Growing collection of scripts to summarize the scientific literature using large-language models like ChatGPT.","archived":false,"fork":false,"pushed_at":"2023-05-17T16:23:34.000Z","size":19494,"stargazers_count":112,"open_issues_count":1,"forks_count":14,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-16T02:58:40.021Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AllenInstitute.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-04-12T21:26:41.000Z","updated_at":"2024-10-02T10:35:21.000Z","dependencies_parsed_at":"2024-01-14T10:14:07.311Z","dependency_job_id":null,"html_url":"https://github.com/AllenInstitute/openai_tools","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenInstitute%2Fopenai_tools","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenInstitute%2Fopenai_tools/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenInstitute%2Fopenai_tools/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AllenInstitute%2Fopenai_tools/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alle
nInstitute","download_url":"https://codeload.github.com/AllenInstitute/openai_tools/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251852905,"owners_count":21654483,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T15:00:40.190Z","updated_at":"2025-05-01T09:32:34.513Z","avatar_url":"https://github.com/AllenInstitute.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"[![License](https://img.shields.io/badge/license-MIT-brightgreen)](LICENSE)\n\n\nThis repository contains basic scripts designed to investigate scientific \npublication PDFs using the ChatGPT API. It was created to share \nhelpful code for navigating the literature. \n**Please note that this code is experimental.** We encourage others to test it, \nand if you find it useful, feel free to share your feedback with us.\n\nInstalling\n========================\n\n1. First, create a conda environment:\n\n```conda create --name \u003cyour_env_name\u003e python=3.10```\n\n2. Then activate it: \n\n```conda activate \u003cyour_env_name\u003e```\n\n3. Install the package: \n\n```pip install .```\n\n4. We rely on the **adjustText** library for positioning labels, which currently \nonly works from the master branch, so install it using: \n\n```pip install https://github.com/Phlya/adjustText/archive/master.zip```\n\n5. Go to the scripts folder:\n\n```cd scripts```\n\n6. Copy your OpenAI API key into the .env file. 
You can find your key here: \nhttps://platform.openai.com/account/api-keys\n\nRunning scripts\n========================\n\nThis package contains a list of classes to facilitate exploring the literature\nusing Large Language Models. Currently, we are focusing on the OpenAI API, but we \nplan to extend this in the future. Classes are documented and tested, but we \nrecommend using our scripts first (in the scripts/ folder). Those scripts were\ndesigned to be simple to use and to facilitate growing a local database of \npublication data.  \n\n1. First, after installing, go to the scripts folder:\n\n```cd scripts```\n\n2. Three scripts are currently available. \n\n* The first one is run using\n\n```python pdf_summary.py --path_pdf \u003cpath_to_your_pdf\u003e --save_summary True```\n\nThis will save a small text file alongside your PDF, with the same filename \nbut a .txt extension. \n\n* The second one is run using \n\n```python list_pdfs_embedding.py --path_folder \u003cpath_to_your_pdfs\u003e --database_path \u003cpath_to_a_folder\u003e```\n\nThe embedding is saved automatically in the folder \nwith all your PDFs as ```tsne_embeddings.png```. Currently, this plot uses the \nfilename of each PDF to assign a label. \n\n* The third one is run using \n\n```python pubmed_embedding.py  --pubmed_query \"your query\" --field abstract --save_path  \u003cpath_to_a_png_file\u003e --database_path \u003cpath_to_a_folder\u003e```\n\nAlthough database_path is optional, we highly recommend choosing a folder on \nyour local hard drive to store your paper database. This will limit calls to \nthe Large Language Models and save you time. \nPaper embeddings could eventually scale to thousands of papers if you process \nyour entire literature. 
\n\nParameters for pdf_summary\n========================\n\n* **--path_pdf**: Path to a PDF file that you want to summarize.\n\nType: string\n\nDefault: os.path.join(script_path, '../example/2020.12.15.422967v4.full.pdf')\n\n* **--save_summary**: Save the generated summary in a txt file alongside the PDF \nfile.\n\nType: boolean\n\nDefault: True\n\n* **--save_raw_text**: Save the raw text in a txt file alongside the PDF file.\n\nType: boolean\n\nDefault: True\n\n* **--cut_bibliography**: Try not to summarize the bibliography at the end of the \nPDF file.\n\nType: boolean\n\nDefault: True\n\n* **--chunk_length**: Determines the final length of the summary by summarizing \nthe document in chunks. More chunks result in a longer summary but may lead \nto inconsistency across sections. Typically, 1 is a good value for an abstract, \nand 2 or 3 for more detailed summaries.\n\nType: integer\n\nDefault: 1\n\n* **--database_path**: Path to the database file. This is an optional argument. \nIf the path is not provided, no database will be used or created. If the path is \nprovided, the database will be created if it does not exist. If it exists, \nit will be loaded and used. Use this to grow your database of papers.\n\nType: str\n\nDefault: None\n\nParameters for pubmed_embedding\n========================\n\n* --**pubmed_query**: A query made to PubMed. This can return a very large number of \npublications; at most **max_results** of them will be processed. \n\nType: str\n\nDefault: None\n\n* --**field**: You can currently embed either the title or the abstract. Pass 'title'\nor 'abstract'. Your choice depends on how much detail you want your embedding \nto rely on. \n\nType: str\n\nDefault: abstract\n\n* --**database_path**: Path to the database file. This is an optional argument. \nIf the path is not provided, no database will be used or created. If the path is \nprovided, the database will be created if it does not exist. If it exists, \nit will be loaded and used. 
Use this to grow your database of papers.\n\nType: str\n\nDefault: None\n\n* --**save_path**: Path to a local image file. This is where the plot will be saved. \n\nType: str\n\nDefault: None\n\n* --**perplexity**: Perplexity for the t-SNE plot, default is 8. \nHigher values will make the plot more spread out. \nLower values will make the plot more clustered.\n\nType: int\n\nDefault: 8\n\n* --**max_results**: Maximum number of results to fetch from PubMed.\n\nType: int\n\nDefault: 100\n\nExamples\n========================\n\n* **paper_summary**\n\nSee the example/ folder for example runs. \nThere is a short example (about the length of a typical abstract) and a long \nsummary (using chunk_length 4). \n\n* **list_pdfs_embedding**\n\nHere is an example embedding for approximately 100 publications, mostly \non in vivo calcium imaging of neuronal activity.\n\n![Example embedding](example/tsne_embeddings_example.png)\n\n* **pubmed_embedding**\n\nHere is an example embedding for the following command:\n```python pubmed_embedding.py --pubmed_query \"In vivo two photon voltage imaging\" --field abstract --save_path ../example/twophotonvoltage.html --perplexity 5```\n\nBelow is a screenshot. [The generated plot is interactive](https://alleninstitute.github.io/openai_tools/example/twophotonvoltage.html) \nso you can explore the content of each paper. Download the HTML file to your machine and open it in a browser.\n![Example embedding](example/twophotonvoltage.png)\n\nHere is a second embedding, for a very large number of papers, generated by the following command:\n```python pubmed_embedding.py --pubmed_query \"Mouse visual cortex\" --field abstract --save_path ../example/mouse_invivo_recordings.html --max_results 3500 --perplexity 30```\n\nBelow is a screenshot. [The generated plot is interactive](https://alleninstitute.github.io/openai_tools/example/mouse_invivo_recordings.html) \nso you can explore the content of each paper. 
Download the HTML file to your machine and open it in a browser.\n![Example embedding](example/mouse_invivo_recordings.png)\n\nDatabase\n========================\nThe papers_extractor module uses a straightforward database created with the \nDiskCache Python package. By specifying the database_path, the module will \nhandle everything for you. Its purpose is to compile a collection of papers, \ncomplete with statistics and numerical embeddings. Additionally, it \nautomatically limits the number of calls made to LLM APIs, effectively \nreducing costs.  Any previously processed PDF is \nautomatically cached. To determine whether a PDF was previously processed, it uses\na hash of the PDF's content, so you can freely rename files.\n\nYou can store your database (or several) at any location on your local \ndrive using the **database_path** parameter. \n\nHow to contribute?\n========================\n\n1. First go to tests/ and read the README.md \n\n2. Make a PR against main. This will run CI through GitHub Actions. \n\nCredits\n========================\nThis repository was started by Jerome Lecoq on April 12th, 2023. \nPlease reach out to jeromel@alleninstitute.org with any questions. \nIf this is useful to you, :wave: are welcome!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAllenInstitute%2Fopenai_tools","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAllenInstitute%2Fopenai_tools","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAllenInstitute%2Fopenai_tools/lists"}