{"id":13408172,"url":"https://github.com/dtflare/GPTparser","last_synced_at":"2025-03-14T12:32:12.945Z","repository":{"id":224563191,"uuid":"760538434","full_name":"dtflare/GPTparser","owner":"dtflare","description":"Use GPTparser with your OpenAI API to scrape \u0026 parse files into structured JSON files.","archived":false,"fork":false,"pushed_at":"2024-04-02T05:10:31.000Z","size":177,"stargazers_count":9,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-07-31T14:19:54.838Z","etag":null,"topics":["dataset-creation","json-mode","json-parser","openai-api-chatbot","website-scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dtflare.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-20T15:51:38.000Z","updated_at":"2024-06-23T02:31:01.000Z","dependencies_parsed_at":"2024-04-02T16:57:31.610Z","dependency_job_id":null,"html_url":"https://github.com/dtflare/GPTparser","commit_stats":null,"previous_names":["dtflare/parsr","dtflare/gptparser"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dtflare%2FGPTparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dtflare%2FGPTparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dtflare%2FGPTparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dtflare%2FGPTparser/manifests","owner_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/owners/dtflare","download_url":"https://codeload.github.com/dtflare/GPTparser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243578533,"owners_count":20313845,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-creation","json-mode","json-parser","openai-api-chatbot","website-scraper"],"created_at":"2024-07-30T20:00:51.202Z","updated_at":"2025-03-14T12:32:11.197Z","avatar_url":"https://github.com/dtflare.png","language":"Python","funding_links":[],"categories":["\u003ca name=\"ai\"\u003e\u003c/a\u003eAI / ChatGPT"],"sub_categories":[],"readme":"# GPTparser for OPENAI Fine-Tune API #\n- GPTparser is an AI parsing tool to create datasets for OpenAI Fine-Tuning.\n- GPTparser allows you to scrape and parse webpage text into individual JSON files in Chat Completions format.\n- OpenAI Fine-Tune API requires data to be in valid JSON, Chat Completions API format, with a .jsonl dataset where each line represents a unique JSON object containing a topic within a message array.\n\n# GPTparser Overview #\n- GPTparser works great for scraping / parsing text content directly from any website into individual JSON files.\n- GPTparser is very simple to use:\n\t1. Recommended: Activate Miniconda Environment\n\t2. Recommended: export OpenAI API within Miniconda Session\n\t3. 
To use, e.g.: GPTparser https://url.com output_file.json\n- The GPTparser script uses 'Few Shot Prompting'; to adjust the data output, adjust the examples \u0026 prompt (follow the 'modifying for local use' directions).\n\t- The current prompt is carefully curated to accurately parse text into Chat Completions, JSON API.\n - I suggest using Miniconda Environments for security \u0026 version / dependency control.\n \t- FYI, for beginners this isn't required, and if you choose not to use these virtual environments then be sure to:\n  \t\t1. Install Dependency(1) \u0026 Dependency(2) below before using GPTparser\n    \t\t2. Securely activate your OpenAI API Key, and store it.\n\n\n\n- **This was built using Linux Ubuntu, please adjust the below directions per your OS.**\n- *GPTparser shares no professional affiliation with OpenAI.*\n\n# Why #\n- I created GPTparser because I couldn't find a tool that enabled me to efficiently scrape and parse content directly from URLs into structured JSON files.\n- Working directly from your Linux CLI with individually parsed files allows for more granular control over the data.\n- GPTparser was created for ease of use, to be cost effective, and to enable effective quality control to enhance access to OpenAI's Fine-tuning service and dataset curation techniques.\n\t1. GPTparser enables you to create a large 50k+ token dataset and fine-tune via OpenAI API for \u003c$5.\n- See the /scripts directory for further tools to help you format, validate, and combine these files into one dataset in proper Chat Completions JSON API format.\n- I'd like to gather a small community of people interested in making complex AI workflows more accessible, \u0026 in the near future I'll be designing a website / UI to host this project.\n\t- If you're interested in connecting, reach out! 
-- websitegithub.happily959@passinbox.com\n\n# GPTparser Installation #\n## Pip Linux Directions: ## \n- Create a new Miniconda environment (Conda Env not required, but highly recommended):\n  \t1. $ conda create --name GPTparser python=3.8\n- Activate Environment:\n  \t1. $ conda activate GPTparser\n- Install Package:\n\t1. $ pip install GPTparser\n- Export your OpenAI API Key within your Miniconda environment - it will expire when the session ends:\n\t- Must be done every time you start a Miniconda session / start using GPTparser\n\t1. $ export OPENAI_API_KEY=\u003center_api_key\u003e\n- To use GPTparser - first create and cd into the directory that will host your parsed files - then:\n\t1. $ GPTparser https://url.com output_file.json  \n\n\n## GitHub Linux Directions: ## \n- Clone Repo:\n\t1. $ git clone git@github.com:dtflare/GPTparser.git\n- Navigate into Root Directory of Cloned Repo:\n  \t1. $ cd GPTparser\n- Choose **one of the two** options below to create a new Miniconda environment (Conda Env not required, but highly recommended):\n  \t1. $ conda env create -f environment.yml\n  \t2. $ conda create --name GPTparser python=3.8\n- Activate Environment:\n  \t1. $ conda activate GPTparser\n- Once in /GPTparser, install the package/dependencies (Miniconda env recommended but not required):\n\t1. $ pip install .\n- Export your OpenAI API Key within your Miniconda environment - it will expire when the session ends:\n\t- Must be done every time you start a Miniconda session / start using GPTparser\n\t1. $ export OPENAI_API_KEY=\u003center_api_key\u003e\n- To use GPTparser - first create and cd into the directory that will host your parsed files - then:\n\t1. 
$ GPTparser https://url.com output_file.json\n\n- Any time you use GPTparser in the future, all you have to do is activate the correct Miniconda env and export your API key.\n\n\n\n### For those modifying GPTparser for local use, follow the directions below ###\n- Adjust prompts as needed; currently it outputs OpenAI's Chat Completions JSON format for Fine-Tuning.\n\t1. These directions are for those planning to edit the examples/prompt for different output, and/or create a new Miniconda Env.\n- **Once changes are applied add GPTparser to $PATH.**\n- Activate Miniconda Environment\n  \t1. With YAML file in GPTparser repo:\n  \t\t- $ conda env create -f environment.yml\n  \t2. Create your own:\n\t\t- $ conda create --name GPTparser python=3.8\n- Once the GPTparser Miniconda session is activated, install Dependency(1) \u0026 Dependency(2) below.\n  \t1. If you use other Python tools within your Conda Env, reinstall the dependencies at the start of every session.\n  \t   \t- For best results, don't use other tools within your GPTparser Conda Env.\n\t3. Dependency(1)\n\t\t- $ pip install langchain==0.1.4 deeplake openai==1.10.0 tiktoken\n\t4. Dependency(2), Langchain's newspaper module:\n\t\t- $ pip install -q newspaper3k python-dotenv\n- Add API KEY:\n\t1. $ export OPENAI_API_KEY=\u003center_api_key\u003e\n- mkdir \u0026 cd into parsed files host directory, then with GPTparser in $PATH:\n\t1. $ GPTparser https://url.com output_file.json\n- If GPTparser is not in $PATH, simply use:\n\t1. $ ./GPTparser https://url.com output_file.json\n\n\n \n### Reminder ###\n- To run GPTparser after installation, or once added to your global path:\n\t1. $ GPTparser website_url.com file_name.json\n- OR:\n\t1. $ GPTparser website_url.com\n\n\n## After you have used GPTparser to parse your webpages into documents, utilize the scripts in /scripts to turn them into a usable dataset. 
##\n**View comments inside individual scripts in /scripts for use directions**\n- j_val = JSON validator for individual .json files within Directory\n- combinR = combines all individual files into 1 .jsonl dataset\n- wordcount = use to get accurate word count of your combined dataset\n\n### Contributions ###\nPlease feel free to contribute! Whether it's code, directions, etc., credit will always be given.\nPlease reach out if you'd like to collab on any AI related projects: websitegithub.happily959@passinbox.com\n\nTo contribute:\n\n    Open a new issue to start a discussion around a feature idea or a bug.\n    Fork the repo and start making your changes to a new branch.\n    Include a test showing the fix \u0026 features working properly.\n    Send a pull request!\n    \n\nSubmit feature requests via the issues tab.\nSubmit Security Issues directly to my email: websitegithub.happily959@passinbox.com\n\n### If you use GPTparser for your dataset please share it - or your experience - with the community! ###\n\n### Citation ###\nIf you use GPTparser or associated tools to create a dataset or wish to refer to the baseline results published here, please use the following citation:\n\n@dtflare{GPTparser,\nauthor = {Daniel Flaherty},\ntitle = {GPTparser},\nyear = {2024},\npublisher = {GitHub},\njournal = {GitHub repository},\nhowpublished = {\\url{https://github.com/dtflare/GPTparser}}\n}\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdtflare%2FGPTparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdtflare%2FGPTparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdtflare%2FGPTparser/lists"}