{"id":43119296,"url":"https://github.com/andrewhinh/captafied","last_synced_at":"2026-01-31T19:10:38.824Z","repository":{"id":64845953,"uuid":"574781806","full_name":"andrewhinh/captafied","owner":"andrewhinh","description":"Multimodal Table Understanding","archived":false,"fork":false,"pushed_at":"2024-02-22T22:57:05.000Z","size":6647,"stargazers_count":8,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-04-16T02:04:53.448Z","etag":null,"topics":["data-science","python"],"latest_commit_sha":null,"homepage":"https://captafied.onrender.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andrewhinh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-12-06T03:55:25.000Z","updated_at":"2024-04-16T02:04:53.449Z","dependencies_parsed_at":"2024-02-22T23:48:28.853Z","dependency_job_id":null,"html_url":"https://github.com/andrewhinh/captafied","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/andrewhinh/captafied","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhinh%2Fcaptafied","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhinh%2Fcaptafied/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhinh%2Fcaptafied/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhinh%2Fcaptafied/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andrewhinh","download_url":
"https://codeload.github.com/andrewhinh/captafied/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhinh%2Fcaptafied/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28950596,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-31T18:30:42.805Z","status":"ssl_error","status_checked_at":"2026-01-31T18:30:19.593Z","response_time":128,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-science","python"],"created_at":"2026-01-31T19:10:38.748Z","updated_at":"2026-01-31T19:10:38.810Z","avatar_url":"https://github.com/andrewhinh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# captafied\n\n\u003chttps://user-images.githubusercontent.com/40700820/214430538-da18d31c-2e7e-4511-a307-80f0903e61a4.mov\u003e\n\n## Contents\n\n- [captafied](#captafied)\n  - [Contents](#contents)\n  - [Description](#description)\n    - [Inference Pipeline](#inference-pipeline)\n    - [Usage](#usage)\n  - [Development](#development)\n    - [Contributing](#contributing)\n    - [Setup](#setup)\n    - [Repository Structure](#repository-structure)\n    - [Workflows](#workflows)\n    - [Code Style](#code-style)\n  - [Credit](#credit)\n\n## Description\n\nA website that helps users understand their spreadsheet data without the 
learning curve of data processing and visualization tools such as Excel or Python.\n\n### Inference Pipeline\n\nThe user submits a table and a request about it. Optionally, they can also upload an image for similarity search, classification, etc. Then, we use [OpenAI's API](#credit) to generate Python code that returns one or more of the following:\n\n- pandas DataFrames\n- Python strings/f-strings\n- Plotly graphs\n- Images\n\nthat can be used to answer the user's request. If something fails in this process, we use [pandas-profiling](#credit) to generate a descriptive table profile that helps the user understand their data.\n\n### Usage\n\nSome notes about submitting inputs to the pipeline:\n\n- Only [long-form data](https://seaborn.pydata.org/tutorial/data_structure.html#long-form-vs-wide-form-data) is currently supported because we rely on [OpenAI's API](#credit) for many tasks, and the API doesn't see the data itself. Rather, it only has access to the variables associated with the data.\n- Tables can only be submitted as .csv, .xls(x), .tsv, and .ods files.\n- Images can only be submitted as .png, .jpeg/.jpg, .webp, and non-animated .gif files.\n- At most 150,000 rows and 30 columns of data can be submitted at one time.\n\nSome examples of requests and questions that the pipeline can handle (these use the example table found in the repo and the website):\n\n- Add 10 stars to all the repos that have summaries longer than 10 words.\n  - Of the repos you just added stars to, which ones have the most stars?\n- Which rows have summaries longer than 10 words?\n  - Of the rows you just selected, which ones were released in 2020?\n- Does the Transformers repo have the most stars?\n  - What about the least?\n- What does the distribution of the stars look like?\n  - Center the title.\n- What does the Transformers icon look like?\n  - Make it half as tall.\n- How much memory does the dataset use?\n  - What's this number 
in MB?\n\n## Development\n\n### Contributing\n\nTo contribute, check out the [guide](./CONTRIBUTING.md).\n\n### Setup\n\n1. Install conda if necessary:\n\n   ```bash\n   # Install conda: https://conda.io/projects/conda/en/latest/user-guide/install/index.html#regular-installation\n   # If on Windows, install Chocolatey: https://chocolatey.org/install. Then, run:\n   # choco install make\n   ```\n\n2. Create the conda environment locally:\n\n   ```bash\n   cd captafied\n   make conda-update\n   conda activate captafied\n   make pip-tools\n   export PYTHONPATH=.\n   echo \"export PYTHONPATH=.:$PYTHONPATH\" \u003e\u003e ~/.bashrc\n   ```\n\n3. Install pre-commit:\n\n   ```bash\n   pre-commit install\n   ```\n\n4. Sign up for an OpenAI account and get an API key [here](https://beta.openai.com/account/api-keys).\n5. Populate a `.env` file with your key and the backend URL in the format of `.env.template`, and reactivate the environment.\n6. (Optional) Sign up for an AWS account [here](https://us-west-2.console.aws.amazon.com/ecr/create-repository?region=us-west-2) and set up your AWS credentials locally, referring to [this](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html#cli-configure-quickstart-config) as needed:\n\n   ```bash\n   aws configure\n   ```\n\nIf the instructions aren't working for you, head to [this Google Colab](https://colab.research.google.com/drive/1Z34DLHJm1i1e1tnknICujfZC6IaToU3k?usp=sharing), make a copy of it, and run the cells there to get an environment set up.\n\n### Repository Structure\n\nThe repo is separated into main folders that each describe a part of the ML-project lifecycle, some of which contain interactive notebooks, and supporting files and folders that store configurations and workflow scripts:\n\n```bash\n.\n├── backend\n    ├── deploy      # the AWS Lambda backend setup and continuous deployment code.\n        ├── api_serverless  # the backend handler code using AWS Lambda.\n    ├── inference   # the 
inference code.\n    ├── load_test   # the load testing code using Locust.\n    ├── monitoring  # the model monitoring code.\n├── frontend        # the frontend code using Dash.\n├── tasks           # the pipeline testing code.\n```\n\n### Workflows\n\n- To start the app locally (uncomment code in `PredictorBackend.__init__` and set `use_url=False` to use the local model instead of the API):\n\n  ```bash\n  python frontend/app.py\n  ```\n\n- To log in to AWS before deploying:\n\n  ```bash\n  . ./backend/deploy/aws_login.sh\n  ```\n\n- To deploy the backend to AWS Lambda:\n\n  ```bash\n  python backend/deploy/aws_lambda.py\n  ```\n\n### Code Style\n\n- To lint the code:\n\n  ```bash\n  pre-commit run --all-files\n  ```\n\n## Credit\n\n- OpenAI for their [API](https://openai.com/api/).\n- YData for their [pandas-profiling](https://github.com/ydataai/pandas-profiling) package.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewhinh%2Fcaptafied","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewhinh%2Fcaptafied","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewhinh%2Fcaptafied/lists"}