{"id":24312811,"url":"https://github.com/floressek/gmail-extractor","last_synced_at":"2026-02-06T23:02:52.299Z","repository":{"id":271465850,"uuid":"858984880","full_name":"Floressek/Gmail-extractor","owner":"Floressek","description":"Script for extracting information from emails to be then processed using OCR, fitted to custom JSON and presented in Google Sheets, version on master is to be deployed on railway","archived":false,"fork":false,"pushed_at":"2025-01-07T22:49:52.000Z","size":81,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-08T11:44:43.977Z","etag":null,"topics":["gmail","google-gmail-api","imap","ocr","openai","zod"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Floressek.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-17T21:53:26.000Z","updated_at":"2025-01-07T22:49:55.000Z","dependencies_parsed_at":"2025-01-07T23:31:41.485Z","dependency_job_id":"afd57d16-6faf-4c58-9202-1abd7bd80bb9","html_url":"https://github.com/Floressek/Gmail-extractor","commit_stats":null,"previous_names":["floressek/gmail-extractor"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Floressek/Gmail-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Floressek%2FGmail-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Floressek%2FGmail-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/Floressek%2FGmail-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Floressek%2FGmail-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Floressek","download_url":"https://codeload.github.com/Floressek/Gmail-extractor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Floressek%2FGmail-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29179569,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T22:12:24.066Z","status":"ssl_error","status_checked_at":"2026-02-06T22:12:09.859Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["gmail","google-gmail-api","imap","ocr","openai","zod"],"created_at":"2025-01-17T08:29:58.134Z","updated_at":"2026-02-06T23:02:52.282Z","avatar_url":"https://github.com/Floressek.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Gmail Extractor\r\n\r\n## Overview\r\nGmail Extractor is an automated system for processing email attachments from a Gmail account. It downloads attachments, processes them based on their file type, and saves the processed data in a structured format. 
The system is designed to handle various file types including PDFs, Word documents, Excel spreadsheets, CSVs, and images.\r\n\r\n## Table of Contents\r\n1. [Project Structure](#project-structure)\r\n2. [Prerequisites](#prerequisites)\r\n3. [Installation](#installation)\r\n4. [Configuration](#configuration)\r\n   - [Environment Variables](#environment-variables)\r\n   - [Google Cloud Console Setup](#google-cloud-console-setup)\r\n   - [Credentials File](#credentials-file)\r\n   - [Gmail Account Settings](#gmail-account-settings)\r\n5. [Usage](#usage) + [Process Flow](#process-flow)\r\n6. [File Processing](#file-processing)\r\n7. [Troubleshooting](#troubleshooting)\r\n8. [Deployment](#deployment)\r\n9. [Contributing](#contributing)\r\n10. [License](#license)\r\n\r\n## Project Structure\r\n```\r\ngmail-extractor/\r\n│\r\n├── config/\r\n│   └── constants.js\r\n│\r\n├── logs/\r\n│\r\n├── src/\r\n│   ├── attachments/\r\n│   │   ├── fileHandler/\r\n│   │   │   ├── imageHandler.js\r\n│   │   │   ├── pdfHandler.js\r\n│   │   │   ├── spreadsheetHandler.js\r\n│   │   │   └── wordHandler.js\r\n│   │   └── attachmentProcessor.js\r\n│   │\r\n│   ├── auth/\r\n│   │   └── authHandler.js\r\n│   │\r\n│   ├── email/\r\n│   │   ├── emailProcessor.js\r\n│   │   ├── imapListener.js\r\n│   │   └── resetEmailsAndAttachments.js\r\n│   │\r\n│   ├── google-sheets/\r\n│   │   └── google-sheets-api.js\r\n│   │\r\n│   ├── utils/\r\n│   │   ├── combineEmailData.js\r\n│   │   ├── convertPdfToImage.js\r\n│   │   ├── createDataDirectories.js\r\n│   │   ├── deleteFile.js\r\n│   │   ├── fileUtils.js\r\n│   │   └── logger.js\r\n│   │\r\n│   └── zod-json/\r\n│       ├── emailDataProcessor.js\r\n│       └── emailDataSchema.js\r\n│\r\n├── .env\r\n├── .gitignore\r\n├── credentials.json\r\n├── Dockerfile\r\n├── index.js\r\n├── package.json\r\n├── README.md\r\n└── token.json\r\n```\r\n\r\n## Prerequisites\r\n- Node.js (v14 or later)\r\n- npm or Yarn\r\n- A Gmail account\r\n- Google Cloud Console project 
with Gmail API enabled\r\n\r\n## Installation\r\n1. Clone the repository:\r\n   ```\r\n   git clone https://github.com/yourusername/gmail-extractor.git\r\n   cd gmail-extractor\r\n   ```\r\n\r\n2. Install dependencies:\r\n   ```\r\n   npm install\r\n   ```\r\n   or if you're using Yarn:\r\n   ```\r\n   yarn install\r\n   ```\r\n\r\n3. Copy the `.env.example` file to `.env`:\r\n   ```\r\n   cp .env.example .env\r\n   ```\r\n\r\n## Configuration\r\n\r\n### Environment Variables\r\nEdit the `.env` file and fill in your specific details:\r\n- `EMAIL_ADDRESS`: Your Gmail address\r\n- `PROCESSED_DIR`: Directory for processed attachments (e.g., `processed_attachments`)\r\n- Add any other necessary environment variables\r\n\r\n### Google Cloud Console Setup\r\n1. Go to the [Google Cloud Console](https://console.cloud.google.com/).\r\n2. Create a new project or select an existing one.\r\n3. Enable the Gmail API for your project.\r\n4. Go to \"Credentials\" and create an OAuth 2.0 Client ID.\r\n5. Set up the OAuth consent screen if prompted.\r\n6. For \"Application type\", choose \"Web application\".\r\n7. Add `http://localhost:3000/auth/google/callback` to the \"Authorized redirect URIs\".\r\n\r\n### Credentials File\r\nCreate a `credentials.json` file in the root directory with the following structure:\r\n```json\r\n{\r\n  \"web\": {\r\n    \"client_id\": \"YOUR_CLIENT_ID.apps.googleusercontent.com\",\r\n    \"project_id\": \"your-project-name\",\r\n    \"auth_uri\": \"https://accounts.google.com/o/oauth2/auth\",\r\n    \"token_uri\": \"https://oauth2.googleapis.com/token\",\r\n    \"auth_provider_x509_cert_url\": \"https://www.googleapis.com/oauth2/v1/certs\",\r\n    \"client_secret\": \"YOUR_CLIENT_SECRET\",\r\n    \"redirect_uris\": [\"http://localhost:3000/auth/google/callback\"]\r\n  }\r\n}\r\n```\r\n\r\n### Gmail Account Settings\r\n1. Enable IMAP in your Gmail settings.\r\n2. 
If not using OAuth, create an App Password:\r\n   - Go to your Google Account settings.\r\n   - Select \"Security\".\r\n   - Under \"Signing in to Google,\" select \"App Passwords\".\r\n   - Generate a new App Password for \"Mail\" and \"Other (Custom name)\".\r\n   - Use this password in your `.env` file instead of your regular Gmail password.\r\n\r\n## Usage\r\nTo start the Gmail extractor:\r\n```\r\nnpm start\r\n```\r\nOn first run, you'll be prompted to authorize the application. Follow the URL provided in the console to complete the OAuth2 flow.\r\n\r\n## Process Flow\r\n\r\nBelow is a sequence diagram illustrating the main process flow of the Gmail Extractor:\r\n\r\n```mermaid\r\nsequenceDiagram\r\n   participant User\r\n   participant ImapListener\r\n   participant EmailProcessor\r\n   participant AttachmentProcessor\r\n   participant FileHandlers\r\n   participant AuthHandler\r\n   participant ZodProcessor\r\n   participant OpenAIProcessor\r\n   participant Gmail\r\n   participant GoogleSheets\r\n\r\n   User-\u003e\u003eImapListener: Start application\r\n   ImapListener-\u003e\u003eAuthHandler: Request authentication\r\n   AuthHandler-\u003e\u003eGmail: Authenticate (OAuth2)\r\n   Gmail--\u003e\u003eAuthHandler: Return access token\r\n   AuthHandler--\u003e\u003eImapListener: Authentication successful\r\n\r\n   loop Listen for new emails\r\n      ImapListener-\u003e\u003eGmail: Check for new emails\r\n      Gmail--\u003e\u003eImapListener: New email notification\r\n      ImapListener-\u003e\u003eEmailProcessor: Process new email\r\n      EmailProcessor-\u003e\u003eGmail: Fetch email content\r\n      Gmail--\u003e\u003eEmailProcessor: Return email content\r\n      EmailProcessor-\u003e\u003eAttachmentProcessor: Process attachments\r\n      AttachmentProcessor-\u003e\u003eFileHandlers: Handle specific file types\r\n      FileHandlers--\u003e\u003eAttachmentProcessor: Return processed data\r\n      AttachmentProcessor--\u003e\u003eEmailProcessor: Return 
processed attachments\r\n      EmailProcessor-\u003e\u003eEmailProcessor: Combine email data (all_{emailId}.json)\r\n      EmailProcessor-\u003e\u003eZodProcessor: Validate combined data\r\n      ZodProcessor--\u003e\u003eEmailProcessor: Return validated data\r\n      EmailProcessor-\u003e\u003eOpenAIProcessor: Process data with OpenAI\r\n      OpenAIProcessor--\u003e\u003eEmailProcessor: Return structured data\r\n      EmailProcessor-\u003e\u003eEmailProcessor: Save processed_offer_{emailId}.json\r\n      EmailProcessor-\u003e\u003eGoogleSheets: Update spreadsheet with processed data\r\n      GoogleSheets--\u003e\u003eEmailProcessor: Confirmation\r\n   end\r\n\r\n   ImapListener-\u003e\u003eUser: Notification of processed emails\r\n```\r\n\r\n## File Processing\r\nThe system processes the following file types:\r\n- PDF: Handled by `pdfHandler.js`\r\n- Word (.doc, .docx): Handled by `wordHandler.js`\r\n- Excel (.xls, .xlsx), CSV: Handled by `spreadsheetHandler.js`\r\n- Images (.png, .jpg, .jpeg): Handled by `imageHandler.js`\r\n\r\nProcessed files and their extracted data are managed by `attachmentProcessor.js`.\r\n\r\n## Troubleshooting\r\n- If you encounter authentication issues, ensure your `credentials.json` file is correctly set up and your Gmail account settings are properly configured.\r\n- Check the logs in the `logs/` directory for detailed error messages.\r\n- For IMAP connection issues, verify that IMAP is enabled in your Gmail settings and that your network allows the connection.\r\n\r\n## Deployment\r\nFor deploying to a production environment:\r\n1. Ensure all sensitive data (like `credentials.json` and `.env`) are properly secured and not exposed in your repository.\r\n2. Consider using environment variables for all sensitive information.\r\n3. If deploying to a cloud service, follow their specific guidelines for Node.js applications.\r\n4. 
Use a process manager like PM2 to keep the application running continuously.\r\n\r\n## Contributing\r\nContributions are welcome! Please feel free to submit a Pull Request.\r\n\r\n## License\r\n[MIT](LICENSE)\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffloressek%2Fgmail-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffloressek%2Fgmail-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffloressek%2Fgmail-extractor/lists"}