{"id":29817067,"url":"https://github.com/zurd46/structureddata","last_synced_at":"2026-04-10T16:33:12.030Z","repository":{"id":306215452,"uuid":"1025397245","full_name":"zurd46/StructuredData","owner":"zurd46","description":"A professional Node.js CLI tool for extracting or generating structured data from websites using Puppeteer and LangChain.","archived":false,"fork":false,"pushed_at":"2025-07-24T07:57:55.000Z","size":50,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-24T11:38:49.785Z","etag":null,"topics":["ai","lanchain","node","puppeteer","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zurd46.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-24T07:37:30.000Z","updated_at":"2025-07-24T07:58:11.000Z","dependencies_parsed_at":"2025-07-24T11:38:52.941Z","dependency_job_id":"8a06eef3-1ca8-440c-bbd3-2aa0c7082a49","html_url":"https://github.com/zurd46/StructuredData","commit_stats":null,"previous_names":["zurd46/structureddata"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/zurd46/StructuredData","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zurd46%2FStructuredData","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zurd46%2FStructuredData/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zurd46%2FStructuredData/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zurd46%2FStructuredData/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zurd46","download_url":"https://codeload.github.com/zurd46/StructuredData/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zurd46%2FStructuredData/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267578003,"owners_count":24110351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-28T02:00:09.689Z","response_time":68,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","lanchain","node","puppeteer","typescript"],"created_at":"2025-07-28T20:12:08.368Z","updated_at":"2026-04-10T16:33:11.987Z","avatar_url":"https://github.com/zurd46.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# StructuredData Scraper\n\nA professional Node.js CLI tool for extracting or generating structured data from websites using Puppeteer and LangChain.\n\n## 🎯 Features\n\n- 🔍 **Automatic Detection** of existing structured data (JSON-LD, Microdata, Open Graph, Twitter Cards)\n- 🤖 **AI-powered Generation** of new structured data with LangChain/OpenAI\n- 📄 **Multiple Formats** supported (JSON-LD, Microdata, RDFa)\n- ✅ **Schema.org Validation** of generated data\n- 💾 **JSON Export** with metadata and timestamps\n- 🌐 **Web Scraping** with 
Puppeteer for dynamic content\n- 🎨 **Beautiful CLI** with ASCII logo and colored output\n- ⚡ **TypeScript** for maximum type safety\n\n## 📦 Installation\n\n```bash\n# Clone repository\ngit clone \u003crepository-url\u003e\ncd StructuredData\n\n# Install dependencies\nnpm install\n\n# Build project\nnpm run build\n```\n\n## 🚀 Usage\n\n### CLI with ASCII Logo\nThe tool displays a beautiful ASCII logo on startup:\n\n```bash\n# The extra -- passes --help to the script instead of to npm itself\nnpm run dev -- --help\n```\n\n### Basic Analysis\n```bash\n# Analyze website\nnpm run analyze https://example.com\n\n# With custom output directory\nnpm run analyze https://example.com -- --output ./my-results\n\n# Force regeneration (even if structured data exists)\nnpm run analyze https://example.com -- --force\n```\n\n### With OpenAI API (for better AI generation)\n```bash\n# Set API key in .env file\necho \"OPENAI_API_KEY=sk-your-api-key-here\" \u003e .env\n\n# Or as parameter\nnpm run analyze https://example.com -- --openai-key \"sk-your-api-key\"\n\n# Analyze website with AI support\nnpm run analyze https://example.com\n```\n\n### Validate Structured Data\n```bash\nnpm run validate \"./output/example_com_2025-07-24.json\"\n```\n\n### Development Mode\n```bash\n# Development server with hot reload\nnpm run watch\n\n# Direct execution\nnpm run dev analyze https://example.com\n```\n\n## 📊 Output Format\n\nThe generated JSON files contain:\n\n```json\n{\n  \"metadata\": {\n    \"url\": \"https://example.com\",\n    \"analyzedAt\": \"2025-07-24T10:30:00.000Z\",\n    \"generated\": false,\n    \"structuredDataCount\": 3\n  },\n  \"structuredData\": [\n    {\n      \"type\": \"Organization\",\n      \"data\": {\n        \"@context\": \"https://schema.org\",\n        \"@type\": \"Organization\",\n        \"name\": \"Example Company\",\n        \"url\": \"https://example.com\",\n        \"description\": \"A great company\"\n      },\n      \"format\": \"json-ld\",\n      \"source\": \"script\"\n    }\n  ]\n}\n```\n\n## 🎯 Supported Schema Types\n\n- 
**Organization** - Companies and organizations\n- **Person** - People and profiles\n- **WebSite** - Website information\n- **Service** - Services\n- **Product** - Products\n- **LocalBusiness** - Local businesses\n- **Article/BlogPosting** - Articles and blog posts\n- **ContactPoint** - Contact information\n- **PostalAddress** - Addresses\n- **Event** - Events\n- **Review** - Reviews\n- **FAQPage** - Frequently asked questions\n- and many more Schema.org types\n\n## ⚙️ Configuration\n\n### Environment Variables (.env)\n```bash\n# OpenAI API key for AI-powered generation\nOPENAI_API_KEY=sk-your-openai-api-key\n\n# Enable debug mode\nDEBUG=true\n```\n\n⚠️ **Important Security Note:**\n- The `.env` file contains sensitive API keys\n- It's already included in `.gitignore`\n- Never commit API keys to public repositories\n- Use `.env.example` as a template for other developers\n\n### CLI Options\n```bash\n# Show all available options (the extra -- passes --help to the script instead of to npm)\nnpm run dev -- analyze --help\n\n# Options:\n# -o, --output \u003cpath\u003e     Output directory for JSON files\n# -f, --force            Force regeneration\n# --openai-key \u003ckey\u003e     OpenAI API key\n```\n\n## 🔧 Development\n\n### Project Structure\n```\nsrc/\n├── index.ts              # CLI Entry Point with ASCII logo\n├── scraper.ts            # Main orchestration class\n├── extractors/           # Data extraction modules\n│   └── structured-data-extractor.ts\n├── generators/           # AI generation modules\n│   └── ai-content-generator.ts\n└── utils/               # Helper utilities\n    ├── logger.ts        # Logging with emojis\n    └── schema-validator.ts\n```\n\n### Available Scripts\n```bash\nnpm run build          # Compile TypeScript\nnpm run dev            # Development mode\nnpm run start          # Production execution\nnpm run watch          # Hot-reload development\nnpm run analyze        # Direct analysis\nnpm run validate       # Direct validation\n```\n\n### Dependencies\n- **Puppeteer** - Web scraping and browser 
automation\n- **LangChain** - AI integration and content generation\n- **Commander.js** - CLI framework\n- **Figlet** - ASCII art text for logo\n- **Chalk** - Terminal colors\n- **dotenv** - Environment variables management\n\n## 📝 Examples\n\n### 1. Extract existing structured data\n```bash\nnpm run analyze https://schema.org\nnpm run analyze https://developers.google.com\n```\n\n### 2. Generate new structured data for business website\n```bash\nnpm run analyze https://small-business.com\n```\n\n### 3. Batch processing of multiple URLs\n```bash\nnpm run analyze https://site1.com\nnpm run analyze https://site2.com\nnpm run analyze https://site3.com\n```\n\n### 4. Validate generated data\n```bash\nnpm run validate \"./output/*.json\"\n```\n\n## 🎨 Features in Detail\n\n### ASCII Logo\nThe tool displays an appealing ASCII logo with colored output on startup:\n- Cyan-colored \"StructuredData\" logo\n- Green description\n- Gray separator line\n\n### Intelligent Fallback Logic\n- Tries OpenAI API first for best results\n- Automatically falls back to basic generation\n- Extracts existing data before regeneration\n\n### Comprehensive Data Extraction\n- JSON-LD Scripts\n- Microdata Markup\n- Open Graph Meta Tags\n- Twitter Card Meta Tags\n- Contact information (email, phone)\n- Social media links\n- Website structure (headings)\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a Feature Branch (`git checkout -b feature/AmazingFeature`)\n3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)\n4. Push to the branch (`git push origin feature/AmazingFeature`)\n5. 
Open a Pull Request\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## 👨‍💻 Author\n\n**ZurdAI** - AI Expert for Intelligent Automation\n- Website: [zurdai.com](https://zurdai.com)\n- GitHub: [@zurd46](https://github.com/zurd46)\n- LinkedIn: [zurd46](https://www.linkedin.com/in/zurd46/)\n\n---\n\n*Built with ❤️ and AI-Power for the Swiss Tech Community*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzurd46%2Fstructureddata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzurd46%2Fstructureddata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzurd46%2Fstructureddata/lists"}