{"id":25453417,"url":"https://github.com/pranav-kural/ledaa-web-scrapper","last_synced_at":"2026-05-16T06:38:41.473Z","repository":{"id":276864774,"uuid":"930556333","full_name":"pranav-kural/ledaa-web-scrapper","owner":"pranav-kural","description":"Web scrapper to scrap and prepare data for data ingestion in RAG pipeline of LEDAA project.","archived":false,"fork":false,"pushed_at":"2025-02-19T20:24:27.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-28T23:45:40.429Z","etag":null,"topics":["data-ingestion","langchain","ledaa","text-splitter","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pranav-kural.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-10T20:31:50.000Z","updated_at":"2025-02-19T20:24:30.000Z","dependencies_parsed_at":"2025-02-10T21:34:34.706Z","dependency_job_id":"112f41f3-7c07-4fff-882a-4ecd23db601c","html_url":"https://github.com/pranav-kural/ledaa-web-scrapper","commit_stats":null,"previous_names":["pranav-kural/ledaa-web-scrapper"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pranav-kural/ledaa-web-scrapper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pranav-kural%2Fledaa-web-scrapper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pranav-kural%2Fledaa-web-scrapper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/pranav-kural%2Fledaa-web-scrapper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pranav-kural%2Fledaa-web-scrapper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pranav-kural","download_url":"https://codeload.github.com/pranav-kural/ledaa-web-scrapper/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pranav-kural%2Fledaa-web-scrapper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33092867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-ingestion","langchain","ledaa","text-splitter","web-scraping"],"created_at":"2025-02-17T23:54:59.110Z","updated_at":"2026-05-16T06:38:41.447Z","avatar_url":"https://github.com/pranav-kural.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LEDAA Web Scrapper\n\nThis is a web scrapper meant to scrap HTML data from the [FRAGMENT (documentation)](https://fragment.dev/docs) webpages, and prepare and store correctly formatted markdown data to AWS S3.\n\nThe extracted markdown data then can be used as source data for knowledge base in a 
Retrieval Augmented Generation (RAG)-based conversational AI system or application. Large Language Models (LLMs) can easily comprehend markdown formatted data, and use of LLMs for specialized semantic chunking also becomes a possibility with markdown data, further enhancing context retrieval in RAG.\n\nTo learn more, see: [Building AI Assistant for FRAGMENT documentation](https://www.pkural.ca/blog/posts/fragment/)\n\n![ledaa-web-scrapper](https://github.com/user-attachments/assets/835a681a-5737-408a-b945-16e3e40c5ab3)\n\n## Handling Data Updates\n\nTo address the challenge of dealing with obsolete or outdated information, this program also creates and stores **unique hashes** of the primary data in **AWS DynamoDB** for each webpage URL (i.e., the webpage **URL** acts as the `key` and the SHA-256 **hash** generated for the HTML of the primary section on that URL is stored as the `value`). A separate AWS Lambda job runs periodically to scrape HTML data for each URL and compare its hash with the hash stored in DynamoDB. If the hashes differ, the Lambda job initiates the process to scrape data again for that URL, and the data loading process (embedding generation + vector store update) is triggered. Each chunk in the vector store is associated through **metadata** with the URL from which it was extracted. Therefore, when data needs to be updated for a certain URL, only the chunks associated with that URL are replaced.\n\n## Data Extraction\n\nThe data extraction process involves the following steps:\n\n1. **Web Scraping**: The program receives the `URL` of the webpage as an argument and uses `BeautifulSoup` to scrape HTML data from the given URL.\n2. **Primary Section HTML Extraction**: First, we extract the HTML of only the section of the documentation page we are concerned with, i.e., we exclude the header, footer, and other irrelevant sections.\n3. **Content Formatting**: Certain elements are formatted optimally for markdown conversion and format standards. 
Our focus here is mainly on `code` elements. Both inline and block code elements are formatted correctly. Images are also replaced with links to the images, and hyperlinks are formatted correctly.\n4. **Markdown Conversion**: The formatted HTML content is converted to markdown using the `markdownify` library.\n5. **Data Storage**: The markdown data is stored in AWS S3. If a file for the given URL already exists, it is overwritten.\n\nCode for the above steps can be found in the `core.py` file.\n\n## AWS Lambda Deployment\n\nWe deploy the web scrapper function to AWS Lambda using [Terraform](https://www.terraform.io/). The Terraform configuration files can be found in the `terraform` directory. The configuration creates:\n\n-   Appropriate AWS role and policy for the Lambda function.\n-   AWS Lambda Layer for the Lambda function using a pre-built compressed Lambda layer zip file (present in `terraform/packages`, created using `create_lambda_layer.sh`).\n-   Data archive file for the core code (`core.py`).\n-   AWS Lambda function using the data archive file, the Lambda Layer, and the appropriate role.\n-   Lambda function is configured appropriately to access **AWS S3** and **AWS DynamoDB**.\n\nThe `terraform` directory also contains scripts, like `apply.sh` and `plan.sh`, which can be used to apply and plan the Terraform configuration respectively. 
These scripts extract necessary environment variables from the `.env` file and pass them to Terraform.\n\nIdeally, this Lambda function will be triggered by another Lambda function which is responsible for monitoring documentation updates.\n\nSample output from a single invocation:\n\n```bash\nLEDAA Web Scrapper Lambda invoked\nScraping URL: https://fragment.dev/docs/install-the-sdk\nPrimary section content extracted\nPrimary section content processed\nSaving markdown data for https://fragment.dev/docs/install-the-sdk\nFile uploaded to S3: install-the-sdk.md\nHash saved successfully for https://fragment.dev/docs/install-the-sdk\nScraping completed for URL: https://fragment.dev/docs/install-the-sdk\n```\n\n## LICENSE\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpranav-kural%2Fledaa-web-scrapper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpranav-kural%2Fledaa-web-scrapper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpranav-kural%2Fledaa-web-scrapper/lists"}