https://github.com/gaurav-van/lizmotors_mobility_assignment_2

This is the Repository containing the material required for the 2nd assignment of Lizmotors mobility for the role of AI/ML Engineering Intern.

beautifulsoup data-scraping duckduckgo-api electric-vehicles gemini-api selenium web-scraping

Host: GitHub
URL: https://github.com/gaurav-van/lizmotors_mobility_assignment_2
Owner: Gaurav-Van
Created: 2024-02-12T08:09:59.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-02-16T00:23:35.000Z (over 1 year ago)
Last Synced: 2025-02-02T15:14:08.933Z (5 months ago)
Topics: beautifulsoup, data-scraping, duckduckgo-api, electric-vehicles, gemini-api, selenium, web-scraping
Language: Python
Homepage:
Size: 20.5 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Lizmotors Mobility Assignment_2
This repository contains the material for the 2nd assignment from Lizmotors Mobility for the AI/ML Engineering Intern role.

## Initial Understanding of the Assignment
The assignment is to build a basic RAG (Retrieval-Augmented Generation) or vector search system for an EV company called Canoo. In simple terms, RAG connects a Language Model (LLM) to a datastore: extra data is embedded and stored in a vector database (DB), and the prompt is crafted so that the LLM considers not only the original input but also the most relevant entries retrieved from the vector DB when producing its response.

#### Basic RAG / Vector Search Architecture

```
Data Extraction -> Chunks -> Vector Embeddings -> Vector Database

Retrieval: User Query -> Vector Embeddings -> Vector similarity search in the Vector Database -> Result (summarized, keywords, etc.)

The result of the Retrieval stage is then synthesized with the LLM
```
![image](https://github.com/Gaurav-Van/Lizmotors_mobility_assignment_2/assets/50765800/344a65c4-1cbf-450c-8dda-7d234c38a47c)
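
The retrieval stage described above can be illustrated with a toy sketch. It is not the author's implementation: it assumes a bag-of-words count vector as a stand-in for a learned embedding model and an in-memory list as a stand-in for a vector database. The function names (`embed`, `cosine`, `retrieve`) and the sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Placeholder chunks; in the real pipeline these would come from scraped pages.
chunks = [
    "Canoo is an electric vehicle company.",
    "Bread is baked from flour and water.",
    "Battery range matters to electric car buyers.",
]
top = retrieve("electric vehicle company", chunks, k=1)
print(top)
```

In a full RAG system the retrieved chunks would then be placed into the LLM prompt alongside the user query.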

### Task Given: Extraction Part of RAG Architecture
```
1) Based on 4 queries, find relevant web links for each query using Internet search APIs
2) Scrape relevant data from those web links and store it as CSV files

Whether to produce a single CSV or 4 separate CSVs depends on how similar or dissimilar the data for the 4 queries is.
```

## My Approach
1) Place the 4 queries in a list
![image](https://github.com/Gaurav-Van/Lizmotors_mobility_assignment_2/assets/50765800/9744791f-96ff-4cf2-b447-0b50a7152be3)

2) For each query in the list, extract 10 web links using the duckduckgo API and store them in a text file containing the links and their respective query

3) Read the text file and split the non-link and link parts into separate lists. Topic = the non-link part, i.e. the queries

4) Scrape each link using a combination of Selenium and BeautifulSoup, extracting the text of the p and span tags

5) Use the Gemini API to extract the information relevant to the respective Topic from the scraped text in a clean and clear format. This reduces the data cleaning effort.

6) Store the data in a CSV with the following structure. The Information column holds data in JSON format

| Query / Topic | url | Information |
| ------------- | -- | ----------- |
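
As a rough illustration of this layout, the sketch below writes rows with the three columns and serializes the Information value as a JSON string. It uses only the standard library; the helper name `write_rows`, the sample records, and the choice to write an empty cell when nothing was extracted are assumptions, not taken from the repository.

```python
import csv
import io
import json

FIELDNAMES = ["Query / Topic", "url", "Information"]


def write_rows(handle, records):
    """Write (topic, url, info) records; info is a dict or None."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for topic, url, info in records:
        writer.writerow({
            "Query / Topic": topic,
            "url": url,
            # Assumption: a missing extraction becomes an empty cell.
            "Information": json.dumps(info) if info is not None else "",
        })


# Placeholder data written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_rows(buffer, [
    ("example query", "https://example.com/a", {"summary": "placeholder text"}),
    ("example query", "https://example.com/b", None),
])

buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(json.loads(rows[0]["Information"]))
```

An empty Information cell is typically loaded as NaN by tools such as pandas, so rows without extracted data remain visible rather than being dropped.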

**Note**: _The CSV files contain the information extracted for the respective query from every respective url, so some NaN results are expected_
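
Steps 2 and 3 of the approach can be sketched as follows. The exact layout of the text file is not described here, so this sketch assumes one plausible format: each query on its own line, followed by its links, one per line. The function name `parse_links_file` and the sample lines are hypothetical.

```python
def parse_links_file(lines):
    """Split lines into parallel lists of topics and links.

    Assumed format: a query line, then the links found for it, one per line.
    Any line starting with http:// or https:// is treated as a link;
    every other non-empty line is treated as a query (the Topic).
    """
    topics, links = [], []
    current_topic = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(("http://", "https://")):
            if current_topic is not None:
                topics.append(current_topic)
                links.append(line)
        else:
            current_topic = line
    return topics, links


# Placeholder content standing in for the text file written in step 2.
sample = [
    "first example query",
    "https://example.com/1",
    "https://example.com/2",
    "",
    "second example query",
    "https://example.org/3",
]
topics, links = parse_links_file(sample)
for topic, link in zip(topics, links):
    print(topic, "->", link)
```

Keeping the two lists parallel means each link can later be scraped and stored next to the query it was found for.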

### Dependencies
- duckduckgo API
- csv
- google.api_core.exceptions
- google.generativeai
- genai
- BeautifulSoup
- Selenium
