# AI Agent for Web Search and Information Extraction

## Project Description

This project is an AI agent that reads a dataset (CSV or Google Sheets) and performs a web search for each entity in a chosen column. The agent uses a large language model (LLM) to parse the search results and extract the requested data, such as email addresses, company details, or other specified information. The project also includes a user-friendly dashboard where users can upload files, define search queries, and view or download the extracted results.

## Key Features

- **Upload CSV files** or connect to Google Sheets.
- **Specify search queries** with dynamic placeholders for entity values.
- **Perform web searches** and extract relevant information using LLMs.
- **View extracted information** in a structured format.
- **Download the extracted results** as a CSV.
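
Under the hood, these features amount to a simple per-row loop. The sketch below is purely illustrative, assuming a pandas DataFrame as input; `search` and `extract` stand in for the SerpAPI call and the LLM extraction step and are not the project's actual function names.

```python
from typing import Callable, List

import pandas as pd


def run_extraction(
    df: pd.DataFrame,
    column: str,
    query_template: str,                       # e.g. "Get the email address of {company}"
    search: Callable[[str], List[str]],        # stand-in for a SerpAPI web-search wrapper
    extract: Callable[[str, List[str]], str],  # stand-in for the LLM extraction step
) -> pd.DataFrame:
    """Run one web search and one LLM extraction per entity in the chosen column."""
    rows = []
    for entity in df[column].dropna():
        query = query_template.format(company=entity)
        snippets = search(query)
        rows.append({"entity": entity, "extracted": extract(query, snippets)})
    return pd.DataFrame(rows)
```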

## Setup Instructions

### Prerequisites

Before setting up the project, ensure you have the following installed:

- Python 3.x
- pip (Python package installer)
- A Google Cloud Project with Sheets API enabled and a service account key for authentication (for Google Sheets integration).

### Installing Dependencies

1. Clone the repository:

```bash
git clone https://github.com/yourusername/ai-agent-web-search.git
cd ai-agent-web-search
```

2. Create a virtual environment (optional but recommended):

```bash
python3 -m venv venv
source venv/bin/activate # For macOS/Linux
venv\Scripts\activate # For Windows
```

3. Install the required dependencies:

```bash
pip install -r requirements.txt
```

### Configuring Environment Variables and Service Account

#### Step 1: Create and Configure the `.env` File

1. In the root directory of the project, create a `.env` file.
2. Add the following variables to the `.env` file:

```makefile
SERPAPI_KEY=your_serpapi_key
HUGGINGFACE_API_KEY=your_huggingface_api_key
```

- Replace `your_serpapi_key` with your SerpAPI key for performing web searches.
- Replace `your_huggingface_api_key` with your HuggingFace API key for natural language processing.
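
At startup the application can then read these values from the environment. A minimal sketch, assuming `python-dotenv` is used to load the `.env` file (an assumption, not a confirmed dependency):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # loads variables from the .env file in the project root

SERPAPI_KEY = os.getenv("SERPAPI_KEY")
HUGGINGFACE_API_KEY = os.getenv("HUGGINGFACE_API_KEY")

if not SERPAPI_KEY or not HUGGINGFACE_API_KEY:
    raise RuntimeError("SERPAPI_KEY and HUGGINGFACE_API_KEY must be set in .env")
```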

#### Step 2: Set Up the `config/` Folder

1. Create a `config/` folder in the root directory of your project.
2. Place your **Google Service Account JSON file** inside the `config/` folder. You can create a Google service account by following the instructions [here](https://cloud.google.com/docs/authentication/getting-started).

3. Add the path to the service account JSON file to your `.env` file:

```makefile
GOOGLE_SERVICE_ACCOUNT_JSON=config/gcp_service_account.json
```
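
For reference, this is roughly how the service account file can be used to authenticate against the Sheets API. The sketch assumes the `gspread` client, which is only an assumption about the implementation:

```python
import os

import gspread  # pip install gspread

# Authenticate with the service-account key referenced in .env
service_account_path = os.getenv(
    "GOOGLE_SERVICE_ACCOUNT_JSON", "config/gcp_service_account.json"
)
client = gspread.service_account(filename=service_account_path)
```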

## Usage Guide

### Running the Application

Once you've installed the dependencies and set up the `.env` file and `config/` folder, you can run the application using Streamlit.

1. Start the Streamlit app:

```bash
streamlit run app.py
```

2. The application will launch in your web browser, displaying the dashboard where you can:

- **Upload CSV**: Choose a CSV file with data.
- **Connect Google Sheets**: Provide the link to your Google Sheet.
- **Select Primary Column**: Choose the column from your dataset that contains the entities (e.g., company names).
- **Define a Query**: Enter a custom query, such as "Get the email address of {company}", where `{company}` will be replaced with each entity's name from the dataset.
- **Extract Information**: Click "Run Search" to start the search process and display extracted information.
- **Download Results**: After the search completes, you can download the extracted results as a CSV file (see the sketch after this list).
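
On the Streamlit side, viewing and downloading the results boils down to two calls. A minimal sketch, assuming the results are collected in a pandas DataFrame (the variable names are illustrative, not taken from the project's code):

```python
import pandas as pd
import streamlit as st

results_df = pd.DataFrame({"entity": ["Acme Corp"], "extracted": ["info@acme.example"]})

st.dataframe(results_df)  # show the extracted information in a table
st.download_button(
    label="Download Results",
    data=results_df.to_csv(index=False).encode("utf-8"),
    file_name="extracted_results.csv",
    mime="text/csv",
)
```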

### Google Sheets Integration

To connect Google Sheets:

1. Ensure that your Google Sheet is shared with the link set to "Anyone with the link can view."
2. Paste the link of your Google Sheet into the input field.
3. The app will load data from the sheet, allowing you to select a column and query it for information.
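
Loading a shared sheet can look roughly like this, again assuming `gspread` plus pandas (an assumption about the implementation rather than a description of it):

```python
import gspread
import pandas as pd

client = gspread.service_account(filename="config/gcp_service_account.json")
spreadsheet = client.open_by_url("https://docs.google.com/spreadsheets/d/<sheet-id>")  # the pasted link
worksheet = spreadsheet.sheet1                   # first worksheet in the spreadsheet
df = pd.DataFrame(worksheet.get_all_records())   # first row is treated as the header
```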

## API Keys and Environment Variables

For the application to function properly, you need to configure the following API keys:

- **SerpAPI Key**: This key is used for performing web searches. You can get your key by signing up on SerpAPI.
- **HuggingFace API Key**: This key allows you to use the HuggingFace API for natural language processing. Obtain it from HuggingFace.
- **Google Service Account Key**: You will need a Google service account key to authenticate with the Google Sheets API. Follow the instructions [here](https://cloud.google.com/docs/authentication/getting-started) to create and download the service account JSON key.

Once you have the API keys, add them to your `.env` file as mentioned in the setup instructions.
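
To illustrate how the SerpAPI key is consumed, here is a minimal search call using the official `google-search-results` Python client; the package choice and response handling are assumptions, and the project may wrap this differently:

```python
import os

from serpapi import GoogleSearch  # pip install google-search-results

params = {
    "q": "Get the email address of Acme Corp",  # the formatted query for one entity
    "api_key": os.getenv("SERPAPI_KEY"),
}
results = GoogleSearch(params).get_dict()
snippets = [r.get("snippet", "") for r in results.get("organic_results", [])]
```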

---

## YouTube Video

### **AI Agent for Web Search and Information Extraction - Video Tutorial**

Watch the video below to see a demonstration of how the AI agent works, performing web searches and extracting specific information from datasets.

[![AI Agent for Web Search and Information Extraction](https://img.youtube.com/vi/roYIGPaHNAA/maxresdefault.jpg)](https://www.youtube.com/watch?v=roYIGPaHNAA)

---

### **Video Highlights**:

- **Introduction to the Project**: Overview of the AI agent and its key features.
- **How the Agent Works**: Walkthrough of how it processes CSV/Google Sheets data and performs web searches.
- **Custom Query Handling**: See how users can define custom queries to extract specific information.
- **Results Extraction**: Watch the process of collecting and viewing the extracted data.

---

### YouTube Video Details

- **Title**: AI Agent for Web Search and Information Extraction
- **Description**: In this tutorial, we explain the functionality of the AI agent, how it interacts with datasets, and how it extracts information using web searches and large language models.
- **Duration**: 3:45
- **Published on**: [Date]
- **Link**: [Watch the video here](https://www.youtube.com/watch?v=roYIGPaHNAA)

---

## Folder Structure

```bash
ai-agent-web-search/
├── app.py                        # Main application file
├── requirements.txt              # List of dependencies
├── .env                          # Environment variables file
├── config/                       # Folder for configuration files
│   └── gcp_service_account.json  # Google Service Account Key
└── README.md                     # Project documentation
```