https://github.com/r-mahesh45/india_mart_webscraping

I automate web scraping with Selenium, extracting product and seller data from IndiaMART. This data is organized into pandas DataFrames and saved as Excel files. Results are then shared on LinkedIn and GitHub via API calls, streamlining data collection and boosting professional visibility.

Topics: beautifulsoup4, datacleaning, selenium, selenium-python, webscraping


# India_Mart_WebScraping-
This script automates web scraping using Selenium to extract product and seller information from IndiaMART. The scraped data is then structured into a pandas DataFrame and saved as Excel files. Additionally, it can be shared on LinkedIn and GitHub for professional purposes using HTTP requests and API calls.

## Code walkthrough: the script broken into sections, with an explanation of each

---

### **1. Import Libraries**
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import pandas as pd
import time
```

**Purpose:**
These imports provide all the necessary tools for:
- **WebDriver setup:** Initialize and manage Chrome browser automation (`webdriver`, `Service`, `Options`).
- **Element interaction:** Locate and interact with page elements (`By`, `WebDriverWait`, `expected_conditions`).
- **Exception handling:** Manage timeout and missing element errors (`TimeoutException`, `NoSuchElementException`).
- **Data storage and delays:** Store scraped data in a structured format using pandas and introduce delays using `time`.

---

### **2. Define the Scraping Function**
```python
def scrape_and_save_data(url, driver_path, step_size=2):
```

**Purpose:**
This function automates the scraping process and saves the data.
- `url`: The webpage to scrape.
- `driver_path`: Path to ChromeDriver executable for browser control.
- `step_size`: Interval used when saving subsets of the data (every `step_size`-th record).

---

### **3. WebDriver Setup**
```python
chrome_options = Options()
service = Service(driver_path)
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.get(url)
```

**Purpose:**
Sets up the Chrome browser using Selenium.
- `chrome_options`: Custom browser options (e.g., headless mode, disabled extensions); a sketch of adding such options follows this list.
- `driver.get(url)`: Navigates to the specified URL.
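
If the scrape should run without a visible browser window, the options object can be extended before the driver is created. A minimal sketch, not part of the original script (the driver path below is a placeholder), and only sensible when the manual login/"Show More" step is not needed:

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless=new")         # run Chrome without opening a window
chrome_options.add_argument("--disable-extensions")   # skip loading any installed extensions
chrome_options.add_argument("--window-size=1920,1080")

service = Service(r"C:\path\to\chromedriver.exe")      # placeholder path; use your own
driver = webdriver.Chrome(service=service, options=chrome_options)
```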

---

### **4. Wait for Initial Elements to Load**
```python
try:
    WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "cardlinks")))
    WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "span.elps.elps1")))
except TimeoutException:
    print("Error: Elements did not load in time.")
    driver.quit()
    return
```

**Purpose:**
Ensures essential elements (like product names and addresses) are loaded before scraping begins.
- Uses `WebDriverWait` with a 20-second timeout to wait for specific elements (`cardlinks` and `span.elps.elps1`); note that `presence_of_all_elements_located` succeeds as soon as at least one matching element is present.
- Handles `TimeoutException` to exit gracefully if elements don't load.

---

### **5. Initialize Data Storage**
```python
product_names = []
product_links = []
seller_names = []
seller_addresses = []
```

**Purpose:**
Creates empty lists to store scraped data for products, links, sellers, and addresses.

---

### **6. Define the Scraping Logic**
```python
def scrape_data():
    try:
        products = driver.find_elements(By.CLASS_NAME, "cardlinks")
        addresses = driver.find_elements(By.CSS_SELECTOR, "span.elps.elps1")

        for i, product in enumerate(products):
            product_name = product.text.strip()
            product_link = product.get_attribute('href')

            try:
                seller_name = product.find_element(By.XPATH, ".//following-sibling::a").text.strip()
            except NoSuchElementException:
                seller_name = "N/A"

            seller_address = addresses[i].text.strip() if i < len(addresses) else 'N/A'

            product_names.append(product_name)
            product_links.append(product_link)
            seller_names.append(seller_name)
            seller_addresses.append(seller_address)
    except Exception as e:
        print(f"Error during scraping: {e}")
```

**Purpose:**
Extracts product and seller details from the webpage.
- Iterates through products and their related elements (`cardlinks` and `elps`).
- Handles missing seller names using `NoSuchElementException`.
- Appends the data to the respective lists.

---

### **7. Pause for Manual Interaction**
```python
print("Please log in and click the 'Show More' button manually. Press Enter to continue...")
input()
```

**Purpose:**
Allows manual login and interaction, if required, before automated scraping continues.
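
Logging in usually still has to be done by hand, but if clicking the "Show More" button turns out to be automatable, a hedged sketch is shown below. The XPath is only a guess at the button text, not taken from the script, and would need to be checked against the live page:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def click_show_more(driver, timeout=10):
    """Click a 'Show More'-style button if one appears; the locator is an assumption."""
    try:
        button = WebDriverWait(driver, timeout).until(
            EC.element_to_be_clickable((By.XPATH, "//*[contains(text(), 'Show More')]"))
        )
        button.click()
        return True
    except TimeoutException:
        return False
```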

---

### **8. Define Scrolling and Loading Logic**
```python
def scroll_and_load():
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)

        scrape_data()

        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print("No more data to load.")
            break
        last_height = new_height
```

**Purpose:**
Handles dynamic content loading by scrolling the page.
- Scrolls to the bottom and waits for new data to load.
- Breaks the loop when no additional content loads (`new_height == last_height`); an optional safeguard for pages that keep growing is sketched below.
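
As an optional tweak (not in the original script), the same loop can be capped at a maximum number of scroll passes so it cannot run forever if the page keeps growing; this sketch reuses `driver`, `time`, and `scrape_data()` from the function above:

```python
def scroll_and_load(max_passes=50):
    """Same scroll-and-scrape loop, but bounded by max_passes iterations (an optional safeguard)."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_passes):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(3)
        scrape_data()
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            print("No more data to load.")
            break
        last_height = new_height
```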

---

### **9. Save Data**
```python
df = pd.DataFrame({
    'Product Name': product_names,
    'Product Link': product_links,
    'Seller Name': seller_names,
    'Seller Address': seller_addresses
})
df.to_excel('korean_scraped_data.xlsx', index=False)
```

**Purpose:**
Converts the scraped data into a structured pandas DataFrame and saves it as an Excel file.
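
Because `scrape_data()` re-reads every card currently in the DOM on each scroll pass, the lists can accumulate repeated rows; a small optional step (not in the original script) is to drop duplicates before writing. Note also that pandas needs the `openpyxl` package installed to write `.xlsx` files:

```python
# Optional: keep one copy of each listing, since cards may be re-scraped on every scroll pass.
df = df.drop_duplicates(subset=['Product Link']).reset_index(drop=True)

# pandas uses the openpyxl engine for .xlsx output, so openpyxl must be installed.
df.to_excel('korean_scraped_data.xlsx', index=False)
```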

---

### **10. Save Subset of Data**
```python
indices = list(range(0, len(df), step_size))
specific_records = df.iloc[indices].reset_index(drop=True)
nth_records = df.iloc[step_size-1::step_size].reset_index(drop=True)

specific_records.to_excel("korean_products.xlsx", index=False)
nth_records.to_excel("korean_seller.xlsx", index=False)
```

**Purpose:**
Saves two interleaved subsets of the data (every `step_size`-th record, starting from row 0 and from row `step_size - 1`, respectively) to separate Excel files for further analysis.
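
To make the two slices concrete, here is a tiny self-contained example (not part of the script) showing which rows each subset picks up with `step_size=2`:

```python
import pandas as pd

toy = pd.DataFrame({'Product Name': ['A', 'B', 'C', 'D', 'E']})
step_size = 2

specific_records = toy.iloc[list(range(0, len(toy), step_size))].reset_index(drop=True)
nth_records = toy.iloc[step_size - 1::step_size].reset_index(drop=True)

print(specific_records['Product Name'].tolist())  # ['A', 'C', 'E'] -> rows 0, 2, 4
print(nth_records['Product Name'].tolist())       # ['B', 'D']      -> rows 1, 3
```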

---

### **11. Close the Browser**
```python
driver.quit()
```

**Purpose:**
Closes the Chrome browser instance to release resources.

---

### **12. Function Invocation**
```python
scrape_and_save_data(
    url=url_korean,  # Replace with the actual URL
    driver_path=r'C:\Users\data_architect\Downloads\chromedriver-win64 (1)\chromedriver-win64\chromedriver.exe',
    step_size=2
)
```

**Purpose:**
Executes the scraping function with the specified URL, ChromeDriver path, and step size.
- `url_korean`: A variable holding the URL to scrape; it must be assigned before the call (see the sketch below).
- `driver_path`: Path to ChromeDriver executable.
- `step_size`: Controls the interval for saving subsets of records.
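
Since `url_korean` is not defined in the snippet above, it has to be assigned before the call. A minimal sketch with a placeholder URL (the real listing URL is not given in the README):

```python
url_korean = "https://dir.indiamart.com/..."  # placeholder only; substitute the real IndiaMART listing URL

scrape_and_save_data(
    url=url_korean,
    driver_path=r'C:\Users\data_architect\Downloads\chromedriver-win64 (1)\chromedriver-win64\chromedriver.exe',
    step_size=2
)
```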