https://github.com/luminati-io/web-scraping-for-machine-learning
Scrape web data for machine learning, set up ETL pipelines, and train models using Python. Includes step-by-step guides and code examples.
data-collection data-for-ai machine-learning selenium web-scraping
- Host: GitHub
- URL: https://github.com/luminati-io/web-scraping-for-machine-learning
- Owner: luminati-io
- Created: 2025-03-03T12:42:36.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-03T12:52:30.000Z (over 1 year ago)
- Last Synced: 2025-09-13T01:41:28.226Z (9 months ago)
- Topics: data-collection, data-for-ai, machine-learning, selenium, web-scraping
- Homepage: https://brightdata.com/blog/web-data/web-scraping-for-machine-learning
- Size: 471 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Web Scraping for Machine Learning
This guide explains how to collect, prepare, and use web-scraped data for machine learning projects, including [ETL](https://brightdata.com/blog/proxy-101/etl-pipeline) setup and model training tips. Before you proceed further, we recommend you get more familiar with Python web scraping.
- [Performing Scraping for Machine Learning](#performing-scraping-for-machine-learning)
- [Using Machine Learning on Scraped Data](#using-machine-learning-on-scraped-data)
- [Notes on Fitting an LSTM Neural Network](#notes-on-fitting-an-lstm-neural-network)
- [Setting Up ETLs When Scraping Data for Machine Learning](#setting-up-etls-when-scraping-data-for-machine-learning)
## What Is Machine Learning?
Machine learning (ML) is a branch of AI that enables systems to learn from data without explicit programming. It applies mathematical models to recognize patterns in data, allowing computers to make predictions based on new inputs.
## Why Web Scraping is Useful for Machine Learning
Machine learning and AI systems rely on data to train models, making web scraping a valuable tool for data professionals. Here is why web scraping is useful for ML:
- **Data collection at scale**: ML models, especially deep learning ones, require vast datasets. Web scraping enables large-scale data gathering.
- **Diverse and rich data sources**: The web provides a wide variety of data, enriching existing datasets for better model training.
- **Up-to-date information**: For models needing the latest trends (e.g., stock predictions, sentiment analysis), web scraping ensures fresh data.
- **Enhancing model performance**: More data generally improves model accuracy and makes validation more reliable, making web scraping a key resource.
- **Market analysis**: Extracting reviews, ratings, and trends aids in consumer sentiment analysis and business insights.
## Guide Prerequisites
To follow this guide, you need the following installed on your system:
- Python 3.6 or newer
- Jupyter Notebook 6.x
- An IDE, such as VS Code
## Performing Scraping for Machine Learning
This step-by-step section explains how to scrape Yahoo Finance to get NVIDIA stock prices for machine learning.
### Step #1: Set up the environment
Create a repository that has the following subfolders: `data`, `notebooks`, and `scripts`.
```
scraping_project/
├── data/
│ └── ...
├── notebooks/
│ └── analysis.ipynb
├── scripts/
│ └── data_retrieval.py
└── venv/
```
In this project:
- `data_retrieval.py` will contain your scraping logic.
- `analysis.ipynb` will contain the machine learning logic.
- `data/` will contain the scraped data to analyze via machine learning.
Create the virtual environment:
```bash
python3 -m venv venv
```
To activate it, on Windows, run:
```powershell
venv\Scripts\activate
```
On macOS/Linux, execute:
```bash
source venv/bin/activate
```
Install the libraries you will need:
```bash
pip install selenium requests pandas matplotlib scikit-learn tensorflow notebook
```
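Before moving on, it can help to confirm that the packages are importable from the active virtual environment. The following is a minimal sketch using only the standard library; `check_modules` and `REQUIRED_MODULES` are hypothetical names introduced here for illustration, not part of the original project.

```python
from importlib.util import find_spec

# Import names for the packages installed above.
# Note that scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "selenium", "requests", "pandas", "matplotlib",
    "sklearn", "tensorflow", "notebook",
]

def check_modules(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: find_spec(name) is not None for name in names}

status = check_modules(REQUIRED_MODULES)
for name, found in status.items():
    print(f"{name:12} {'OK' if found else 'MISSING'}")
```

Any module reported as `MISSING` usually means the virtual environment is not activated or the `pip install` command did not complete.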
### Step #2: Define the target page
To get the NVIDIA historical data, you have to go to the following URL:
```
https://finance.yahoo.com/quote/NVDA/history/
```
The page has filters that define how the data is displayed:

To retrieve enough data for machine learning, you can filter it to a 5-year range. You can use this URL, which includes the filter:
```
https://finance.yahoo.com/quote/NVDA/history/?frequency=1d&period1=1574082848&period2=1731931014
```
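The `period1` and `period2` values in this URL look like Unix timestamps in seconds; decoded, they fall in November 2019 and November 2024. Assuming that reading is correct, a small helper can build the URL for any date range. This is a sketch: `build_history_url` is a hypothetical name, and the meaning of the query parameters is inferred from the URL above rather than from Yahoo Finance documentation.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://finance.yahoo.com/quote/NVDA/history/"

def build_history_url(start, end, frequency="1d"):
    """Build the history URL for a date range (assumes Unix-second timestamps)."""
    params = {
        "frequency": frequency,
        "period1": int(start.timestamp()),
        "period2": int(end.timestamp()),
    }
    return f"{BASE_URL}?{urlencode(params)}"

# Rebuild the URL used in this guide from its two timestamps
start = datetime.fromtimestamp(1574082848, tz=timezone.utc)
end = datetime.fromtimestamp(1731931014, tz=timezone.utc)
url = build_history_url(start, end)
print(start.date(), end.date())
print(url)
```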
Now you have to target the following table and retrieve the data from it:

The CSS selector that defines the table is `.table` so you can write the following code in the `data_retrieval.py` file:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
import os
# Configure Selenium
driver = webdriver.Chrome(service=Service())
# Target URL
url = "https://finance.yahoo.com/quote/NVDA/history/?frequency=1d&period1=1574082848&period2=1731931014"
driver.get(url)
# Wait up to 20 seconds for the table to load
try:
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".table"))
    )
except TimeoutException:
    # WebDriverWait raises TimeoutException when the condition is never met
    print("The table was not found, verify the HTML structure.")
    driver.quit()
    exit()
# Locate the table and extract its rows
table = driver.find_element(By.CSS_SELECTOR, ".table")
rows = table.find_elements(By.TAG_NAME, "tr")
```
The above code snippet does the following:
- Sets up a Selenium Chrome driver instance
- Defines the target URL and instructs Selenium to visit it
- Waits for the table to load: the target table is rendered by JavaScript, so the web driver waits up to 20 seconds for it to appear in the page
- Locates the table with its CSS selector and collects its rows
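Conceptually, an explicit wait is a polling loop: it checks a condition repeatedly and gives up once a deadline passes. The sketch below illustrates that idea with the standard library only. It is not Selenium's implementation, and `wait_until`, `poll_interval`, and `ready_after_three` are hypothetical names used for illustration.

```python
import time

def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Simulate an element that only becomes available on the third check
calls = {"n": 0}

def ready_after_three():
    calls["n"] += 1
    return "table" if calls["n"] >= 3 else None

result = wait_until(ready_after_three, timeout=2.0, poll_interval=0.01)
print(result, "found after", calls["n"], "checks")
```

The key point is that the wait returns as soon as the condition is met, so a 20-second timeout is an upper bound rather than a fixed delay.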
### Step #3: Retrieve the data from the table
Now you need to extract the headers from the table, retrieve all the data from the table, and convert the data into a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).
You can do this with the following code:
```python
# Extract headers from the first row of the table
headers = [header.text for header in rows[0].find_elements(By.TAG_NAME, "th")]
# Extract data from the subsequent rows
data = []
for row in rows[1:]:
    cols = [col.text for col in row.find_elements(By.TAG_NAME, "td")]
    if cols:
        data.append(cols)
# Convert data into a pandas DataFrame
df = pd.DataFrame(data, columns=headers)
```
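The same header and row logic can be tried offline, without a browser, on a small hard-coded table. The sketch below uses Python's built-in `html.parser` instead of Selenium; the `TableParser` class is a hypothetical helper and the HTML snippet contains made-up values, so treat it as an illustration of the extraction logic rather than a replacement for the scraper.

```python
from html.parser import HTMLParser

class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Made-up sample table, not real Yahoo Finance data
SAMPLE_HTML = """
<table class="table">
  <tr><th>Date</th><th>Open</th><th>Adj Close</th><th>Volume</th></tr>
  <tr><td>Jan 3, 2024</td><td>10.00</td><td>10.50</td><td>1,000,000</td></tr>
  <tr><td>Jan 2, 2024</td><td>9.50</td><td>10.00</td><td>900,000</td></tr>
  <tr></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE_HTML)
headers = parser.rows[0]                        # first row holds the headers
data = [row for row in parser.rows[1:] if row]  # skip rows without cells
print(headers)
print(data)
```

As in the Selenium version, empty rows are skipped and every value is still a string at this point, including numbers with thousands separators.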
### Step #4: Save the CSV file into the `data/` folder
The CSV file that the script generates has to be saved into the `data/` folder. Here is the code for that:
```python
# Determine the path to save the CSV file
current_dir = os.path.dirname(os.path.abspath(__file__))
# Navigate to the "data/" directory
data_dir = os.path.join(current_dir, "../data")
# Ensure the directory exists
os.makedirs(data_dir, exist_ok=True)
# Full path to the CSV file
csv_path = os.path.join(data_dir, "nvda_stock_data.csv")
# Save the DataFrame to the CSV file
df.to_csv(csv_path, index=False)
print(f"Historical stock data saved to {csv_path}")
# Close the WebDriver
driver.quit()
```
This code determines the absolute path of the script's folder with `os.path.abspath()` and `os.path.dirname()`, builds the path to the `data/` folder with `os.path.join()`, ensures the folder exists with `os.makedirs(data_dir, exist_ok=True)`, saves the data to a CSV file with the `df.to_csv()` method from the pandas library, and finally quits the driver.
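As an alternative, the same path logic can be written with `pathlib`. This is a sketch under the assumption that the script lives in `scripts/` next to `data/`, as in the project layout above; `csv_path_for` is a hypothetical helper, and the demo runs in a temporary folder so it does not touch a real project.

```python
import tempfile
from pathlib import Path

def csv_path_for(script_file, filename="nvda_stock_data.csv"):
    """Return <project>/data/<filename> for a script stored in <project>/scripts/."""
    data_dir = Path(script_file).resolve().parent.parent / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # same role as os.makedirs(..., exist_ok=True)
    return data_dir / filename

# Demonstrate with a throwaway project layout
with tempfile.TemporaryDirectory() as project_root:
    script = Path(project_root) / "scripts" / "data_retrieval.py"
    script.parent.mkdir()
    script.touch()
    csv_path = csv_path_for(script)
    data_dir_created = csv_path.parent.is_dir()
    relative = csv_path.relative_to(Path(project_root).resolve())
    print(relative)
```

In the real script, `csv_path_for(__file__)` would take the place of the `os.path` calls.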
### Step #5: Putting it all together
Here is the complete code for the `data_retrieval.py` file:
```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
import pandas as pd
import os
# Configure Selenium
driver = webdriver.Chrome(service=Service())
# Target URL
url = "https://finance.yahoo.com/quote/NVDA/history/?frequency=1d&period1=1574082848&period2=1731931014"
driver.get(url)
# Wait up to 20 seconds for the table to load
try:
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".table"))
    )
except TimeoutException:
    # WebDriverWait raises TimeoutException when the condition is never met
    print("The table was not found, verify the HTML structure.")
    driver.quit()
    exit()
# Locate the table and extract its rows
table = driver.find_element(By.CSS_SELECTOR, ".table")
rows = table.find_elements(By.TAG_NAME, "tr")
# Extract headers from the first row of the table
headers = [header.text for header in rows[0].find_elements(By.TAG_NAME, "th")]
# Extract data from the subsequent rows
data = []
for row in rows[1:]:
    cols = [col.text for col in row.find_elements(By.TAG_NAME, "td")]
    if cols:
        data.append(cols)
# Convert data into a pandas DataFrame
df = pd.DataFrame(data, columns=headers)
# Determine the path to save the CSV file
current_dir = os.path.dirname(os.path.abspath(__file__))
# Navigate to the "data/" directory
data_dir = os.path.join(current_dir, "../data")
# Ensure the directory exists
os.makedirs(data_dir, exist_ok=True)
# Full path to the CSV file
csv_path = os.path.join(data_dir, "nvda_stock_data.csv")
# Save the DataFrame to the CSV file
df.to_csv(csv_path, index=False)
print(f"Historical stock data saved to {csv_path}")
# Close the WebDriver
driver.quit()
```
On Windows, launch the above script from the `scripts/` folder with:
```powershell
python data_retrieval.py
```
On Linux/macOS:
```bash
python3 data_retrieval.py
```
Here is how the output scraped data appears:

## Using Machine Learning on Scraped Data
Let's use the data in the CSV file to train a machine learning model and make predictions.
### Step #1: Create a new Jupyter Notebook file
Navigate to the `notebooks/` folder from the main one:
```bash
cd notebooks
```
Open a Jupyter Notebook:
```bash
jupyter notebook
```
When the browser is open, click on **New > Python3 (ipykernel)** to create a new Jupyter Notebook file:

Rename the file to `analysis.ipynb`.
### Step #2: Open the CSV file and show the head
Now you can open the CSV file containing the data and show the head of the data frame:
```python
import pandas as pd
# Path to the CSV file
csv_path = "../data/nvda_stock_data.csv"
# Open the CSV file
df = pd.read_csv(csv_path)
# Show head
df.head()
```
This code goes to the `data/` folder with `csv_path = "../data/nvda_stock_data.csv"`. Then, it opens the CSV with the method [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) as a data frame and shows its head (the first 5 rows) with the method [`df.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).
This is the expected result:

### Step #3: Visualize the trend over time of the `Adj Close` value
Now that the data frame is correctly loaded, you can visualize the trend of the `Adj Close` value, which represents the adjusted closing value:
```python
import matplotlib.pyplot as plt
# Ensure the "Date" column is in datetime format
df["Date"] = pd.to_datetime(df["Date"])
# Sort the data by date (if not already sorted)
df = df.sort_values(by="Date")
# Plot the "Adj Close" values over time
plt.figure(figsize=(10, 6))
plt.plot(df["Date"], df["Adj Close"], label="Adj Close", linewidth=2)
# Customize the plot
plt.title("NVDA Stock Adjusted Close Prices Over Time", fontsize=16) # Sets title
plt.xlabel("Date", fontsize=12) # Sets x-axis label
plt.ylabel("Adjusted Close Price (USD)", fontsize=12) # Sets y-axis label
plt.grid(True, linestyle="--", alpha=0.6) # Defines styles of the line
plt.legend(fontsize=12) # Shows legend
plt.tight_layout()
# Show the plot
plt.show()
```
The above code does the following:
- `df["Date"]` accesses the `Date` column of the data frame and, with the function `pd.to_datetime()`, ensures that the dates are in datetime format
- `df.sort_values()` sorts the rows by the `Date` column. This ensures the data will be displayed in chronological order.
- `plt.figure()` sets the dimensions of the plot and `plt.plot()` draws the line
- The lines of code under the `# Customize the plot` comment customize the plot by setting the title and the axis labels, and by displaying the legend
- `plt.show()` is the call that actually displays the plot
The expected result is something like this:

This plot shows the actual trend of the adjusted closing values of NVIDIA stock over time. The machine learning model you will train has to predict them as accurately as it can.
### Step #4: Preparing data for machine learning
Let's clean up and prepare the data:
```python
from sklearn.preprocessing import MinMaxScaler
# Convert data types
df["Volume"] = pd.to_numeric(df["Volume"].str.replace(",", ""), errors="coerce")
df["Open"] = pd.to_numeric(df["Open"].str.replace(",", ""), errors="coerce")
# Handle missing values
df = df.infer_objects().interpolate()
# Select the target variable ("Adj Close") and scale the data
scaler = MinMaxScaler(feature_range=(0, 1)) # Scale data between 0 and 1
data = scaler.fit_transform(df[["Adj Close"]])
```
The above code does the following:
- Converts the `Volume` and `Open` values to numbers with the function `to_numeric()`, after removing the thousands separators
- Handles missing values by using interpolation to fill them with the method `interpolate()`
- Creates a `MinMaxScaler()` that scales values to the range between 0 and 1
- Selects the target variable `Adj Close` and scales it with the method `fit_transform()`
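To make the scaling step concrete, here is a minimal pure-Python sketch of min-max scaling to the range 0 to 1 and of its inverse. It mirrors the formula `(x - min) / (max - min)` that this kind of scaler applies; the helper names and the prices are made up for illustration.

```python
def fit_min_max(values):
    """Return the (min, max) pair needed to scale and later un-scale the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("Cannot scale a constant series")
    return lo, hi

def scale(values, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]

def inverse_scale(scaled, lo, hi):
    """Map values from [0, 1] back to the original range."""
    return [s * (hi - lo) + lo for s in scaled]

prices = [10.0, 12.5, 15.0, 20.0]  # made-up prices
lo, hi = fit_min_max(prices)
scaled = scale(prices, lo, hi)
restored = inverse_scale(scaled, lo, hi)
print(scaled)
print(restored)
```

Keeping the fitted minimum and maximum around matters because the same values are needed later to convert predictions back to prices.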
### Step #5: Create the train and test sets
The model used for this tutorial is an LSTM ([Long Short-Term Memory](https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/)), which is a type of RNN ([Recurrent Neural Network](https://www.ibm.com/topics/recurrent-neural-networks)). An LSTM learns from ordered sequences, so you need to split the data into fixed-length sequences of time steps:
```python
import numpy as np
# Create sequences of 60 time steps for prediction
sequence_length = 60
X, y = [], []
for i in range(sequence_length, len(data)):
    X.append(data[i - sequence_length:i, 0]) # Last 60 days
    y.append(data[i, 0]) # Target value
X, y = np.array(X), np.array(y)
# Split into training and test sets
split_index = int(len(X) * 0.8) # 80% training, 20% testing
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
```
The above code snippet:
- Creates sequences of 60 time steps. `X` is the array of features (the previous 60 values), and `y` is the array of target values (the value that follows each sequence).
- Splits the sequences chronologically: the first 80% becomes the train set, and the last 20% becomes the test set.
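The windowing logic is easier to follow on a tiny example. The sketch below applies the same loop and the same 80/20 chronological split to a 10-value toy series with a sequence length of 3 instead of 60; `make_windows` is a hypothetical helper name and the series is made up.

```python
def make_windows(series, sequence_length):
    """Pair each run of `sequence_length` values with the value that follows it."""
    X, y = [], []
    for i in range(sequence_length, len(series)):
        X.append(series[i - sequence_length:i])  # previous values (features)
        y.append(series[i])                      # next value (target)
    return X, y

series = list(range(10))  # toy stand-in for the scaled prices
X, y = make_windows(series, sequence_length=3)

# Chronological split: no shuffling, the most recent windows form the test set
split_index = int(len(X) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

print(len(X), "windows,", len(X_train), "train,", len(X_test), "test")
print("first window:", X[0], "->", y[0])
print("test windows:", X_test, "->", y_test)
```

Because the split is not shuffled, the test set always contains the latest part of the series, which is what you want when evaluating a forecast.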
### Step #6: Train the model
Let's train the RNN on the train set:
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
# Reshape X for LSTM [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
# Build the Sequential Neural Network
model = Sequential()
model.add(LSTM(32, activation="relu", return_sequences=False))
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer="adam")
# Train the Model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), verbose=1)
```
This code does the following:
- Reshapes the array of features into the three-dimensional shape the LSTM neural network expects by using the method `reshape()`, both for the train and test sets
- Builds the LSTM neural network by setting its parameters
- Fits the LSTM to the train set by using the method `fit()`
In other words, the model has now been fitted to the train set and is ready to make predictions.
### Step #7: Make predictions and evaluate the model performance
Let's evaluate the model's performance:
```python
from sklearn.metrics import mean_squared_error, r2_score
# Make Predictions
y_pred = model.predict(X_test)
# Inverse scale predictions and actual values for comparison
y_test = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred = scaler.inverse_transform(y_pred)
# Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# print results
print("\nLSTM Neural Network Results:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
```
This code does the following:
- Reverses the min-max scaling so that the predictions and the actual values are expressed in the original price scale (USD) again. This is done with the method `inverse_transform()`.
- Evaluates the model by using the [mean squared error](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html) and the [R^2 score](https://scikit-learn.org/dev/modules/generated/sklearn.metrics.r2_score.html).
Results can vary slightly between runs due to the stochastic nature of ML model training. Here is the expected result:

These results indicate that the model predicts the `Adj Close` value well.
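To see what the two metrics measure, here is a hand-rolled sketch of both formulas in plain Python: MSE is the average squared error, and R² is one minus the ratio between the residual sum of squares and the total sum of squares. The function names and the numbers are made up for illustration.

```python
def mse_by_hand(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_by_hand(y_true, y_pred):
    """1 - SS_res / SS_tot; 1.0 is a perfect fit, 0.0 matches predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]      # made-up actual values
predicted = [2.5, 5.5, 7.0, 8.0]   # made-up predictions

mse = mse_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(f"Mean Squared Error: {mse:.3f}")
print(f"R-squared Score: {r2:.3f}")
```

Note that MSE is expressed in squared units (squared USD here), so taking its square root gives an error in the same unit as the prices.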
### Step #8: Compare actual vs predicted values with a plot
Numeric metrics alone aren't always sufficient to judge a model. Let's create a plot that compares the actual `Adj Close` values with the ones predicted by the LSTM model:
```python
# Visualize the Results
test_results = pd.DataFrame({
    "Date": df["Date"].iloc[len(df) - len(y_test):], # Test set dates
    "Actual": y_test.flatten(),
    "Predicted": y_pred.flatten()
})
# Setting plot
plt.figure(figsize=(12, 6))
plt.plot(test_results["Date"], test_results["Actual"], label="Actual Adjusted Close", color="blue", linewidth=2)
plt.plot(test_results["Date"], test_results["Predicted"], label="Predicted Adjusted Close", color="orange", linestyle="--", linewidth=2)
plt.title("Actual vs Predicted Adjusted Close Prices (LSTM)", fontsize=16)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Adjusted Close Price (USD)", fontsize=12)
plt.legend()
plt.grid(alpha=0.6)
plt.tight_layout()
plt.show()
```
This code:
- Compares the actual and predicted values on the test set only, so the dates have to be trimmed to the length of the test set. This is done with the `iloc[]` indexer, while `flatten()` turns the two-dimensional arrays into one-dimensional ones.
- Creates the plot, adds labels to the axes, and the title, and manages other settings to improve the visualization.
The expected result is something like this:

As the plot illustrates, the LSTM neural network's predicted values (dashed orange line) closely match the actual values (solid blue line). While the analytical results were promising, the visualization further confirms their accuracy.
### Step #9: Putting it all together
Here is the complete code for the `analysis.ipynb` notebook:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, r2_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
# Path to the CSV file
csv_path = "../data/nvda_stock_data.csv"
# Open CSV as data frame
df = pd.read_csv(csv_path)
# Convert "Date" to datetime format
df["Date"] = pd.to_datetime(df["Date"])
# Sort by date
df = df.sort_values(by="Date")
# Convert data types
df["Volume"] = pd.to_numeric(df["Volume"].str.replace(",", ""), errors="coerce")
df["Open"] = pd.to_numeric(df["Open"].str.replace(",", ""), errors="coerce")
# Handle missing values
df = df.infer_objects().interpolate()
# Select the target variable ("Adj Close") and scale the data
scaler = MinMaxScaler(feature_range=(0, 1)) # Scale data between 0 and 1
data = scaler.fit_transform(df[["Adj Close"]])
# Prepare the Data for LSTM
# Create sequences of 60 time steps for prediction
sequence_length = 60
X, y = [], []
for i in range(sequence_length, len(data)):
    X.append(data[i - sequence_length:i, 0]) # Last 60 days
    y.append(data[i, 0]) # Target value
X, y = np.array(X), np.array(y)
# Split into training and test sets
split_index = int(len(X) * 0.8) # 80% training, 20% testing
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]
# Reshape X for LSTM [samples, time steps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
# Build the Sequential Neural Network
model = Sequential()
model.add(LSTM(32, activation="relu", return_sequences=False))
model.add(Dense(1))
model.compile(loss="mean_squared_error", optimizer="adam")
# Train the Model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), verbose=1)
# Make Predictions
y_pred = model.predict(X_test)
# Inverse scale predictions and actual values for comparison
y_test = scaler.inverse_transform(y_test.reshape(-1, 1))
y_pred = scaler.inverse_transform(y_pred)
# Evaluate the Model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print results
print("\nLSTM Neural Network Results:")
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.2f}")
# Visualize the Results
test_results = pd.DataFrame({
    "Date": df["Date"].iloc[len(df) - len(y_test):], # Test set dates
    "Actual": y_test.flatten(),
    "Predicted": y_pred.flatten()
})
# Setting plot
plt.figure(figsize=(12, 6))
plt.plot(test_results["Date"], test_results["Actual"], label="Actual Adjusted Close", color="blue", linewidth=2)
plt.plot(test_results["Date"], test_results["Predicted"], label="Predicted Adjusted Close", color="orange", linestyle="--", linewidth=2)
plt.title("Actual vs Predicted Adjusted Close Prices (LSTM)", fontsize=16)
plt.xlabel("Date", fontsize=12)
plt.ylabel("Adjusted Close Price (USD)", fontsize=12)
plt.legend()
plt.grid(alpha=0.6)
plt.tight_layout()
plt.show()
```
This code goes straight to the goal, skipping the initial data preview and the standalone `Adj Close` plot, as these steps were covered earlier for preliminary analysis.
> **Note**:
>
> While the code is shown in parts, it's best to run the full code at once due to ML's stochastic nature; otherwise, the final plot may vary significantly.
## Notes on Fitting an LSTM Neural Network
For simplicity, this guide focuses directly on fitting an LSTM neural network. However, in real-world ML applications, the process involves several key steps:
1. **Preliminary Data Analysis**. This is the most crucial step, where you understand your data, clean NaN values, handle duplicates, and resolve any mathematical inconsistencies.
2. **Training ML Models**. The first model you try may not be the best. A common approach is [spot-checking](https://machinelearningmastery.com/spot-check-machine-learning-algorithms-in-python/), which involves:
   - Training 3-4 ML models on the training set and evaluating their performance.
   - Selecting the top 2-3 models and [tuning their hyperparameters](https://scikit-learn.org/1.5/modules/grid_search.html).
   - Comparing the best-tuned models on the test set.
   - Choosing the highest-performing model.
3. **Deployment**. The best-performing model is then deployed for production use.
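The spot-checking idea in step 2 can be sketched without any ML library: evaluate a few candidate predictors the same way and keep the one with the lowest error. In the toy example below, three simple forecasting rules stand in for real models, and the series is made up; all names are hypothetical and the example only illustrates the comparison loop.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def naive_last_value(history):
    """Predict that the next value equals the last observed one."""
    return history[-1]

def moving_average(history, window=3):
    """Predict the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def linear_drift(history):
    """Predict the last value plus the average step seen so far."""
    return history[-1] + (history[-1] - history[0]) / (len(history) - 1)

def walk_forward_mse(series, predictor, start):
    """Predict each value from `start` onward using only the values before it."""
    preds = [predictor(series[:i]) for i in range(start, len(series))]
    return mse(series[start:], preds)

series = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]  # made-up upward trend
candidates = {
    "naive": naive_last_value,
    "moving_average": moving_average,
    "drift": linear_drift,
}
scores = {name: walk_forward_mse(series, fn, start=5) for name, fn in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("best candidate:", best)
```

With real models, the candidates would be trained estimators and the comparison would be followed by hyperparameter tuning, as the list above describes.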
## Setting Up ETLs When Scraping Data for Machine Learning
Saving web-scraped data as a CSV is a common practice in machine learning, especially at the start of a project when searching for the best predictive model.
Once the best model is identified, an **ETL (Extract, Transform, Load) pipeline** is typically set up to automate data retrieval, cleaning, and storage.
Here is how the ETL process works in ML workflows:
- **Extract**: Retrieve data from various sources, including web scraping.
- **Transform**: Clean and prepare the collected data.
- **Load**: Store the processed data in a database or data warehouse.
Once stored, the data is integrated into the ML workflow to **re-train and re-validate the model** with new data.
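The three stages can be sketched as three small functions. The example below uses only the standard library, with an in-memory SQLite database standing in for the data warehouse and hard-coded rows standing in for the scraper output. The table name, column names, and values are all assumptions made for illustration.

```python
import sqlite3

def extract():
    """Stand-in for the scraping step: rows arrive as text (made-up values)."""
    return [
        {"Date": "2024-01-02", "Adj Close": "10.00", "Volume": "900,000"},
        {"Date": "2024-01-03", "Adj Close": "10.50", "Volume": "1,000,000"},
        {"Date": "2024-01-04", "Adj Close": "-", "Volume": "-"},  # unparsable row
    ]

def transform(raw_rows):
    """Convert text to numbers and drop rows that cannot be parsed."""
    clean = []
    for row in raw_rows:
        try:
            adj_close = float(row["Adj Close"].replace(",", ""))
            volume = int(row["Volume"].replace(",", ""))
        except ValueError:
            continue
        clean.append((row["Date"], adj_close, volume))
    return clean

def load(rows, conn):
    """Store the rows; re-running the pipeline overwrites rows with the same date."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prices "
        "(date TEXT PRIMARY KEY, adj_close REAL, volume INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO prices VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
stored = conn.execute(
    "SELECT date, adj_close, volume FROM prices ORDER BY date"
).fetchall()
print(stored)
```

Using the date as the primary key keeps the load step idempotent, so the pipeline can run on a schedule without duplicating rows.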
## Conclusion
Need data for machine learning but not familiar with web scraping? [Check out our solutions for efficient data retrieval](https://brightdata.com/use-cases/data-for-ai).
Sign up for a free Bright Data account to try our scraper APIs or explore our datasets.