{"id":26587503,"url":"https://github.com/luminati-io/web-scraping-for-machine-learning","last_synced_at":"2026-04-13T20:32:07.842Z","repository":{"id":283784049,"uuid":"942016419","full_name":"luminati-io/web-scraping-for-machine-learning","owner":"luminati-io","description":"Scrape web data for machine learning, set up ETL pipelines, and train models using Python. Includes step-by-step guides and code examples.","archived":false,"fork":false,"pushed_at":"2025-03-03T12:52:30.000Z","size":482,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-13T01:41:28.226Z","etag":null,"topics":["data-collection","data-for-ai","machine-learning","selenium","web-scraping"],"latest_commit_sha":null,"homepage":"https://brightdata.com/blog/web-data/web-scraping-for-machine-learning","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/luminati-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-03T12:42:36.000Z","updated_at":"2025-03-03T13:41:38.000Z","dependencies_parsed_at":"2025-03-22T07:02:08.557Z","dependency_job_id":"df3b5650-6bcf-4af3-a5e6-dda4b692cf8a","html_url":"https://github.com/luminati-io/web-scraping-for-machine-learning","commit_stats":null,"previous_names":["luminati-io/web-scraping-for-machine-learning"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/luminati-io/web-scraping-for-machine-learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fweb-scraping-for-machine-le
arning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fweb-scraping-for-machine-learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fweb-scraping-for-machine-learning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fweb-scraping-for-machine-learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/luminati-io","download_url":"https://codeload.github.com/luminati-io/web-scraping-for-machine-learning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/luminati-io%2Fweb-scraping-for-machine-learning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31770718,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T20:17:16.280Z","status":"ssl_error","status_checked_at":"2026-04-13T20:17:08.216Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-collection","data-for-ai","machine-learning","selenium","web-scraping"],"created_at":"2025-03-23T12:20:09.769Z","updated_at":"2026-04-13T20:32:07.812Z","avatar_url":"https://github.com/luminati-io.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping for Machine 
Learning\n\n[![Promo](https://github.com/luminati-io/LinkedIn-Scraper/raw/main/Proxies%20and%20scrapers%20GitHub%20bonus%20banner.png)](https://brightdata.com/) \n\nThis guide explains how to collect, prepare, and use web-scraped data for machine learning projects, including [ETL](https://brightdata.com/blog/proxy-101/etl-pipeline) setup and model training tips. Before you proceed further, we recommend you get more familiar with Python web scraping. \n\n- [Performing Scraping for Machine Learning](#performing-scraping-for-machine-learning)\n- [Using Machine Learning on Scraped Data](#using-machine-learning-on-scraped-data)\n- [Notes on Fitting an LSTM Neural Network](#notes-on-fitting-an-lstm-neural-network)\n- [Setting Up ETLs When Scraping Data for Machine Learning](#setting-up-etls-when-scraping-data-for-machine-learning)\n\n## What Is Machine Learning?\n\nMachine learning (ML) is a branch of AI that enables systems to learn from data without explicit programming. It applies mathematical models to recognize patterns in data, allowing computers to make predictions based on new inputs.\n\n## Why Web Scraping is Useful for Machine Learning\n\nMachine learning and AI systems rely on data to train models, making web scraping a valuable tool for data professionals. Here is why web scraping is useful for ML:\n\n- **Data collection at scale**: ML models, especially deep learning ones, require vast datasets. 
Web scraping enables large-scale data gathering.\n- **Diverse and rich data sources**: The web provides a wide variety of data, enriching existing datasets for better model training.\n- **Up-to-date information**: For models needing the latest trends (e.g., stock predictions, sentiment analysis), web scraping ensures fresh data.\n- **Enhancing model performance**: More data can improve model accuracy and validation, making web scraping a key resource.\n- **Market analysis**: Extracting reviews, ratings, and trends aids in consumer sentiment analysis and business insights.\n\n## Guide Prerequisites\n\nTo follow the guide, you need the following prerequisites on your system:\n\n- Python 3.6 or newer\n- Jupyter Notebook 6.x\n- An IDE, such as VS Code\n\n## Performing Scraping for Machine Learning\n\nThis step-by-step section explains how to scrape Yahoo Finance to get NVIDIA stock prices for machine learning.\n\n### Step #1: Set up the environment\n\nCreate a repository that has the following subfolders: `data`, `notebooks`, and `scripts`.\n\n```\nscraping_project/\n├── data/\n│   └── ...\n├── notebooks/\n│   └── analysis.ipynb\n├── scripts/\n│   └── data_retrieval.py\n└── venv/\n```\n\nIn this project:\n\n- `data_retrieval.py` will contain your scraping logic.\n- `analysis.ipynb` will contain the machine learning logic.\n- `data/` will contain the scraped data to analyze via machine learning.\n\nCreate the virtual environment:\n\n```bash\npython3 -m venv venv \n```\n\nTo activate it, on Windows, run:\n\n```powershell\nvenv\\Scripts\\activate\n```\n\nOn macOS/Linux, execute:\n\n```bash\nsource venv/bin/activate\n```\n\nInstall the libraries you will need:\n\n```bash\npip install selenium requests pandas matplotlib scikit-learn tensorflow notebook\n```\n\n### Step #2: Define the target page\n\nTo get the NVIDIA historical data, you have to go to the following URL:\n\n```\nhttps://finance.yahoo.com/quote/NVDA/history/\n```\n\nThe page has filters to define how you 
want the data to be displayed:\n\n![filters that allow you to define how you want the data to be displayed](https://brightdata.com/wp-content/uploads/2024/11/image-53-1024x91.png)\n\nTo retrieve enough data for machine learning, set the filter to a 5-year range. The following URL already includes that filter:\n\n```\nhttps://finance.yahoo.com/quote/NVDA/history/?frequency=1d\u0026period1=1574082848\u0026period2=1731931014\n```\n\nNow you have to target the following table and retrieve the data from it:\n\n![Table with daily financial data like open and close price, low, high, and more](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-54.png)\n\nThe CSS selector that identifies the table is `.table`, so you can write the following code in the `data_retrieval.py` file:\n\n```python\nfrom selenium import webdriver\nfrom selenium.webdriver.chrome.service import Service\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.support import expected_conditions as EC\nfrom selenium.webdriver.common.by import By\nfrom selenium.common.exceptions import TimeoutException\nimport pandas as pd\nimport os\n\n# Configure Selenium\ndriver = webdriver.Chrome(service=Service())\n\n# Target URL\nurl = \"https://finance.yahoo.com/quote/NVDA/history/?frequency=1d\u0026period1=1574082848\u0026period2=1731931014\"\ndriver.get(url)\n\n# Wait up to 20 seconds for the table to load\n# (WebDriverWait raises TimeoutException if the element never appears)\ntry:\n    WebDriverWait(driver, 20).until(\n        EC.presence_of_element_located((By.CSS_SELECTOR, \".table\"))\n    )\nexcept TimeoutException:\n    print(\"The table was not found, verify the HTML structure.\")\n    driver.quit()\n    exit()\n\n# Locate the table and extract its rows\ntable = driver.find_element(By.CSS_SELECTOR, \".table\")\nrows = table.find_elements(By.TAG_NAME, \"tr\")\n```\n\nThe above code snippet does the following:\n\n- Sets up a Selenium Chrome driver instance\n- Defines the target URL and instructs Selenium to visit it\n- Waits for the table 
to be loaded: the target table is rendered by JavaScript, so the web driver waits up to 20 seconds for it to appear\n- Locates the table with its CSS selector and collects its rows\n\n### Step #3: Retrieve the data and save them into a CSV file\n\nNow you need to extract the headers from the table, retrieve all the data from the table, and convert the data into a [pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).\n\nYou can do this with the following code:\n\n```python\n# Extract headers from the first row of the table\nheaders = [header.text for header in rows[0].find_elements(By.TAG_NAME, \"th\")]\n\n# Extract data from the subsequent rows\ndata = []\nfor row in rows[1:]:\n    cols = [col.text for col in row.find_elements(By.TAG_NAME, \"td\")]\n    if cols:\n        data.append(cols)\n\n# Convert data into a pandas DataFrame\ndf = pd.DataFrame(data, columns=headers)\n```\n\n### Step #4: Save the CSV file into the `data/` folder\n\nThe CSV file that the script generates has to be saved into the `data/` folder. 
Here is the code for that:\n\n```python\n# Determine the path to save the CSV file\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))  \n\n# Navigate to the \"data/\" directory\ndata_dir = os.path.join(current_dir, \"../data\") \n\n# Ensure the directory exists \nos.makedirs(data_dir, exist_ok=True)  \n\n# Full path to the CSV file\ncsv_path = os.path.join(data_dir, \"nvda_stock_data.csv\")  \n\n# Save the DataFrame to the CSV file\ndf.to_csv(csv_path, index=False)\nprint(f\"Historical stock data saved to {csv_path}\")\n\n# Close the WebDriver\ndriver.quit()\n```\n\nThis code determines the absolute path of the script's directory with `os.path.abspath()` and `os.path.dirname()`, builds the path to the `data/` folder with `os.path.join()`, ensures the folder exists with `os.makedirs(data_dir, exist_ok=True)`, saves the data to a CSV file with the pandas method `df.to_csv()`, and finally quits the driver.\n\n### Step #5: Putting it all together\n\nHere is the complete code for the `data_retrieval.py` file:\n\n```python\nfrom selenium import webdriver\nfrom selenium.webdriver.chrome.service import Service\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.support import expected_conditions as EC\nfrom selenium.webdriver.common.by import By\nfrom selenium.common.exceptions import TimeoutException\nimport pandas as pd\nimport os\n\n# Configure Selenium\ndriver = webdriver.Chrome(service=Service())\n\n# Target URL\nurl = \"https://finance.yahoo.com/quote/NVDA/history/?frequency=1d\u0026period1=1574082848\u0026period2=1731931014\"\ndriver.get(url)\n\n# Wait up to 20 seconds for the table to load\n# (WebDriverWait raises TimeoutException if the element never appears)\ntry:\n    WebDriverWait(driver, 20).until(\n        EC.presence_of_element_located((By.CSS_SELECTOR, \".table\"))\n    )\nexcept TimeoutException:\n    print(\"The table was not found, verify the HTML structure.\")\n    driver.quit()\n    exit()\n\n# Locate the table and extract its rows\ntable = driver.find_element(By.CSS_SELECTOR, 
\".table\")\nrows = table.find_elements(By.TAG_NAME, \"tr\")\n\n# Extract headers from the first row of the table\nheaders = [header.text for header in rows[0].find_elements(By.TAG_NAME, \"th\")]\n\n# Extract data from the subsequent rows\ndata = []\nfor row in rows[1:]:\n    cols = [col.text for col in row.find_elements(By.TAG_NAME, \"td\")]\n    if cols:\n        data.append(cols)\n\n# Convert data into a pandas DataFrame\ndf = pd.DataFrame(data, columns=headers)\n\n# Determine the path to save the CSV file\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))  \n\n# Navigate to the \"data/\" directory\ndata_dir = os.path.join(current_dir, \"../data\") \n\n# Ensure the directory exists \nos.makedirs(data_dir, exist_ok=True)\n\n# Full path to the CSV file  \ncsv_path = os.path.join(data_dir, \"nvda_stock_data.csv\")\n\n# Save the DataFrame to the CSV file\ndf.to_csv(csv_path, index=False)\nprint(f\"Historical stock data saved to {csv_path}\")\n\n# Close the WebDriver\ndriver.quit()\n```\n\nOn Windows, launch the above script with:\n\n```powershell\npython data_retrieval.py\n```\n\nOn Linux/macOS:\n\n```bash\npython3 data_retrieval.py\n```\n\nHere is how the output scraped data appears:\n\n![The output of the scraped table](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-55.png)\n\n## Using Machine Learning on Scraped Data\n\nLet's use the data in the CSV file in machine learning to make predictions.\n\n### Step #1: Create a new Jupyter Notebook file\n\nNavigate to the `notebooks/` folder from the main one:\n\n```bash\ncd notebooks \n```\n\nOpen a Jupyter Notebook:\n\n```bash\njupyter notebook\n```\n\nWhen the browser is open, click on **New \u003e Python3 (ipykernel)** to create a new Jupyter Notebook file:\n\n![Creating a new Jupyter Notebook file](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-56.png)\n\nRename the file to `analysis.ipynb`.\n\n### Step #2: Open the CSV file and 
show the head\n\nNow you can open the CSV file containing the data and show the head of the data frame:\n\n```python\nimport pandas as pd\n\n# Path to the CSV file\ncsv_path = \"../data/nvda_stock_data.csv\"\n\n# Open the CSV file\ndf = pd.read_csv(csv_path)\n\n# Show head\ndf.head()\n```\n\nThis code points to the CSV file in the `data/` folder with `csv_path = \"../data/nvda_stock_data.csv\"`. Then, it opens the CSV with the method [`pd.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) as a data frame and shows its head (the first 5 rows) with the method [`df.head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).\n\nThis is the expected result:\n\n![The expected result](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-57.png)\n\n### Step #3: Visualize the trend over time of the `Adj Close` value\n\nNow that the data frame is correctly loaded, you can visualize the trend of the `Adj Close` value, which represents the adjusted closing value:\n\n```python\nimport matplotlib.pyplot as plt\n\n# Ensure the \"Date\" column is in datetime format\ndf[\"Date\"] = pd.to_datetime(df[\"Date\"])\n\n# Sort the data by date (if not already sorted)\ndf = df.sort_values(by=\"Date\")\n\n# Plot the \"Adj Close\" values over time\nplt.figure(figsize=(10, 6))\nplt.plot(df[\"Date\"], df[\"Adj Close\"], label=\"Adj Close\", linewidth=2)\n\n# Customize the plot\nplt.title(\"NVDA Stock Adjusted Close Prices Over Time\", fontsize=16) # Sets title\nplt.xlabel(\"Date\", fontsize=12) # Sets x-axis label\nplt.ylabel(\"Adjusted Close Price (USD)\", fontsize=12) # Sets y-axis label\nplt.grid(True, linestyle=\"--\", alpha=0.6) # Adds a dashed, semi-transparent grid\nplt.legend(fontsize=12) # Shows legend\nplt.tight_layout()\n\n# Show the plot\nplt.show()\n```\n\nThe above code does the following:\n\n- `df[\"Date\"]` accesses the `Date` column of the data frame and, with the method `pd.to_datetime()`, ensures that the dates are in 
the datetime format\n- `df.sort_values()` sorts the rows by the `Date` column. This ensures the data will be displayed in chronological order.\n- `plt.figure()` sets the dimensions of the plot and `plt.plot()` draws the line\n- The lines of code under the `# Customize the plot` comment set the title, the axis labels, the grid, and the legend\n- `plt.show()` renders the plot\n\nThe expected result is something like this:\n\n![NVDA stock adjusted close prices over time example](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-58.png)\n\nThis plot shows how NVIDIA's adjusted close price actually evolved over time. The machine learning model you will train has to predict these values as accurately as it can.\n\n### Step #3: Preparing data for machine learning\n\nLet's clean up and prepare the data:\n\n```python\nfrom sklearn.preprocessing import MinMaxScaler\n\n# Convert data types\ndf[\"Volume\"] = pd.to_numeric(df[\"Volume\"].str.replace(\",\", \"\"), errors=\"coerce\")\ndf[\"Open\"] = pd.to_numeric(df[\"Open\"].str.replace(\",\", \"\"), errors=\"coerce\")\n\n# Handle missing values \ndf = df.infer_objects().interpolate() \n\n# Select the target variable (\"Adj Close\") and scale the data\nscaler = MinMaxScaler(feature_range=(0, 1))  # Scale data between 0 and 1\ndata = scaler.fit_transform(df[[\"Adj Close\"]])\n```\n\nThe above code does the following:\n\n- Converts the `Volume` and `Open` columns to numbers with `pd.to_numeric()`, after removing the thousands separators\n- Handles missing values by filling them through interpolation with the method `interpolate()`\n- Creates a `MinMaxScaler()` that maps values to the range between 0 and 1\n- Selects the target variable `Adj Close` and scales it with the method `fit_transform()`\n\n### Step #4: Create the train and test sets\n\nThe model used for this tutorial is an LSTM ([Long Short-Term 
Memory](https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/)), which is a type of RNN ([Recurrent Neural Network](https://www.ibm.com/topics/recurrent-neural-networks)). The model learns from fixed-length windows of past values, so you need to turn the series into sequences:\n\n```python\nimport numpy as np\n\n# Create sequences of 60 time steps for prediction\nsequence_length = 60\nX, y = [], []\n\nfor i in range(sequence_length, len(data)):\n    X.append(data[i - sequence_length:i, 0])  # Last 60 days\n    y.append(data[i, 0])  # Target value\n\nX, y = np.array(X), np.array(y)\n\n# Split into training and test sets\nsplit_index = int(len(X) * 0.8)  # 80% training, 20% testing\nX_train, X_test = X[:split_index], X[split_index:]\ny_train, y_test = y[:split_index], y[split_index:]\n```\n\nThe above code snippet:\n\n- Creates sliding windows of 60 time steps. Each row of `X` holds the 60 previous values (the features), and `y` holds the value that follows them (the target).\n- Splits the sequences chronologically: the first 80% become the train set, and the last 20% become the test set.\n\n### Step #5: Train the model\n\nLet's train the RNN on the train set:\n\n```python\nfrom tensorflow.keras.models import Sequential\nfrom tensorflow.keras.layers import Dense, LSTM\n\n# Reshape X for LSTM [samples, time steps, features]\nX_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))\nX_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))\n\n# Build the Sequential Neural Network\nmodel = Sequential()\nmodel.add(LSTM(32, activation=\"relu\", return_sequences=False))\nmodel.add(Dense(1))\nmodel.compile(loss=\"mean_squared_error\", optimizer=\"adam\")\n\n# Train the Model\nhistory = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), verbose=1)\n```\n\nThis code does the following:\n\n- Reshapes the feature arrays of both the train and test sets into the `[samples, time steps, features]` layout that the LSTM expects, by using the method `reshape()`\n- Builds the network: an LSTM layer with 32 units followed by a single-output `Dense` layer, compiled with the Adam optimizer and the mean squared error loss\n- Fits the 
LSTM to the train set by using the method `fit()`, with the test set passed as validation data\n\nIn other words, the model is now fitted to the train set and is ready to make predictions.\n\n### Step #6: Make predictions and evaluate the model performance\n\nLet's evaluate the model's performance:\n\n```python\nfrom sklearn.metrics import mean_squared_error, r2_score\n\n# Make Predictions\ny_pred = model.predict(X_test)\n\n# Inverse scale predictions and actual values for comparison\ny_test = scaler.inverse_transform(y_test.reshape(-1, 1))\ny_pred = scaler.inverse_transform(y_pred)\n\n# Evaluate the Model\nmse = mean_squared_error(y_test, y_pred)\nr2 = r2_score(y_test, y_pred)\n\n# Print results\nprint(\"\\nLSTM Neural Network Results:\")\nprint(f\"Mean Squared Error: {mse:.2f}\")\nprint(f\"R-squared Score: {r2:.2f}\")\n```\n\nThis code does the following:\n\n- Generates predictions for the test set with the method `predict()`.\n- Converts the predictions and the actual values from the 0-1 scale back to prices in USD, so that they can be compared in the original units. This is done with the method `inverse_transform()`.\n- Evaluates the model by using the [mean squared error](https://scikit-learn.org/1.5/modules/generated/sklearn.metrics.mean_squared_error.html) and the [R^2 score](https://scikit-learn.org/dev/modules/generated/sklearn.metrics.r2_score.html).\n\nYour numbers may differ slightly from run to run because training an ML model is stochastic. Here is the expected result:\n\n![Expected result considering the statistical errors](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-59.png)\n\nThese results indicate that the model predicts the `Adj Close` well on the test set.\n\n### Step #7: Compare actual vs predicted values with a plot\n\nNumerical metrics alone aren't always sufficient. 
Let's create a plot that compares the actual values of the `Adj Close` with the ones predicted by the LSTM model:\n\n```python\n# Visualize the Results\ntest_results = pd.DataFrame({\n    \"Date\": df[\"Date\"].iloc[len(df) - len(y_test):],  # Test set dates\n    \"Actual\": y_test.flatten(),\n    \"Predicted\": y_pred.flatten()\n})\n\n# Setting plot\nplt.figure(figsize=(12, 6))\nplt.plot(test_results[\"Date\"], test_results[\"Actual\"], label=\"Actual Adjusted Close\", color=\"blue\", linewidth=2)\nplt.plot(test_results[\"Date\"], test_results[\"Predicted\"], label=\"Predicted Adjusted Close\", color=\"orange\", linestyle=\"--\", linewidth=2)\nplt.title(\"Actual vs Predicted Adjusted Close Prices (LSTM)\", fontsize=16)\nplt.xlabel(\"Date\", fontsize=12)\nplt.ylabel(\"Adjusted Close Price (USD)\", fontsize=12)\nplt.legend()\nplt.grid(alpha=0.6)\nplt.tight_layout()\nplt.show()\n```\n\nThis code:\n\n- Builds a data frame for the comparison on the test set. The `iloc[]` indexer selects the last `len(y_test)` dates, which correspond to the test set, and the method `flatten()` turns the actual and predicted arrays into one-dimensional columns.\n- Creates the plot, adds the axis labels and the title, and manages other settings to improve the visualization.\n\nThe expected result is something like this:\n\n![Actual vs predicted adjusted close prices ](https://github.com/luminati-io/web-scraping-for-machine-learning/blob/main/images/image-60.png)\n\nAs the plot illustrates, the LSTM neural network's predicted values (dashed orange line) closely match the actual values (solid blue line). 
While the analytical results were promising, the visualization further confirms their accuracy.\n\n### Step #8: Putting it all together\n\nHere is the complete code for the `analysis.ipynb` notebook:\n\n```python\nimport numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.metrics import mean_squared_error, r2_score\nfrom tensorflow.keras.models import Sequential\nfrom tensorflow.keras.layers import Dense, LSTM\n\n# Path to the CSV file\ncsv_path = \"../data/nvda_stock_data.csv\"  \n# Open CSV as data frame\ndf = pd.read_csv(csv_path)\n\n# Convert \"Date\" to datetime format\ndf[\"Date\"] = pd.to_datetime(df[\"Date\"])\n\n# Sort by date\ndf = df.sort_values(by=\"Date\")\n\n# Convert data types\ndf[\"Volume\"] = pd.to_numeric(df[\"Volume\"].str.replace(\",\", \"\"), errors=\"coerce\")\ndf[\"Open\"] = pd.to_numeric(df[\"Open\"].str.replace(\",\", \"\"), errors=\"coerce\")\n\n# Handle missing values \ndf = df.infer_objects().interpolate()\n\n# Select the target variable (\"Adj Close\") and scale the data\nscaler = MinMaxScaler(feature_range=(0, 1))  # Scale data between 0 and 1\ndata = scaler.fit_transform(df[[\"Adj Close\"]])\n\n# Prepare the Data for LSTM\n# Create sequences of 60 time steps for prediction\nsequence_length = 60\nX, y = [], []\n\nfor i in range(sequence_length, len(data)):\n    X.append(data[i - sequence_length:i, 0])  # Last 60 days\n    y.append(data[i, 0])  # Target value\n\nX, y = np.array(X), np.array(y)\n\n# Split into training and test sets\nsplit_index = int(len(X) * 0.8)  # 80% training, 20% testing\nX_train, X_test = X[:split_index], X[split_index:]\ny_train, y_test = y[:split_index], y[split_index:]\n\n# Reshape X for LSTM [samples, time steps, features]\nX_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))\nX_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))\n\n# Build the Sequential Neural Network\nmodel = Sequential()\nmodel.add(LSTM(32, 
activation=\"relu\", return_sequences=False))\nmodel.add(Dense(1))\nmodel.compile(loss=\"mean_squared_error\", optimizer=\"adam\")\n\n# Train the Model\nhistory = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), verbose=1)\n\n# Make Predictions\ny_pred = model.predict(X_test)\n\n# Inverse scale predictions and actual values for comparison\ny_test = scaler.inverse_transform(y_test.reshape(-1, 1))\ny_pred = scaler.inverse_transform(y_pred)\n\n# Evaluate the Model\nmse = mean_squared_error(y_test, y_pred)\nr2 = r2_score(y_test, y_pred)\n\n# Print results\nprint(\"\\nLSTM Neural Network Results:\")\nprint(f\"Mean Squared Error: {mse:.2f}\")\nprint(f\"R-squared Score: {r2:.2f}\")\n\n# Visualize the Results\ntest_results = pd.DataFrame({\n    \"Date\": df[\"Date\"].iloc[len(df) - len(y_test):],  # Test set dates\n    \"Actual\": y_test.flatten(),\n    \"Predicted\": y_pred.flatten()\n})\n\n# Setting plot\nplt.figure(figsize=(12, 6))\nplt.plot(test_results[\"Date\"], test_results[\"Actual\"], label=\"Actual Adjusted Close\", color=\"blue\", linewidth=2)\nplt.plot(test_results[\"Date\"], test_results[\"Predicted\"], label=\"Predicted Adjusted Close\", color=\"orange\", linestyle=\"--\", linewidth=2)\nplt.title(\"Actual vs Predicted Adjusted Close Prices (LSTM)\", fontsize=16)\nplt.xlabel(\"Date\", fontsize=12)\nplt.ylabel(\"Adjusted Close Price (USD)\", fontsize=12)\nplt.legend()\nplt.grid(alpha=0.6)\nplt.tight_layout()\nplt.show()\n```\n\nThis code goes straight to the goal, skipping initial data previews and plotting only `Adj Close` values, as these steps were covered earlier for preliminary analysis.\n\n\u003e **Note**:\n\u003e \n\u003e While the code is shown in parts, it's best to run the full code at once due to ML's stochastic nature; otherwise, the final plot may vary significantly.\n\n## Notes on Fitting an LSTM Neural Network\n\nFor simplicity, this guide focuses directly on fitting an LSTM neural network. 
However, in real-world ML applications, the process involves several key steps:\n\n1. **Preliminary Data Analysis**. This is the most crucial step, where you understand your data, clean NaN values, handle duplicates, and resolve any mathematical inconsistencies.\n\n2. **Training ML Models**. The first model you try may not be the best. A common approach is [spot-checking](https://machinelearningmastery.com/spot-check-machine-learning-algorithms-in-python/), which involves:\n\n   - Training 3-4 ML models on the training set and evaluating their performance.\n   - Selecting the top 2-3 models and [tuning their hyperparameters](https://scikit-learn.org/1.5/modules/grid_search.html).\n   - Comparing the best-tuned models on the test set.\n   - Choosing the highest-performing model.\n\n3. **Deployment**. The best-performing model is then deployed for production use.\n\n## Setting Up ETLs When Scraping Data for Machine Learning\n\nSaving web-scraped data as a CSV is a common practice in machine learning, especially at the start of a project when searching for the best predictive model.  \n\nOnce the best model is identified, an **ETL (Extract, Transform, Load) pipeline** is typically set up to automate data retrieval, cleaning, and storage.\n\nHere is the ETL Process for ML Workflows:\n\n- **Extract**: Retrieve data from various sources, including web scraping.  \n- **Transform**: Clean and prepare the collected data.  \n- **Load**: Store the processed data in a database or data warehouse.  \n\nOnce stored, the data is integrated into the ML workflow to **re-train and re-validate the model** with new data.\n\n## Conclusion\n\nNeed data for machine learning but not familiar with web scraping?  [Check out our solutions for efficient data retrieval](https://brightdata.com/use-cases/data-for-ai).  
\n\nSign up for a free Bright Data account to try our scraper APIs or explore our datasets.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fweb-scraping-for-machine-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fluminati-io%2Fweb-scraping-for-machine-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fluminati-io%2Fweb-scraping-for-machine-learning/lists"}