{"id":20709953,"url":"https://github.com/oxylabs/web-scraping-machine-learning","last_synced_at":"2025-12-16T07:00:12.297Z","repository":{"id":134336683,"uuid":"526101112","full_name":"oxylabs/web-scraping-machine-learning","owner":"oxylabs","description":"Web Scraping for Machine Learning","archived":false,"fork":false,"pushed_at":"2025-09-24T09:22:42.000Z","size":142,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-13T18:33:18.667Z","etag":null,"topics":["ai-web-scraper","github-python","machine-learning","machine-learning-web-scraper","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oxylabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-08-18T07:20:28.000Z","updated_at":"2025-11-02T04:03:20.000Z","dependencies_parsed_at":"2024-04-19T12:29:38.273Z","dependency_job_id":"197724da-32ad-472d-a2fd-4722d21aac8d","html_url":"https://github.com/oxylabs/web-scraping-machine-learning","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/oxylabs/web-scraping-machine-learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-machine-learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-machine-learning/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-machine-learning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-machine-learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oxylabs","download_url":"https://codeload.github.com/oxylabs/web-scraping-machine-learning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oxylabs%2Fweb-scraping-machine-learning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":27760424,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-16T02:00:10.477Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-web-scraper","github-python","machine-learning","machine-learning-web-scraper","python"],"created_at":"2024-11-17T02:09:14.804Z","updated_at":"2025-12-16T07:00:12.274Z","avatar_url":"https://github.com/oxylabs.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping for Machine Learning\n\n[![Oxylabs promo 
code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877\u0026utm_medium=affiliate\u0026groupid=877\u0026utm_content=web-scraping-machine-learning-github\u0026transaction_id=102f49063ab94276ae8f116d224b67)\n\n[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge\u0026theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge\u0026logo=youtube\u0026logoColor=white)](https://www.youtube.com/@oxylabs)\n\n\nYou can see the full guide on our [blog](https://oxylabs.io/blog/web-scraping-for-machine-learning).\n\n## Project requirements\n\n```bash\n$ python3 -m pip install requests_html beautifulsoup4\n```\n\n```bash\n$ python3 -m pip install pandas numpy matplotlib seaborn tensorflow scikit-learn keras\n```\n\n## Extracting the data\n\nIf we’re looking at machine learning projects, Jupyter Notebook is a great choice as it’s easier to run and rerun a few lines of code. Moreover, the plots are in the same Notebook.\n\nBegin with importing required libraries as follows:\n\n```python\nfrom requests_html import HTMLSession\nimport pandas as pd\n```\n\nFor web scraping, we only need `Requests-HTML`. The primary reason is that `Requests-HTML` is a powerful library that can handle all our web scraping tasks, such as extracting the HTML code from websites and parsing this code into Python objects. Further benefits come from the library’s ability to function as an HTML parser, meaning collecting data and labeling can be performed using the same library. 
\n\nNext, we use pandas to load the data into a DataFrame for further processing.\n\nIn the next cell, create a session and get the response from your target URL.\n\n```python\nurl = 'https://finance.yahoo.com/quote/AAPL/history?p=AAPL\u0026guccounter=1\u0026period1=1556113078\u0026period2=1713965616'\nsession = HTMLSession()\nr = session.get(url)\n```\n\nAfter this, use XPath to select the desired data. It’ll be easier if each row is represented as a dictionary where the key is the column name. All these dictionaries can then be added to a list.\n\n```python\nrows = r.html.xpath('//table/tbody/tr')\nsymbol = 'AAPL'\ndata = []\nfor row in rows:\n    # Skip rows that do not contain all seven columns\n    if len(row.xpath('.//td')) \u003c 7:\n        continue\n    data.append({\n        'Symbol': symbol,\n        'Date': row.xpath('.//td[1]/text()')[0],\n        'Open': row.xpath('.//td[2]/text()')[0],\n        'High': row.xpath('.//td[3]/text()')[0],\n        'Low': row.xpath('.//td[4]/text()')[0],\n        'Close': row.xpath('.//td[5]/text()')[0],\n        'Adj Close': row.xpath('.//td[6]/text()')[0],\n        'Volume': row.xpath('.//td[7]/text()')[0]\n    })\ndf = pd.DataFrame(data)\n```\n\nThe scraped results are stored in the variable `data`, a list of dictionaries that pandas can convert directly to a data frame. Because each dictionary key names a column, this step also takes care of labeling the data.\n\n![](https://images.prismic.io/oxylabs-sm/OGFjNzk2M2YtN2FlOS00YWY2LWFiMzEtOTM2YTBkMGZjYmM5_initial_dataframe.png?auto=compress,format\u0026rect=0,0,2237,498\u0026w=2237\u0026h=498\u0026fm=webp\u0026dpr=2\u0026q=50)\n\nThis data frame is not yet ready for the machine learning step. It still needs additional cleaning.\n\n## Cleaning the data\n\nNow that the data has been collected using web scraping, we need to clean it up. 
We can’t yet be sure that every column has a usable data type, so start by inspecting the data frame with `df.info()`.\n\n![](https://images.prismic.io/oxylabs-sm/NmZiMzFkNjctYmE2MS00YTc5LWE3ZTQtOWU5YzBmNTZkZWZj_df_info.png?auto=compress,format\u0026rect=0,0,2240,649\u0026w=2240\u0026h=649\u0026fm=webp\u0026dpr=2\u0026q=50)\n\nAs the output above shows, every column has the `object` data type. Machine learning algorithms need numeric values instead.\n\nDates can be handled using `pandas.to_datetime`, which takes a series and converts its values to `datetime`:\n\n```python\ndf['Date'] = pd.to_datetime(df['Date'])\n```\n\nThe remaining columns are still strings because the scraped values contain comma separators, so they can’t be converted to numbers directly.\n\nThankfully, there are multiple ways to handle this. The easiest is to remove the commas with the data frame’s `replace()` method, passing `regex=True` so that the match applies within each string. Chaining `astype(float)` on the same line then converts the values to `float`.\n\n```python\n# Include every price column, so that 'Open' is converted as well\nstr_cols = ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']\ndf[str_cols] = df[str_cols].replace(',', '', regex=True).astype(float)\n```\n\nFinally, any rows with `None` or `NaN` values can be dropped by calling `dropna()`.\n\n```python\ndf.dropna(inplace=True)\n```\n\nAs the last step, set the `Date` column as the index and preview the data frame.\n\n```python\ndf = df.set_index('Date')\ndf.head()\n```\n\n![](https://images.prismic.io/oxylabs-sm/ZmY1ODUxYzUtZGY0Yy00M2M0LWIzNzUtODhkYjBhYjQwMWJl_clean_dataframe.png?auto=compress,format\u0026rect=0,0,2242,541\u0026w=2242\u0026h=541\u0026fm=webp\u0026dpr=2\u0026q=50)\n\nThe data frame is now clean and ready to be sent to the machine learning model.\n\n## Visualizing the data\n\nBefore we begin the section on machine learning, let’s have a quick look at the closing price trend.\n\nFirst, import the packages and set the plot 
styles:\n\n```python\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set_style('darkgrid')\nplt.style.use('ggplot')\n```\n\nNext, enter the following lines to plot `Adj Close`, which is the adjusted closing price:\n\n```python\nplt.figure(figsize=(15, 6))\ndf['Adj Close'].plot()\nplt.ylabel('Adj Close')\nplt.xlabel(None)\nplt.title('Closing Price of AAPL')\nplt.show()\n```\n\n![](https://images.prismic.io/oxylabs-sm/NTA2ZGQxZmUtNWZkMi00ODQzLTljMTAtMGUyNTEyZGJiZGZj_closing_price_aapl.png?auto=compress,format\u0026rect=0,0,889,351\u0026w=889\u0026h=351\u0026fm=webp\u0026dpr=2\u0026q=50)\n\n## Preparing data for machine learning\n\nThe first step in machine learning is selecting the features and the values we want to predict.\n\nIn this example, the adjusted closing price is derived from the closing price. Therefore, we’ll ignore the `Close` column and focus on `Adj Close`.\n\nThe features are usually stored in a variable named `X` and the values that we want to predict are stored in a variable `y`.\n\n```python\nfeatures = ['Open', 'High', 'Low', 'Volume']\ny = df.filter(['Adj Close'])\n```\n\nThe next step is feature scaling, which normalizes the features, i.e., the independent variables. In our example, we can use `MinMaxScaler`. This class is part of the `preprocessing` module of the scikit-learn library.\n\nFirst, we’ll create an object of this class. Then, we’ll fit the scaler and transform the values using the `fit_transform` method as follows:\n\n```python\nfrom sklearn.preprocessing import MinMaxScaler\nscaler = MinMaxScaler()\nX = scaler.fit_transform(df[features])\n```\n\nThe next step is splitting the data into two datasets: training and test.\n\nThe example we’re working with is time-series data, that is, data ordered by time. A random split would let the model train on observations that come after the ones it’s tested on, so this kind of data requires specialized handling. 
The `TimeSeriesSplit` class from scikit-learn’s `model_selection` module is what we need here.\n\n```python\nfrom sklearn.model_selection import TimeSeriesSplit\ntscv = TimeSeriesSplit(n_splits=10)\nfor train_index, test_index in tscv.split(X):\n    X_train, X_test = X[train_index], X[test_index]\n    y_train, y_test = y.iloc[train_index], y.iloc[test_index]\n# After the loop, these variables hold the last (largest) split\n```\n\nOur approach will be to create a neural network that uses an LSTM (Long Short-Term Memory) layer. LSTM expects a 3-dimensional input with information about the batch size, timesteps, and input dimensions. We need to reshape the features as follows:\n\n```python\nX_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])\nX_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])\n```\n\n## Training the model and predictions\n\nWe’re now ready to create a model. Import the `Sequential` model, `LSTM` layer, and `Dense` layer from Keras as follows:\n\n```python\nfrom keras.models import Sequential\nfrom keras.layers import LSTM, Dense\n```\n\nContinue by creating an instance of the Sequential model and adding two layers. 
The first layer will be an LSTM with 32 units, while the second will be a Dense layer with a single output.\n\n```python\nmodel = Sequential()\nmodel.add(LSTM(32, activation='relu', return_sequences=False))\nmodel.add(Dense(1))\nmodel.compile(loss='mean_squared_error', optimizer='adam')\n```\n\nThe model can be trained with the following line of code:\n\n```python\nmodel.fit(X_train, y_train, epochs=100, batch_size=8)\n```\n\nPredictions can then be made using this line of code:\n\n```python\ny_pred = model.predict(X_test)\n```\n\nFinally, let’s plot the actual values and predicted values with the following:\n\n```python\nplt.figure(figsize=(15, 6))\nplt.plot(y_test.values, label='Actual Value')\nplt.plot(y_pred, label='Predicted Value')\nplt.ylabel('Adjusted Close (Scaled)')\nplt.xlabel('Time Scale')\nplt.legend()\nplt.show()\n```\n\n![](https://images.prismic.io/oxylabs-sm/NTE5ZGFkMDUtN2U4Ni00ZmZjLTkwNDEtNjYxYzZmY2NkZjhl_predictions.png?auto=compress,format\u0026rect=0,0,889,370\u0026w=889\u0026h=370\u0026fm=webp\u0026dpr=2\u0026q=50)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-machine-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foxylabs%2Fweb-scraping-machine-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foxylabs%2Fweb-scraping-machine-learning/lists"}