{"id":24520448,"url":"https://github.com/arnauog/van_price_predictor","last_synced_at":"2026-04-18T04:03:09.455Z","repository":{"id":271943348,"uuid":"915043606","full_name":"arnauog/Van_Price_Predictor","owner":"arnauog","description":"3rd bootcamp project. Wanna buy a van? Let's predict van prices and analyze which factors contribute the most to its price! https://vanpricepredictor.streamlit.app/","archived":false,"fork":false,"pushed_at":"2026-02-03T16:15:56.000Z","size":72034,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-02-04T06:01:42.970Z","etag":null,"topics":["machine-learning","pandas","python","seaborn","streamlit","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arnauog.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-10T21:01:08.000Z","updated_at":"2026-02-03T16:30:23.000Z","dependencies_parsed_at":"2025-10-24T11:21:25.385Z","dependency_job_id":null,"html_url":"https://github.com/arnauog/Van_Price_Predictor","commit_stats":null,"previous_names":["arnauog/van_price_predictor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arnauog/Van_Price_Predictor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arnauog%2FVan_Price_Predictor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arn
auog%2FVan_Price_Predictor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arnauog%2FVan_Price_Predictor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arnauog%2FVan_Price_Predictor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arnauog","download_url":"https://codeload.github.com/arnauog/Van_Price_Predictor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arnauog%2FVan_Price_Predictor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31955920,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T00:39:45.007Z","status":"online","status_checked_at":"2026-04-18T02:00:07.018Z","response_time":103,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","pandas","python","seaborn","streamlit","webscraping"],"created_at":"2025-01-22T02:22:35.963Z","updated_at":"2026-04-18T04:03:04.442Z","avatar_url":"https://github.com/arnauog.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Van price prediction\r\n\r\n![](images/00-header.jpg)\r\n\r\nThis project had two requirements: we had to use Machine Learning, and we had to get the data fully or at least partly by webscraping.\r\n\r\nChoosing the right topic was crucial: my priority was to choose a topic where I could get a good model, 
because if I couldn't get a good model after all the data cleaning, all the work would have been in vain.\r\n\r\nI thought about different topics that interest me, and in the end I thought about predicting the price of a van, because all or almost all the variables have an effect on the price: brand, model, age, kms, fuel, seats, etc.\r\n\r\nhttps://vanpricepredictor.streamlit.app/\r\n\r\nHere's the link to the final predictor.\r\n\r\n# **Webscraping**\r\n\r\nThe best website I could find was a German one (https://www.mobile.de/es), because its information was very structured, unlike others, and I could apply a lot of filters to find the vehicles I wanted.\r\n\r\nSo I analyzed the information I could find in the general view and inside every van's ad, and I decided I had to open every vehicle's page to get all the information I wanted. \r\n\r\n**Variables** - here is the information I wanted to get and the filters I applied:\r\n\r\n``title``: original title, from which I can get the ``brand`` and ``model``\r\n\r\n``price``: including VAT (IVA in Spain)\r\n\r\n``year``: taken from the first registration date of the van\r\n\r\n``fuel``: only Diesel and Gasoline, which make up almost 98% of the vehicles on the website\r\n\r\n``km``\r\n\r\n``power``: the horsepower in cv, missing in some vans.\r\n\r\n``displacement``: the engine displacement in cm3. I set a filter of minimum 1000 cm3 so I can always get this info, because I found out that some vehicles don't show it. It's related to the horsepower, so maybe I don't need it, but I want to get it and then decide.\r\n\r\n``consumption``: in L/100km, missing in some vehicles. I set a minimum of 3L/100km as a filter.\r\n\r\n``seats``: sometimes missing, so I filter for a minimum of 2 seats.\r\n- 2: small cargo van\r\n- 3: big cargo van\r\n- 4: big passenger van, sometimes camperized, with the kitchen occupying one seat\r\n- 5: small passenger van\r\n\r\n``owners``: previous owners. Sometimes missing, so I set a filter to always get this info. 
\r\n\r\n``sliding doors``: I mark every option (right, left and both-sided) to exclude cars. Vans either have or don't have a sliding left door, so left and both-sided should actually be the same. \r\n\r\n**Filters** - Some other filters I applied:\r\n- Second-hand vehicles, to exclude new vehicles.\r\n- Type: van\r\n- Ads with images (in case I need to check information later, like the model)\r\n- Damaged vehicles: don't show\r\n\r\n## Webscraping Code\r\nI had more challenges than I expected, because the information was not perfectly structured; for example, the title came in two different formats.\r\n\r\nLike I said, some information was missing, like the horsepower and the consumption, so I appended a NaN every time the scraper couldn't find it. \r\n\r\nFiltering for vans with only the rear right door, I had around 29,000 vans. Inside a specific van advertisement I could move to the next van, so theoretically I could run the code only once to scrape all 29,000 vans, but in reality, when it reached van number 2,000, it couldn't find the next button, so I had to run the code again.\r\n\r\n![](images/01-next-button-not-found-2000.png)\r\n\r\nI scraped the vans in descending order of km, so I could detect when the scraper was trying to scrape a van already in my df. If that happened, for example because it had accidentally gone back to a previous page, I made it stop so I wouldn't waste time. I also knew from which van to continue just by calculating ``df.km.min()``.\r\n\r\nI checked whether the van I was about to scrape was already in df, and I skipped it in those cases, which happened quite often.\r\n\r\nWhenever I finished one round of scraping, I saved the info into a df, which I then concatenated with the previous df that contained all the vans I had scraped previously. \r\n\r\nI decided to scrape only vans with more than 10,000 km to set a limit, and I ended up with 21,738 vans. 
\r\n\r\n# **Data Cleaning**\r\nLike I said before, from the first registration date I could get the year, and then the ``age``, which is the variable I'm really interested in.\r\n\r\nI had only 7 nulls in ``power_cv`` and 2,828 nulls in ``consumption``, which represent 13% of the dataset.\r\n\r\n## ``title``\r\n![](images/02-original-title.png)\r\n\r\nFrom the title I got the ``brand`` and ``model`` by assuming that in all the ads the first word is the brand and the second one is the model. This holds in most cases but not in all of them, so I had to check the different models brand by brand. \r\n\r\nHere's an example of the different models at the beginning and after the clean-up.\r\n\r\n**Before**: \r\n\r\n![](images/03-different-models-wrong.png)\r\n\r\n**After**: \r\n\r\n![](images/03-different-models-cleaned.png)\r\n\r\nThe worst brand was **Ford**, because in most cases the model is not determined by the second word of the title. The most common second words were Transit and Tourneo, but both of these come in different sizes (depending on whether it's a Connect, a Custom, etc.), so there are many different models. 
For the sake of simplicity, and because later I want to group the models again by size, I set 3 different Ford models depending on the size: Transit, Custom and Connect.\r\n\r\nHere is a picture with 4 different models, the 2 on the left being considered the same model, since they are roughly the same size.\r\n\r\n![](images/Ford-van-sizes.jpeg)\r\n\r\nI dropped 4 brands that had only one model each and fewer than 4 vehicles per brand, since they are not representative and I prefer to simplify the dataset. I also dropped models with 6 or fewer vans.\r\n\r\nIn the beginning I had models from 24 brands, and I ended up with only **48 models from 12 brands.**\r\n\r\nI started with 18,636 vehicles and finished with 17,686, meaning I dropped almost a thousand vehicles, which either were cars or could no longer be found on the website.\r\n\r\n## ``consumption``\r\n\r\nI found a '--' value and many wrong outliers, so I corrected them manually by looking up vans with the same characteristics. Let's take a look at the boxplot before and after:\r\n\r\n![](images/08-consumption-boxplot-before.png)\r\n![](images/08-consumption-boxplot-after.png)\r\n\r\n## ``seats``\r\n\r\nSame thing: I took a closer look at the outliers and saw that some of the values were wrong. 
\r\n\r\n![](images/09-seats-boxplot-after.png)\r\n\r\n![](images/09-seats-countplot.png)\r\n\r\nMost of the vans have 5, 3 or 7 seats.\r\n\r\n5 seats: standard passenger van\r\n\r\n3 seats: big cargo van\r\n\r\n7 seats: big passenger van\r\n\r\n## ``owners``\r\n\r\nThis is the only variable where the scraping code failed: it took values from another characteristic, so I had to check the vans one by one to get the right values.\r\n\r\n![](images/10-owners-unique.png)\r\n\r\nMost of the vans (71%) have had only one owner.\r\n\r\n![](images/10-owners-value_counts.png)\r\n\r\n# **Exploratory Data Analysis**\r\n\r\n### ``brand`` and ``model``: \r\n\r\nMost frequent models: \r\n\r\n![](images/04-most-common-models.png)\r\n\r\nNumber of different models per brand:\r\n\r\n![](images/04-models-per-brand.png)\r\n\r\nNumber of vans per brand:\r\n\r\n![](images/04-vans-per-brand.png)\r\n\r\n### ``fuel``\r\n\r\nMost of the vans (91.79%) run on diesel.\r\n\r\n![](images/07-fuel-value_counts.png)\r\n\r\n\r\n# **Data preprocessing**\r\n\r\nBefore going any further, I generated some graphs, like scatterplots and a heatmap, in order to find odd values, and generated some predictions.\r\n\r\nIn doing so, I realized that some predicted prices were negative. After some digging, I found that vans get a negative predicted price mostly for one or more of the following reasons: \r\n- High mileage\r\n- Old age\r\n- Low horsepower\r\n\r\n![](images/11_predicted_negative_prices.png)\r\n\r\nOf course, the model also tends to predict a negative price for vans that are actually cheap.\r\n\r\n## Numerical variables\r\n\r\n### ``price``\r\n\r\nI had some outliers at the upper end, so I decided to drop them (only 9 vans). 
I also dropped the cheapest 1% of vans.\r\n\r\nLet's take a look at the boxplot before and after:\r\n\r\n![](images/05-price-boxplot-before.png)\r\n![](images/05-price-boxplot-after.png)\r\n\r\nThe mean price is **29,578€**\r\n\r\n### ``km``\r\n\r\nSame thing: I had some outliers at the upper end, but in this case I decided to drop only one van. \r\n\r\n![](images/12_km_scatterplot.png)\r\n\r\nLike I said before, thanks to the scatterplot I could find a van whose price doesn't follow the same pattern as the other vans, so I dropped it as well. Let's take a look at the boxplot before and after:\r\n\r\n![](images/06-km-boxplot-before.png)\r\n![](images/06-km-boxplot-after.png)\r\n\r\nThe mean mileage is **106,862 km**\r\n\r\nI did the same thing with ``age`` and ``power_cv``.\r\n\r\n## Categorical variables\r\n\r\nI got dummies for the following features:\r\n- ``fuel``: Diesel or Gasoline \r\n- ``owners``: one or more\r\n- ``sliding_doors``: right or both-sided\r\n\r\nGoing back to ``seats``, from the number of seats I can tell if a van is a cargo van (2 or 3 seats) or a passenger van (4 or more seats), so I can get another dummy, ``cargo``.\r\n\r\nThanks to the extensive work I did on the brand and model, I can get 3 new categories: ``brand_price``, ``model_price`` and ``van_size``. It doesn't make sense to get dummies out of all the brands or models (remember, 12 brands and 48 models), so instead I separate the brands and models into 3 categories depending on their mean price, which I get from a groupby.\r\n\r\n![](images/13-brand-per-price.png) ![](images/13-brand-per-price_groupby.png)\r\n\r\n![](images/13-brand-per-price-code.png)\r\n\r\nI do this because an expensive brand can also have affordable models, and vice versa; usually there is a model for every budget.\r\n\r\n![](images/16-models-per-price_groupby.png)\r\n\r\nFrom the model I can also separate the vans into 3 different sizes.\r\n\r\nWith all these dummies I generated barplots against the price to see how they affect it. 
Here are some examples. \r\n\r\n![](images/14-price-per-cargo.png) ![](images/14-price-per-fuel.png) ![](images/14-price-per-doors.png) \r\n\r\nWith the heatmap we can observe the relationship between the different features and the price. \r\n\r\n![](images/15-heatmap.png)\r\n\r\n# **Machine Learning model**\r\n\r\nI trained and tested the following models: OLS, Ridge, Lasso, Polynomial Features, KNN and SVR. \r\n\r\nAt the very beginning of the project I ran a simple OLS model to check the initial performance, and I got the following results:\r\n\r\n- R2: 0.73272\r\n- MSE: 96434466.68379\r\n- MAPE: 41.14608%\r\n\r\nAfter all the processing, with the same OLS I got these results:\r\n\r\n- R2: 0.87235\r\n- MSE: 43671531.53351\r\n- MAPE: 25.48113%\r\n\r\nWe can see a clear improvement. I tested all the models with the data both unscaled and scaled, and I could only see a difference with KNN, which is also the one that gives me the best results:\r\n\r\n- R2: 0.93136\r\n- MSE: 24085983.59092\r\n- MAPE: 14.88392%\r\n\r\nMaking the predictions once again, I am glad to see that there are no negative predicted prices, and the distributions of the predicted prices and actual prices are very similar. \r\n\r\n![](images/17-histplot-price-real.png)\r\n\r\n![](images/17-histplot-price-predicted.png)\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnauog%2Fvan_price_predictor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farnauog%2Fvan_price_predictor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farnauog%2Fvan_price_predictor/lists"}