{"id":26870786,"url":"https://github.com/amidug/brazilpipeline","last_synced_at":"2026-04-10T17:05:30.887Z","repository":{"id":280534231,"uuid":"942298078","full_name":"AmiDug/BrazilPipeline","owner":"AmiDug","description":null,"archived":false,"fork":false,"pushed_at":"2025-03-30T21:39:00.000Z","size":1654,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-30T22:27:22.043Z","etag":null,"topics":["fastapi","keras","mlflow","python","scikit-learn","tensorflow","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AmiDug.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-03T22:18:02.000Z","updated_at":"2025-03-30T21:39:03.000Z","dependencies_parsed_at":"2025-03-30T22:24:16.813Z","dependency_job_id":null,"html_url":"https://github.com/AmiDug/BrazilPipeline","commit_stats":null,"previous_names":["amidug/brazilpipeline"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmiDug%2FBrazilPipeline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmiDug%2FBrazilPipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmiDug%2FBrazilPipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmiDug%2FBrazilPipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AmiDug","download_url":"https://codel
oad.github.com/AmiDug/BrazilPipeline/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246429494,"owners_count":20775808,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastapi","keras","mlflow","python","scikit-learn","tensorflow","xgboost"],"created_at":"2025-03-31T07:17:06.509Z","updated_at":"2025-12-30T23:15:45.794Z","avatar_url":"https://github.com/AmiDug.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Description:\n\nGPU-accelerated ML pipeline for price prediction based on this Kaggle dataset: https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?select=olist_order_payments_dataset.csv\n\nUses scikit-learn, TensorFlow, Keras and XGBoost models.\n\nStart a local MLflow tracking server from (.venv): mlflow server --backend-store-uri sqlite:///instance/mlflow.db --default-artifact-root ./mlflow-artifacts --host 127.0.0.1 --port 8080\n\nServe a consumable FastAPI endpoint from (.venv): uvicorn api.main:app --host 0.0.0.0 --port 8000\n\nYou can run the pipeline from the API using curl commands or by visiting http://localhost:8000/docs\n\nReport:\n\nThe pipeline trains both classical supervised models and a deep learning model so that their performance on the dataset can be compared.\n\nThe pipeline begins with a data ingestion step that retrieves the datasets from the Kaggle API, merges them and then creates a Pandas dataframe.\n\nThe dataframe gets passed to data validation, which checks data types, value ranges and outliers. 
If the data passes these checks then MLflow visualization charts are created.\n\nData transformation converts categorical values to numerical ones, imputes unknown values, removes price outliers, and does feature engineering.\n\nFollowing transformation, the data is split into training (78,707 samples) and testing (33,732 samples) sets using a 70/30 ratio with controlled randomization to ensure reproducibility.\n\nModel training creates 4 different models: a decision tree with a depth of 12; a random forest with a depth of 15 and 200 trees; a Keras neural network with batch normalization, dropout regularization, early stopping and learning rate scheduling to prevent overfitting; and a GPU-accelerated XGBoost model with a depth of 12, 500 boosting rounds and built-in cross-validation.\n\nFinally, model evaluation calculates each model's performance with metrics such as R^2, RMSE, MAE and MAPE, and logs the results and artifacts to MLflow.\n\nThe data consists of around 100k transactions made between 2016 and 2018 on the Brazilian e-commerce website Olist[1].\n\nThere are 9 different datasets, covering customers, geolocation, order items, order payments, order reviews, orders, products, sellers and product categories.\n\nAll 9 datasets are joined for the purposes of training the models and extracting valuable information regarding which features are useful for predicting the price of products during transactions.\n\nPrice prediction matters for e-commerce websites because it lets them set prices in a data-informed way and maximize profit. A price that is too low\ncan lead to a net loss for the company, while a price that is too high may mean that no transaction is made at all. An excessive price may even lead to\nclaims of price gouging, which damages the retailer's reputation and may result in legal action against the corporate entity.\n\nAccording to our 
pipeline the merged datasets consist of 112650 rows with 52 columns and occupy 126.19 MB on disk. 19 of the features were numerical while the rest were categorical.\n\nThe data was cleaned through various means. 8427 price outliers were identified and 211 in the 99.9th percentile were removed, \nsince extreme outliers can overly bias the data despite making up a small share of the total number of entries.\n\nThe price distribution following this step can be seen in this image:\n\n![price_distribution](https://raw.githubusercontent.com/AmiDug/BrazilPipeline/refs/heads/master/documents/price_distribution.png)\n\nThere were also 10225 duplicate product-order combinations, which were removed because duplicate data gives a single data entry several times the training impact that it should have.\n\nCategorical features such as product_category_name_english, customer_state, seller_state and payment_type had missing values that were imputed with \"unknown\" values rather than dropping the entire feature,\nwhich preserves the data volume while keeping the missing values from affecting the results.\n\nCategorical features were also converted through label encoding to numerical representations, which are more suitable for machine learning models.\n\nMost features were determined to be non-predictive and were either merged with other features to create something with predictive power or, in most cases, simply dropped. \nOf the 52 columns, only 16 features, engineered or otherwise, were used for training.\n\nAn Exploratory Data Analysis (EDA) was performed and produced some interesting findings.\n\nFirstly, there was significant variation in pricing across product categories; in particular, electronics and furniture commanded higher prices than other categories.\n\n![category_distribution](https://raw.githubusercontent.com/AmiDug/BrazilPipeline/refs/heads/master/documents/category_distribution.png)\n\nGeographic analysis revealed that certain regions 
such as São Paulo and Rio de Janeiro were more heavily represented than others.\n\n![state_distribution](https://raw.githubusercontent.com/AmiDug/BrazilPipeline/refs/heads/master/documents/state_distribution.png)\n\nFeature importance analysis showed that there are significant relationships between price and the volume and weight of the cargo.\n\n![feature_importance](https://raw.githubusercontent.com/AmiDug/BrazilPipeline/refs/heads/master/documents/xgb_feature_importance.png)\n\nFeature engineering was used to combine features into ones with stronger correlations to price. An example is a volume feature that combines the length, height and width of a product, which turned out to be highly correlated with price.\nSome features were clear cases of data leakage, such as payment amount, which differs from the product price only by shipping costs and is therefore very highly correlated with it. These features were removed.\n\nFour models were employed: a decision tree, a random forest, XGBoost and a neural network.\n\n![model summary](https://raw.githubusercontent.com/AmiDug/BrazilPipeline/refs/heads/master/documents/model_summary.png)\n\nDecision Tree:\n\n![predicted vs actual](https://github.com/user-attachments/assets/fe2c6eee-3fb7-4cf6-b6bb-66badb26db8f)\n\nRandom Forest:\n\n![predicted vs actual](https://github.com/user-attachments/assets/6c6fcdb6-f6db-456a-98cb-53075e5dd418)\n\nNeural Network:\n\n![predicted vs actual](https://github.com/user-attachments/assets/cfcaff43-ec58-4ded-865b-0478fee47637)\n\n![training history](https://github.com/user-attachments/assets/98bb06d4-c505-4463-91b1-825882b4baaa)\n\nXGBoost:\n\n![error by price range](https://github.com/user-attachments/assets/9a480d9c-89f8-4117-8fd6-49f0705b3cb2)\n\n![predicted vs actual](https://github.com/user-attachments/assets/06b9058c-7fcf-4e02-aa1b-5480c974c342)\n\nXGBoost had the best accuracy at 76.8%. 
This model acquired a result of 58% before tuning because it would take too long to train it properly on a CPU, so computation was switched to a GPU and the number of boosting rounds was increased, which gave a better accuracy.\nXGBoost's built-in cross-validation was also used to evaluate several samplings of the data and make sure that the accuracy would be fairly consistent across different parts of the dataset.\n\nThe results suggest that a single decision tree is too simple a model for such a large dataset with intricately linked and complex features. Ensemble approaches such as gradient boosting and random forest work better for this type of data because they spread the \ntraining workload across multiple trees. Combining these trees yields a better result than relying on the hard binary splits of a single decision tree, since it reduces variance and overfitting. The neural network also did decently since there were a lot of non-linear patterns that could be recognized; however, neural networks often struggle with tabular data and tend to be better suited to unstructured data.\n\nOther metrics such as RMSE, MAE and MAPE were roughly in line with the accuracy, so there doesn't seem to be anything unusual about them.\n\nIn conclusion, I believe that the biggest takeaway from this report is that sometimes results are limited by hardware rather than by something being wrong with the model, and that very complex and large datasets need enough computational\npower, such as a GPU, to make the training meaningful. I believe I could have tried to improve and tune the random forest model more, and it is possible it could do as well as the gradient boosting method; however, I mostly focused on gradient boosting since it gave\nme the best results when I was solely using my local PC's CPU computation. 
I also learned that deep learning models don't work as well as I thought they would on tabular data and are better suited to unstructured data, which is something I will consider when choosing models in the future.\n\n[1] Olist. (n.d.). Brazilian E-Commerce Public Dataset by Olist [Data set]. Kaggle. https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce\n\nClaude 3.7 helped with answering some questions I had, commenting the code and troubleshooting.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famidug%2Fbrazilpipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famidug%2Fbrazilpipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famidug%2Fbrazilpipeline/lists"}