A full machine learning pipeline and interactive dashboard to analyze and forecast Walmart sales. Combines model experimentation with an intuitive Streamlit app for exploring weekly sales trends and predictions.
- 📌 Project Overview
- 📊 Dataset
- 🔄 Pipeline & Workflow
- 📈 Insights
- 💻 Usage Examples
- 🤝 Contributing
- 📬 Contact
This project analyzes Walmart’s sales data and forecasts future weekly sales using machine learning. It includes:
- A Jupyter Notebook that trains multiple models (Random Forest, XGBoost, LightGBM, ...) using a custom time-series pipeline.
- A Streamlit dashboard for exploring the sales data and comparing predictions interactively, with filters for store, department, and time range, plus rankings from best to worst by total sales, $ growth, and % growth.
- Model Training: Historical sales and market data from the [Walmart Recruiting – Store Sales Forecasting](https://www.kaggle.com/competitions/walmart-recruiting-store-sales-forecasting/overview) Kaggle competition, supplemented with yfinance, pandas-datareader, and akshare. Columns for holidays, special events, and tax-return periods use a ramp-up/ramp-down encoding to represent how their importance builds before and fades after each date (see the sketch after the table below).
- Dashboard: Pre-processed CSV file `df_wm_store_sales_predictions.csv`, which contains weekly sales and predictions.
| Data Type | Source | Description |
|---|---|---|
| Stock Data | yfinance / pandas-datareader | Weekly Walmart stock prices and economic indicators |
| Processed Dataset | Local CSV (`df_wm_store_sales_predictions.csv`) | Sales & predictions used in the dashboard |
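For illustration, the external data pull and the ramped holiday encoding could look roughly like the sketch below. The date window, the `holiday_ramp` helper, its three-week ramp width, and the column names are assumptions for this example, not the notebook's exact code.

```python
# Sketch: fetch weekly WMT prices and build a ramped holiday feature.
# Dates, column names, and the ramp window are illustrative assumptions.
import pandas as pd
import yfinance as yf

# Weekly Walmart stock prices; squeeze() handles yfinance versions that
# return multi-level columns.
wmt = yf.download("WMT", start="2010-01-01", end="2013-01-01", interval="1wk")
wmt_close = wmt["Close"].squeeze().rename("wmt_close")

def holiday_ramp(index: pd.DatetimeIndex, holiday: str, weeks: int = 3) -> pd.Series:
    """Ramp a holiday's importance up before the date and down after it,
    instead of a single 0/1 spike on the holiday week."""
    weeks_away = (index - pd.Timestamp(holiday)).days.to_numpy() / 7.0
    ramp = 1.0 - abs(weeks_away) / weeks   # 1.0 on the holiday week, 0 beyond `weeks`
    return pd.Series(ramp.clip(min=0.0), index=index, name=f"ramp_{holiday}")

dates = pd.date_range("2011-11-01", "2012-01-15", freq="W-FRI")
print(holiday_ramp(dates, "2011-11-25", weeks=3).head(8))
```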
- Data Fetching – Collect WMT stock and macroeconomic indicators
- Feature Engineering – Create time-aware features and lags
- Time-Series CV – `TimeSeriesSplit` with per-fold performance tracking
- Modeling – Train Random Forest, XGBoost, LightGBM, ... (see the sketch after this list)
- Interpretation – SHAP values and permutation importance (not completed due to limited compute power and time constraints)
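A minimal sketch of the cross-validation and modeling steps, assuming a feature table with lag columns and a weekly sales target. The synthetic data, column names, and hyperparameters are placeholders rather than the notebook's actual setup.

```python
# Sketch: time-series cross-validation with a Random Forest on placeholder data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Placeholder feature table standing in for the engineered weekly features.
rng = np.random.default_rng(0)
n_weeks = 200
X = pd.DataFrame({
    "sales_lag_1": rng.normal(size=n_weeks),
    "sales_lag_52": rng.normal(size=n_weeks),
    "wmt_close": rng.normal(size=n_weeks),
    "holiday_ramp": rng.uniform(size=n_weeks),
})
y = 2.0 * X["sales_lag_1"] + X["holiday_ramp"] + rng.normal(scale=0.1, size=n_weeks)

# Expanding-window splits: each fold trains on the past and tests on the next block.
tscv = TimeSeriesSplit(n_splits=5)
scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X.iloc[train_idx], y.iloc[train_idx])
    mae = mean_absolute_error(y.iloc[test_idx], model.predict(X.iloc[test_idx]))
    scores.append(mae)
    print(f"Fold {fold}: MAE = {mae:.3f}")

print(f"Mean MAE across folds: {np.mean(scores):.3f}")
```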
- `app.py`: Main controller for layout, interaction, and routing
- `data_loader.py`: Loads cached data using `@st.cache_data`
- `filters.py`: Applies store/department/date filters
- `metrics.py`: Calculates KPIs (sales totals, growth, date ranges)
- `ui_components.py`: Charts, grids, headers, KPIs, footers
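As a rough illustration, the cached loader in `data_loader.py` might look like the following. The CSV filename comes from the repo, but the `Date` column and the sorting step are assumptions for this sketch.

```python
# Sketch of a cached loader for the pre-processed predictions CSV.
import pandas as pd
import streamlit as st

@st.cache_data
def load_sales_data(path: str = "df_wm_store_sales_predictions.csv") -> pd.DataFrame:
    """Load weekly sales and predictions once per session and cache the result."""
    df = pd.read_csv(path, parse_dates=["Date"])  # assumes a 'Date' column
    return df.sort_values("Date")
```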
- The Random Forest model accurately captures general sales trends.
- SHAP analysis, once run, would highlight the features with the most impact on predictions.
- Dynamic visual tools make it easy to identify underperforming stores or departments.
- 🔧 Hyperparameter Tuning: Optimize the Random Forest for better accuracy (ideally with more department-level data and daily or hourly sales instead of weekly)
- 📊 More Visuals: Add SHAP force plots
- Notebook: run all cells. To run on Kaggle, delete the `"""` in the first cell. Note the custom `%%skip` cell magic: it lets "Run All" execute every cell except those that start with `%%skip` (a possible definition is sketched after this list).
- Streamlit app: run `streamlit.bat` found in the main directory.
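The `%%skip` magic is a custom cell magic rather than a built-in. One possible definition, run inside the notebook before the marked cells, is sketched below; the notebook's actual implementation may differ.

```python
# Sketch: a %%skip cell magic that ignores the cell body, so "Run All"
# passes over any cell that starts with %%skip. Must run in an IPython/Jupyter session.
from IPython.core.magic import register_cell_magic

@register_cell_magic
def skip(line, cell):
    """Discard the cell body instead of executing it."""
    return None
```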
- Fork the repo
- Create a feature branch (`git checkout -b feature/YourFeature`)
- Commit your changes (`git commit -m "Add new analysis"`)
- Push (`git push origin feature/YourFeature`)
- Open a Pull Request
- Email: [email protected]
- GitHub: github.com/JorgeMMLRodrigues
Feel free to open issues or discussions!



