From raw flight data → EDA → Feature Engineering → Model Training → Evaluation → Deployment-ready pipeline.
Airline ticket prices fluctuate based on:
- Airline brand
- Number of stops
- Flight duration
- Time of departure
- Seasonal demand
The objective is to build a robust regression model that predicts flight ticket prices with high accuracy.
The dataset includes:
- Airline
- Source
- Destination
- Date of Journey
- Duration
- Total Stops
- Additional Info
- Price (Target)
- ✈️ Non-stop flights generally cost more.
- 🕒 Longer duration flights are often cheaper.
- 📅 Month & season significantly impact pricing.
- 🏷 Premium airlines maintain higher base fares.
- 🌙 Early departures can influence ticket cost.
Engineered features include:
- Journey Month
- Journey Day
- Departure Hour
- Arrival Hour
- Duration in Minutes
- Weekend Indicator
- Peak Season Flag
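The engineered features above can be sketched with pandas. This is a minimal illustration, not the project's actual code: the column names (`Date_of_Journey`, `Dep_Time`, `Arrival_Time`, `Duration`), the string formats, and the `PEAK_MONTHS` set are all assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical peak months; the real definition of "peak season" is not given.
PEAK_MONTHS = {5, 6, 12}

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar, clock, and duration features from raw string columns."""
    out = df.copy()
    # Assumed format: day/month/year, e.g. "24/03/2019".
    journey = pd.to_datetime(out["Date_of_Journey"], format="%d/%m/%Y")
    out["journey_month"] = journey.dt.month
    out["journey_day"] = journey.dt.day
    # pandas numbers Monday as 0, so Saturday and Sunday are 5 and 6.
    out["is_weekend"] = (journey.dt.dayofweek >= 5).astype(int)
    # Assumed format: 24-hour "HH:MM" strings.
    out["dep_hour"] = pd.to_datetime(out["Dep_Time"], format="%H:%M").dt.hour
    out["arrival_hour"] = pd.to_datetime(out["Arrival_Time"], format="%H:%M").dt.hour
    # "2h 50m" -> 170; a missing hours or minutes part counts as 0.
    parts = out["Duration"].str.extract(r"(?:(\d+)\s*h)?\s*(?:(\d+)\s*m)?")
    parts = parts.fillna("0").astype(int)
    out["duration_minutes"] = parts[0] * 60 + parts[1]
    out["is_peak_season"] = out["journey_month"].isin(PEAK_MONTHS).astype(int)
    return out

# Two made-up rows to exercise the function.
sample = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019"],
    "Dep_Time": ["22:20", "05:50"],
    "Arrival_Time": ["01:10", "13:15"],
    "Duration": ["2h 50m", "19h"],
})
features = add_time_features(sample)
```

Keeping the raw columns and adding new ones makes it easy to inspect each derived value against its source during EDA.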
Categorical Encoding:
- One-Hot Encoding (Nominal)
- Ordinal Encoding (Stops)
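One way to combine the two encodings is a scikit-learn `ColumnTransformer`. The sketch below assumes column names (`Airline`, `Source`, `Destination`, `Total_Stops`) and stop labels (`"non-stop"`, `"1 stop"`, ...) that may differ from the real dataset; the rows are made up.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Assumed label spelling; the explicit order makes "more stops" a larger number.
STOP_ORDER = ["non-stop", "1 stop", "2 stops", "3 stops", "4 stops"]
NOMINAL_COLUMNS = ["Airline", "Source", "Destination"]

encoder = ColumnTransformer(
    transformers=[
        # Nominal columns have no natural order, so each category gets its own column.
        ("nominal", OneHotEncoder(handle_unknown="ignore"), NOMINAL_COLUMNS),
        # Stops are ordered, so a single integer-valued column is enough.
        ("stops", OrdinalEncoder(categories=[STOP_ORDER]), ["Total_Stops"]),
    ],
    sparse_threshold=0.0,  # always return a dense array, easier to inspect
)

# Made-up rows for illustration only.
flights = pd.DataFrame({
    "Airline": ["Airline A", "Airline B", "Airline A"],
    "Source": ["City X", "City Y", "City X"],
    "Destination": ["City P", "City Q", "City P"],
    "Total_Stops": ["non-stop", "2 stops", "1 stop"],
})
encoded = encoder.fit_transform(flights)
```

With two distinct values in each nominal column, `encoded` has six one-hot columns followed by one ordinal column for stops. `handle_unknown="ignore"` encodes categories unseen during fitting as all zeros instead of raising an error.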
Outlier Handling:
- IQR-based filtering
- Log transformation on price (optional)
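A minimal sketch of both outlier steps, assuming the target column is named `Price` and using the conventional 1.5 × IQR fences (the multiplier actually used in the project is not stated):

```python
import numpy as np
import pandas as pd

def filter_iqr(df: pd.DataFrame, column: str = "Price", k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    keep = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]

# Made-up prices with one extreme value.
prices = pd.DataFrame({"Price": [3000, 4000, 5000, 6000, 7000, 80000]})
filtered = filter_iqr(prices)

# Optional: model log(1 + price) and invert predictions with expm1.
log_price = np.log1p(filtered["Price"])
restored = np.expm1(log_price)
```

Here the quartiles are 4250 and 6750, so the fences are 500 and 10500 and only the 80000 row is dropped. If the model is trained on `log_price`, its predictions need `np.expm1` before they are reported or scored in the original currency units.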
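Implemented using sklearn Pipeline:
```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

Pipeline([
    ('preprocessing', ColumnTransformer(...)),
    ('model', XGBRegressor())
])
```
Pipeline handles: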
- Missing values
- Encoding
- Scaling (if needed)
- Model training
- Cross-validation
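The responsibilities listed above could be wired together roughly as follows. This is a sketch under assumptions: the column names, the imputation strategies, and the synthetic data are invented for the example, and `RandomForestRegressor` is used as the default model so the snippet runs with scikit-learn alone. An `XGBRegressor()` instance can be passed to `build_pipeline` instead.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names for the example.
CATEGORICAL = ["Airline", "Source"]
NUMERIC = ["duration_minutes", "total_stops"]

def build_pipeline(model=None) -> Pipeline:
    """Chain imputation, encoding, and a regressor into one estimator."""
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessing = ColumnTransformer([
        ("cat", categorical, CATEGORICAL),
        ("num", SimpleImputer(strategy="median"), NUMERIC),
    ])
    if model is None:
        # Stand-in model; tree ensembles do not need feature scaling.
        model = RandomForestRegressor(n_estimators=50, random_state=0)
    return Pipeline([("preprocessing", preprocessing), ("model", model)])

# Synthetic data, generated only to show the pipeline running end to end.
rng = np.random.default_rng(0)
n = 60
duration = rng.integers(60, 600, size=n).astype(float)
stops = rng.integers(0, 3, size=n)
y = 3000 + 20 * duration + 1500 * stops + rng.normal(0, 200, size=n)
duration[::10] = np.nan  # inject missing values for the imputer to fill
X = pd.DataFrame({
    "Airline": rng.choice(["Airline A", "Airline B"], size=n),
    "Source": rng.choice(["City X", "City Y"], size=n),
    "duration_minutes": duration,
    "total_stops": stops,
})

pipeline = build_pipeline()
# Preprocessing is refit inside each fold, so no statistics leak from held-out rows.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="r2")
predictions = pipeline.fit(X, y).predict(X.head(3))
```

Because the preprocessing lives inside the pipeline, the same fitted object can be saved and reused for prediction without repeating any manual transformation steps.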
| Model | R² Score | RMSE (relative) |
|---|---|---|
| Linear Regression | 0.62 | Medium |
| Random Forest | 0.83 | Low |
| XGBoost | 0.88 | Lowest |
Best model: XGBoost Regressor
- Train/Test Split (80/20)
- 5-Fold Cross Validation
- Hyperparameter tuning via GridSearchCV
- Early stopping (for boosting models)
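The training steps above can be sketched with scikit-learn. To keep the example free of extra dependencies, `GradientBoostingRegressor` stands in for XGBoost here, and its `n_iter_no_change` option plays the role of early stopping. The data is synthetic and the parameter grid is an arbitrary illustration, not the grid used in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data in place of the flight dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 80/20 split; the test set is held back until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=200,
    n_iter_no_change=10,      # stop adding trees when validation loss stalls
    validation_fraction=0.1,  # share of training rows used for that check
    random_state=0,
)

# Illustrative grid; every combination is scored with 5-fold cross-validation.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}
search = GridSearchCV(
    model, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
search.fit(X_train, y_train)

best_params = search.best_params_
trees_used = search.best_estimator_.n_estimators_
test_rmse = -search.score(X_test, y_test)  # flip sign: scorer returns negative RMSE
```

Tuning only on the training portion and scoring once on the held-out 20% keeps the reported test error from being influenced by the hyperparameter search.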
Metrics Used:
- R² Score
- RMSE
- MAE
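For reference, the three metrics can be computed directly from their definitions. The prices below are made up for the example and are not results from this project.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Return R², RMSE, and MAE for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    ss_res = np.sum(residuals ** 2)                 # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(residuals ** 2))),
        "mae": float(np.mean(np.abs(residuals))),
    }

# Made-up actual and predicted prices.
metrics = regression_metrics(
    y_true=[4000, 6000, 8000, 10000],
    y_pred=[4500, 5500, 8000, 11000],
)
```

In this example MAE is 500 while RMSE is about 612, because squaring gives the single 1000-unit miss more weight. Reporting both shows whether errors are evenly spread or dominated by a few large misses.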
Error Analysis:
- Slight underprediction for premium airlines
- Higher variance for rare routes
- Model generalizes well across most routes
Project organized for scalability:
```
├── data/
├── notebooks/
├── src/
│   ├── preprocessing.py
│   ├── train.py
│   ├── evaluate.py
│   └── predict.py
├── models/
├── api/
│   └── app.py
└── README.md
```
- REST API (FastAPI)
- Dockerized deployment
- Streamlit dashboard
- CI/CD integration
- MLflow experiment tracking
- Model monitoring
Tech stack:
- Python
- Pandas
- NumPy
- Seaborn / Matplotlib
- Scikit-learn
- XGBoost