Machine learning workflow for analyzing breast cancer gene expression data, including preprocessing, feature selection, model training, and performance evaluation in reproducible Jupyter notebooks.

Mohamed-H-Hussein/Machine-Learning-Based-Analysis-Of-Gene-Expression-In-Breast-Cancer

Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer

GEO Dataset: GSE10810
👨‍🔬 Conducted during and after the ABCON 2025 Workshop
📅 Analysis completed on: October 8, 2025
Full details in analysis_report.md


📚 Overview

This project performs a detailed, step-by-step machine learning analysis of breast cancer gene expression data. The workflow includes data preprocessing, normalization, feature selection, model training, evaluation, and hyperparameter tuning to identify potential biomarker genes and build predictive classifiers.

The pipeline is designed to be reproducible and highlights genes that distinguish breast cancer tissue from normal breast tissue.

🎯 Objectives

  • Preprocess and normalize breast cancer gene expression datasets.
  • Identify top informative genes for classification.
  • Build and evaluate machine learning models (Random Forest, SVM) for cancer vs. normal tissue prediction.
  • Perform feature selection and identify potential biomarker genes.

🧪 Dataset Summary

| Feature | Description |
|---|---|
| Organism | Homo sapiens |
| Samples | 58 breast tissue samples (31 cancer, 27 normal) |
| Data Type | Gene expression (microarray, cleaned and normalized) |
| Platform | Affymetrix Human Genome U133 Plus 2.0 Array |
| GEO Accession | GSE10810 |
| Publication | Pedraza V. et al., 2010 • DOI: 10.1002/cncr.24805 |

🧠 Key Analysis Steps

All steps are implemented in modular Jupyter Notebooks and exported to HTML with figures and outputs.

| Notebook | Description |
|---|---|
| 0 | Load required Python libraries and confirm environment setup |
| 1 | Data exploration & cleaning: statistical summaries, visualization, PCA |
| 2 | Preprocessing & normalization: IQR-based filtering, MinMax scaling, PCA visualization |
| 3 | Feature selection: Mutual Information-based selection of top 50 genes |
| 4 | Model building & evaluation: Random Forest and SVM training, confusion matrices, top marker genes identification |
| 5 | Cross-validation & hyperparameter tuning: GridSearchCV with 5-fold CV for optimized model parameters |
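The model building and evaluation step (Notebook 4) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the notebook's actual code: the 70/30 stratified split, the linear SVM kernel, the number of trees, and the simulated expression values are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for a samples x selected-genes matrix (NOT GSE10810 values).
rng = np.random.default_rng(0)
y = np.array([1] * 31 + [0] * 27)   # 1 = cancer, 0 = normal
X = rng.normal(size=(58, 50))
X[y == 1, :10] += 1.5               # hypothetical informative genes

# Stratified hold-out split keeps the class ratio similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="linear"),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, round(results[name]["accuracy"], 3))
```

With only 58 samples, a single hold-out estimate is noisy, which is one reason the workflow follows this step with cross-validation.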

📈 Biological Insights

  • Identified the top 50 informative genes that distinguish breast cancer from normal tissue.
  • Random Forest and SVM classifiers achieved 94% accuracy on the test set.
  • Top marker genes include CD300LG, ANLN, CRTAP, MOCS1, PTCH1, which are known to be involved in cancer biology.
  • Cross-validation and hyperparameter tuning confirmed model stability and high predictive performance.
  • Results provide potential biomarker candidates for further biological validation.
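As a rough illustration of how a top-50 gene list can be obtained with mutual information, the sketch below uses scikit-learn's `SelectKBest` with `mutual_info_classif`. The data are synthetic and the gene names (`GENE_0`, ...) are hypothetical placeholders; the notebook's exact settings are not reproduced here.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic expression matrix: 58 samples x 500 genes (NOT the real dataset).
rng = np.random.default_rng(1)
y = np.array([1] * 31 + [0] * 27)
X = rng.normal(size=(58, 500))
X[y == 1, :10] += 3.0  # first 10 genes are made informative on purpose
gene_names = np.array([f"GENE_{i}" for i in range(500)])  # hypothetical names

# Fix the estimator's random_state so the ranking is repeatable.
score_func = partial(mutual_info_classif, random_state=1)
selector = SelectKBest(score_func=score_func, k=50)
X_selected = selector.fit_transform(X, y)

selected_genes = gene_names[selector.get_support()]
print(X_selected.shape, selected_genes[:5])
```

In a real analysis, fitting the selector on the training split only (or inside a cross-validation pipeline) avoids leaking test-set information into the gene list.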

📂 Project Structure


Machine-Learning-Based-Analysis-Of-Gene-Expression-In-Breast-Cancer
│
├── data                       # Input datasets
│   └── GSE10810_Expression_Matrix_cleaned.xls
│
├── scripts                    # Jupyter Notebooks for each analysis step
│   ├── Notebook_0_Load_Libraries.ipynb
│   ├── Notebook_1_Data_Exploration_and_Cleaning.ipynb
│   ├── Notebook_2_Preprocessing_Normalization.ipynb
│   ├── Notebook_3_Feature_Selection.ipynb
│   ├── Notebook_4_Model_Building_Evaluation.ipynb
│   └── Notebook_5_CrossValidation_Hyperparameter_Tuning.ipynb
│
├── results                    # Processed results and outputs
│   ├── X_test_selected.pkl
│   ├── X_train_selected.pkl
│   ├── y_test.pkl
│   ├── y_train.pkl
│   ├── best_RF_model.pkl
│   ├── best_SVM_model.pkl
│   ├── selected_gene_names.pkl
│   ├── selected_genes_top50.xls
│   ├── top_20_marker_genes.xls
│   ├── cv_model_performance_summary.xls
│   ├── data_cleaned_with_labels.xls
│   ├── data_normalized_minmax.xls
│   └── pca_normalized_coordinates.xls
│
├── figures                    # Visual outputs and plots
│   ├── boxplot_comparison.png
│   ├── histogram_distribution.png
│   ├── correlation_heatmap.png
│   ├── pca_plot.png
│   ├── class_distribution_pie.png
│   ├── histogram_distribution_normalized.png
│   ├── pca_plot_normalized.png
│   ├── feature_scores_barplot.png
│   ├── confusion_matrices.png
│   ├── top_20_marker_genes.png
│   └── model_comparison_cv.png
│
├── docs                       # HTML exports of notebooks
│   ├── Notebook_0_Load_Libraries.html
│   ├── Notebook_1_Data_Exploration_and_Cleaning.html
│   ├── Notebook_2_Preprocessing_Normalization.html
│   ├── Notebook_3_Feature_Selection.html
│   ├── Notebook_4_Model_Building_Evaluation.html
│   └── Notebook_5_CrossValidation_Hyperparameter_Tuning.html
│
├── analysis_report.md          # Full explanation of analysis steps and results
└── README.md                   # Project summary and guidance


📌 Highlighted Outputs

| Output Type | File |
|---|---|
| Cleaned Expression Data | GSE10810_Expression_Matrix_cleaned.csv |
| Filtered Genes (IQR-based) | data_filtered_iqr.xls |
| Normalized Expression | data_normalized_minmax.xls |
| PCA Coordinates | pca_normalized_coordinates.xls |
| Selected Genes | selected_genes_top50.csv |
| Top 20 Marker Genes | top_20_marker_genes.csv |
| Model Performance Summary | cv_model_performance_summary.csv |

📒 Interactive HTML Reports

This project includes interactive HTML versions of all Jupyter notebooks, allowing easy exploration of the analysis workflow and outputs.

  • Each HTML report contains formatted text, tables, figures, and code outputs for reproducibility.
  • When viewed directly in GitHub, the .html files may appear as raw HTML code.
  • To view formatted reports, open the links below or download and open them in your local browser.

📎 View live HTML reports here:

👉 https://mohamed-h-hussein.github.io/Machine-Learning-Based-Analysis-Of-Gene-Expression-In-Breast-Cancer/

Available HTML Reports:

| Step | Notebook | HTML File |
|---|---|---|
| 00 | Load Libraries | Notebook_0_Load_Libraries.html |
| 01 | Data Exploration and Cleaning | Notebook_1_Data_Exploration_and_Cleaning.html |
| 02 | Preprocessing and Normalization | Notebook_2_Preprocessing_Normalization.html |
| 03 | Feature Selection | Notebook_3_Feature_Selection.html |
| 04 | Model Building and Evaluation | Notebook_4_Model_Building_Evaluation.html |
| 05 | Cross-Validation and Hyperparameter Tuning | Notebook_5_CrossValidation_Hyperparameter_Tuning.html |

Use these HTML reports to explore the analysis interactively and review detailed results.


🖼️ Selected Results (Preview)

1️⃣ Class Distribution in the Dataset

Understanding the balance between tumor and normal samples.

[Figure: Class Distribution Pie]
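The class balance also gives a useful reference point for reading accuracy figures. The short calculation below uses only the sample counts from the Dataset Summary (31 cancer, 27 normal) to compute the class proportions and the accuracy of a trivial classifier that always predicts the majority class.

```python
from collections import Counter

# Sample counts taken from the Dataset Summary table.
labels = ["cancer"] * 31 + ["normal"] * 27

counts = Counter(labels)
total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the majority class is the floor
# any trained model should beat.
majority_label, majority_n = counts.most_common(1)[0]
baseline_accuracy = majority_n / total

print(majority_label, f"{baseline_accuracy:.3f}")  # cancer 0.534
```

The two classes are close to balanced, so plain accuracy is a reasonable headline metric here, although F1 and AUC remain informative.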


2️⃣ Gene Expression Distribution (Raw vs Normalized)

Before and after preprocessing, illustrating the effect of normalization.

[Figures: Histogram Raw (raw data) and Histogram Normalized (normalized data)]
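The preprocessing behind these histograms (IQR-based filtering followed by MinMax scaling) can be sketched as below. This is an illustrative version on synthetic data: the `keep_fraction` parameter, the helper names, and the choice to keep the highest-IQR genes are assumptions, since the notebook's exact threshold is not stated here.

```python
import numpy as np

def iqr_filter(X, keep_fraction=0.5):
    """Keep the genes (columns) with the largest interquartile range."""
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    # Indices of the n_keep most variable genes, returned in column order.
    keep_idx = np.sort(np.argsort(iqr)[::-1][:n_keep])
    return X[:, keep_idx], keep_idx

def minmax_scale(X):
    """Rescale each gene to [0, 1]; constant genes map to 0."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant genes
    return (X - lo) / span

# Synthetic matrix: 58 samples x 200 genes (NOT the real expression values).
rng = np.random.default_rng(0)
X = rng.normal(loc=8.0, scale=1.0, size=(58, 200))
X[:, :20] += rng.normal(scale=3.0, size=(58, 20))  # hypothetical variable genes

X_filtered, kept = iqr_filter(X, keep_fraction=0.25)
X_scaled = minmax_scale(X_filtered)
print(X_filtered.shape, X_scaled.min(), X_scaled.max())
```

When the scaled data feed a classifier, computing the minimum and maximum on the training split only is the safer choice, because scaling on all samples lets test-set values influence the training features.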

3️⃣ Correlation Heatmap of Gene Features

Shows pairwise relationships among top genes.

[Figure: Correlation Heatmap]


4️⃣ PCA Visualization

Visualizes clustering patterns before and after normalization.

[Figures: PCA Plot (raw data) and PCA Normalized (normalized data)]
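For readers who want to see what goes into such a plot, the sketch below computes two-dimensional PCA coordinates with a plain NumPy SVD. The data are synthetic and the function name `pca_coordinates` is a hypothetical helper; `sklearn.decomposition.PCA` yields equivalent coordinates up to the sign of each component.

```python
import numpy as np

def pca_coordinates(X, n_components=2):
    """Project samples onto the leading principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(X_centered, full_matrices=False)
    coords = U[:, :n_components] * S[:n_components]
    # Squared singular values are proportional to explained variance.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return coords, explained_ratio[:n_components]

# Synthetic data: 58 samples x 100 genes with a class shift in 30 genes.
rng = np.random.default_rng(7)
y = np.array([1] * 31 + [0] * 27)
X = rng.normal(size=(58, 100))
X[y == 1, :30] += 2.0

coords, explained = pca_coordinates(X)
# Distance between class centroids along the first component.
pc1_gap = abs(coords[y == 1, 0].mean() - coords[y == 0, 0].mean())
print(coords.shape, explained.round(3), round(float(pc1_gap), 2))
```

Plotting `coords[:, 0]` against `coords[:, 1]`, colored by `y`, gives a scatter of the same kind as the figures referenced above.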

5️⃣ Top 20 Marker Genes

Displays the most informative genes selected for classification.

[Figure: Top 20 Marker Genes]


6️⃣ Feature Scores (Machine Learning Importance)

Highlights gene features contributing most to model prediction.

[Figure: Feature Scores Barplot]
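One common way to rank genes by their contribution to a model is the impurity-based importance of a Random Forest. The sketch below assumes that approach for illustration only; whether the plotted scores come from Random Forest importances or from the mutual information step is not stated here, and the data and gene names are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 58 samples x 50 genes (NOT the real selected genes).
rng = np.random.default_rng(42)
n_samples, n_genes = 58, 50
y = np.array([1] * 31 + [0] * 27)
X = rng.normal(size=(n_samples, n_genes))
X[y == 1, 0] += 3.0  # make hypothetical GENE_0 strongly informative
X[y == 1, 1] += 3.0  # and GENE_1
gene_names = [f"GENE_{i}" for i in range(n_genes)]

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X, y)

# Sort genes from most to least important and keep the top 20.
order = np.argsort(rf.feature_importances_)[::-1]
top20 = [(gene_names[i], float(rf.feature_importances_[i])) for i in order[:20]]

for name, score in top20[:5]:
    print(f"{name}\t{score:.3f}")
```

Impurity-based importances can be unstable when many genes are correlated, so permutation importance on held-out data is a reasonable cross-check.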


7️⃣ Confusion Matrices

Evaluates model performance on training and test sets.

[Figure: Confusion Matrices]
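To make the link between a confusion matrix and the reported metrics concrete, here is a small dependency-free example. The labels and predictions are made up for illustration (12 hypothetical samples, 1 = cancer, 0 = normal) and are not the project's results.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

def summarize(tp, fp, fn, tn):
    """Derive accuracy, precision, recall and F1 from the four counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical labels: one false negative and one false positive.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
metrics = summarize(tp, fp, fn, tn)
print((tp, fp, fn, tn), {k: round(v, 3) for k, v in metrics.items()})
```

In this made-up case all four metrics equal 10/12 or 5/6 (about 0.833), because the single false positive and single false negative are symmetric.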


8️⃣ Model Performance and Cross-Validation Results

Comparative performance of classifiers (Accuracy, F1-score, AUC) with k-fold validation.

[Figure: Model Comparison]
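A comparison of this kind (accuracy, F1-score and AUC under k-fold validation) can be produced with scikit-learn's `cross_validate`. The sketch below is an assumption-laden illustration: the data are synthetic, and the fold seed, kernel, and number of trees are arbitrary choices rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic data: 58 samples x 50 genes (NOT the real selected genes).
rng = np.random.default_rng(3)
y = np.array([1] * 31 + [0] * 27)
X = rng.normal(size=(58, 50))
X[y == 1, :10] += 1.5  # hypothetical informative genes

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="linear"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # Average each metric over the five folds.
    summary[name] = {m: float(scores[f"test_{m}"].mean()) for m in scoring}

for name, row in summary.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

Reporting the per-fold standard deviation alongside the mean is worthwhile with a dataset this small, since each fold holds only about 11 or 12 samples.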

> These visualizations summarize the **machine learning workflow**, from raw data exploration to model validation, providing a clear overview of feature importance, performance, and reproducibility.

Reproducible Analysis Workflow (Jupyter Notebooks)

All analytical steps were performed in six modular Jupyter notebooks, each focusing on a specific phase of the pipeline. They can be run sequentially or individually to inspect intermediate results, figures, and trained models.

| Step | Notebook | Description |
|---|---|---|
| 00 | Notebook_0_Load_Libraries.ipynb | Import dependencies and initialize the environment |
| 01 | Notebook_1_Data_Exploration_and_Cleaning.ipynb | Explore raw data, handle missing values, and detect outliers |
| 02 | Notebook_2_Preprocessing_Normalization.ipynb | Apply IQR-based gene filtering and MinMax scaling |
| 03 | Notebook_3_Feature_Selection.ipynb | Select the top 50 informative genes using mutual information |
| 04 | Notebook_4_Model_Building_Evaluation.ipynb | Train ML classifiers (Random Forest, SVM) and evaluate model performance |
| 05 | Notebook_5_CrossValidation_Hyperparameter_Tuning.ipynb | Apply 5-fold cross-validation and tune hyperparameters with GridSearchCV |
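The tuning step in Notebook 5 can be approximated as follows. This is a minimal sketch assuming an SVM wrapped in a `Pipeline` with `MinMaxScaler`, so that scaling is refit inside each fold; the parameter grid, the seed, and the synthetic data are illustrative assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data: 58 samples x 50 genes (NOT the real selected genes).
rng = np.random.default_rng(5)
y = np.array([1] * 31 + [0] * 27)
X = rng.normal(size=(58, 50))
X[y == 1, :10] += 1.5  # hypothetical informative genes

# Scaling lives inside the pipeline so each CV fold is scaled independently.
pipe = Pipeline([("scale", MinMaxScaler()), ("svm", SVC())])
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(float(search.best_score_), 3))
```

Because `best_score_` is selected as the maximum over the grid, it tends to be optimistic; a separate held-out test set or nested cross-validation gives a less biased estimate.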

To export any notebook as HTML for interactive viewing (run from a notebook cell; omit the leading ! in a terminal):

!jupyter nbconvert --to html --embed-images "Notebook_4_Model_Building_Evaluation.ipynb" --output "Notebook_4_Model_Building_Evaluation.html"

All output tables are stored in the results/ directory, and visualizations are available in the figures/ folder, both embedded within the HTML reports for easy interpretation.


🌟 Acknowledgment

This project was developed as part of the ABCON 2025 Workshop, during and after the session: "Machine Learning in Biomedical Research: From Data to Diagnosis".

We gratefully acknowledge the invaluable guidance and instruction provided by:

  • Dr. Eman Badr, Associate Professor, Director of the Computational Biology and Bioinformatics Unit, Zewail City of Science and Technology

  • Ms. Shrooq Badwy, Research and Teaching Assistant, Bioinformatics Center, Helwan University in Cairo

  • Ms. Manar Samir, M.Sc. Candidate, Computational and Bioinformatics Lab, Zewail City of Science and Technology

Their session and original analysis notebook provided the foundation for this project. The original single Jupyter Notebook was restructured, expanded, and refined by the participant into a complete multi-step workflow.

It was divided into six modular notebooks, each dedicated to a specific analytical task, from data preprocessing and feature engineering to model training, evaluation and visualization. Additional figures, metrics, and validation steps were incorporated to enhance the scientific depth, clarity and reproducibility of the analysis.

The entire workflow was then independently executed, documented, and organized into a reproducible folder structure with scripts, figures, and HTML reports, and finally published on GitHub for open access and future reuse.


🧑‍🔬 Author Contribution

All analytical steps, from data preprocessing and feature selection to model training, evaluation, and visualization, were independently executed by:

Mohamed H. Hussein, M.Sc. Candidate in Biochemistry and Molecular Biology, focusing on Molecular Cancer Biology & Bioinformatics, Ain Shams University, Faculty of Science

The original single analysis notebook was restructured, modularized, and extended into a complete workflow composed of 6 Jupyter Notebooks.

Each notebook focuses on a distinct stage of the machine learning analysis pipeline, including:

  1. Environment setup and library loading – initializing dependencies and preparing the analysis workspace.
  2. Data exploration and cleaning – inspecting dataset structure, handling missing values, and detecting outliers.
  3. Preprocessing and normalization – scaling, transforming, and encoding features for modeling readiness.
  4. Feature selection – identifying informative genes and reducing data dimensionality.
  5. Model building and evaluation – training machine learning models (SVM, Random Forest) and assessing their performance.
  6. Cross-validation and hyperparameter tuning – optimizing model accuracy and robustness through systematic parameter search.

All notebooks collectively form a complete, reproducible workflow from raw data exploration to optimized model deployment.

All outputs, visualizations, and metrics were generated, interpreted, and documented by the author in a fully reproducible folder structure, designed to promote transparency, reproducibility, and learning for future research use.


Citation & Usage

This project is open-source and provided for educational and academic purposes.

If you reuse, adapt, or build upon this work, please cite:

  • The original GEO dataset: GSE10810
  • The ABCON 2025 Workshop titled: "Machine Learning in Biomedical Research: From Data to Diagnosis"
  • The author and repository to acknowledge the analysis contributions:

Hussein, Mohamed H. (2025). Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer [Data analysis workflow]. GitHub repository. 🔗 https://github.com/Mohamed-H-Hussein/Machine-Learning-Based-Analysis-of-Gene-Expression-Profiles-in-Breast-Cancer

Proper citation supports transparency, credit to contributors and reproducible science.


📜 License

This repository is licensed under the MIT License.
See the full license details: https://opensource.org/licenses/MIT


© 2025 Mohamed H. Hussein. The software is provided "as is" without warranty of any kind.
