GEO Dataset: GSE10810
Conducted during and after the ABCON 2025 Workshop
Analysis completed on: October 8, 2025
Full details in analysis_report.md
This project performs a detailed, step-by-step machine learning analysis of breast cancer gene expression data. The workflow includes data preprocessing, normalization, feature selection, model training, evaluation, and hyperparameter tuning to identify potential biomarker genes and build predictive classifiers.
The pipeline is designed for reproducibility and highlights genes that distinguish breast cancer tissue from normal breast tissue.
- Overview
 - Objectives
 - Dataset Summary
 - Key Analysis Steps
 - Biological Insights
 - Project Structure
 - Highlighted Outputs
 - Interactive HTML Reports
 - Selected Results (Preview)
 - Reproducibility
 - Acknowledgment
 - Author Contribution
 - Citation & Usage
 - License
 
- Preprocess and normalize breast cancer gene expression datasets.
 - Identify top informative genes for classification.
 - Build and evaluate machine learning models (Random Forest, SVM) for cancer vs. normal tissue prediction.
 - Perform feature selection and identify potential biomarker genes.
 
| Feature | Description | 
|---|---|
| Organism | Homo sapiens | 
| Samples | 58 breast tissue samples (31 cancer, 27 normal) | 
| Data Type | Gene expression (microarray, cleaned and normalized) | 
| Platform | Affymetrix Human Genome U133 Plus 2.0 Array | 
| GEO Accession | GSE10810 | 
| Publication | Pedraza V. et al., 2010; DOI: 10.1002/cncr.24805 | 
All implemented in modular Jupyter Notebooks and exported to HTML with figures and outputs.
| Notebook | Description | 
|---|---|
| 0 | Load required Python libraries and confirm environment setup | 
| 1 | Data exploration & cleaning: statistical summaries, visualization, PCA | 
| 2 | Preprocessing & normalization: IQR-based filtering, MinMax scaling, PCA visualization | 
| 3 | Feature selection: Mutual Information-based selection of top 50 genes | 
| 4 | Model building & evaluation: Random Forest and SVM training, confusion matrices, top marker genes identification | 
| 5 | Cross-validation & hyperparameter tuning: GridSearchCV with 5-fold CV for optimized model parameters | 
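As a rough illustration of steps 2 and 3, the sketch below applies an IQR-based gene filter, MinMax scaling, and mutual-information ranking to a synthetic matrix. Only the technique names, the sample counts, and the top-50 cutoff come from this README. The IQR threshold (keep the upper half of genes by IQR), the synthetic data, and the helper names `filter_by_iqr` and `mi_score` are assumptions made for illustration, not the notebooks' actual code.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix: 58 samples x 500 genes.
n_samples, n_genes = 58, 500
X = rng.normal(size=(n_samples, n_genes))
y = np.array([1] * 31 + [0] * 27)  # 31 cancer, 27 normal, as in the dataset summary
X[y == 1, :10] += 2.0  # make the first 10 genes informative (illustrative only)


def filter_by_iqr(X, quantile=0.5):
    """Keep genes whose IQR is at or above the given quantile of all gene IQRs.

    The quantile threshold is an assumption; the notebooks may use another rule.
    """
    q75, q25 = np.percentile(X, [75, 25], axis=0)
    iqr = q75 - q25
    keep = iqr >= np.quantile(iqr, quantile)
    return X[:, keep], keep


def mi_score(X, y):
    """Mutual information between each gene and the class label (fixed seed)."""
    return mutual_info_classif(X, y, random_state=0)


X_filtered, kept_mask = filter_by_iqr(X)

# Scale each remaining gene to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_filtered)

# Rank genes by mutual information and keep the top 50.
selector = SelectKBest(score_func=mi_score, k=50)
X_selected = selector.fit_transform(X_scaled, y)

print(X_filtered.shape, X_selected.shape)
```

In a real analysis the scaler and selector would typically be fit on the training split only, so that no information from the test samples leaks into feature selection.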
- Identified top 50 informative genes that distinguish breast cancer from normal tissue.
 - Random Forest and SVM classifiers achieved 94% accuracy on the test set.
 - Top marker genes include CD300LG, ANLN, CRTAP, MOCS1, PTCH1, which are known to be involved in cancer biology.
 - Cross-validation and hyperparameter tuning confirmed model stability and high predictive performance.
 - Results provide potential biomarker candidates for further biological validation.
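The model-building step can be sketched roughly as follows. This is a minimal example on synthetic data, not the code from Notebook 4: the 70/30 split, the hyperparameters, and the use of `make_classification` are all assumptions, so the scores it prints have no relation to the 94% figure reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the 58 samples x 50 selected genes.
X, y = make_classification(
    n_samples=58, n_features=50, n_informative=10, random_state=0
)

# Stratified split keeps the class ratio similar in both sets (split size is assumed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "SVM": SVC(kernel="linear", random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
    print(name, round(results[name]["accuracy"], 3))
    print(results[name]["confusion_matrix"])

# Rank features by Random Forest importance and keep the 20 highest.
importances = models["Random Forest"].feature_importances_
top_20_idx = importances.argsort()[::-1][:20]
print(top_20_idx)
```

With a test set this small, a single misclassified sample shifts accuracy by several percentage points, which is one reason the workflow also includes cross-validation.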
 
Machine-Learning-Based-Analysis-Of-Gene-Expression-In-Breast-Cancer
│
├── data                       # Input datasets
│   └── GSE10810_Expression_Matrix_cleaned.xls
│
├── scripts                    # Jupyter Notebooks for each analysis step
│   ├── Notebook_0_Load_Libraries.ipynb
│   ├── Notebook_1_Data_Exploration_and_Cleaning.ipynb
│   ├── Notebook_2_Preprocessing_Normalization.ipynb
│   ├── Notebook_3_Feature_Selection.ipynb
│   ├── Notebook_4_Model_Building_Evaluation.ipynb
│   └── Notebook_5_CrossValidation_Hyperparameter_Tuning.ipynb
│
├── results                    # Processed results and outputs
│   ├── X_test_selected.pkl
│   ├── X_train_selected.pkl
│   ├── y_test.pkl
│   ├── y_train.pkl
│   ├── best_RF_model.pkl
│   ├── best_SVM_model.pkl
│   ├── selected_gene_names.pkl
│   ├── selected_genes_top50.xls
│   ├── top_20_marker_genes.xls
│   ├── cv_model_performance_summary.xls
│   ├── data_cleaned_with_labels.xls
│   ├── data_normalized_minmax.xls
│   └── pca_normalized_coordinates.xls
│
├── figures                    # Visual outputs and plots
│   ├── boxplot_comparison.png
│   ├── histogram_distribution.png
│   ├── correlation_heatmap.png
│   ├── pca_plot.png
│   ├── class_distribution_pie.png
│   ├── histogram_distribution_normalized.png
│   ├── pca_plot_normalized.png
│   ├── feature_scores_barplot.png
│   ├── confusion_matrices.png
│   ├── top_20_marker_genes.png
│   └── model_comparison_cv.png
│
├── docs                       # HTML exports of notebooks
│   ├── Notebook_0_Load_Libraries.html
│   ├── Notebook_1_Data_Exploration_and_Cleaning.html
│   ├── Notebook_2_Preprocessing_Normalization.html
│   ├── Notebook_3_Feature_Selection.html
│   ├── Notebook_4_Model_Building_Evaluation.html
│   └── Notebook_5_CrossValidation_Hyperparameter_Tuning.html
│
├── analysis_report.md          # Full explanation of analysis steps and results
└── README.md                   # Project summary and guidance
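The `.pkl` files under `results/` can presumably be reloaded for inspection. The sketch below assumes they were written with Python's standard `pickle` module, which this README does not confirm (they may have been saved with `joblib` or pandas instead). The helper name `load_artifacts` is hypothetical. The demo writes a small file to a temporary directory so the example runs without the repository.

```python
import pickle
import tempfile
from pathlib import Path


def load_artifacts(results_dir, names):
    """Load each named .pkl file from results_dir, skipping files that are missing."""
    loaded = {}
    for name in names:
        path = Path(results_dir) / f"{name}.pkl"
        if path.exists():
            with path.open("rb") as fh:
                loaded[name] = pickle.load(fh)
    return loaded


# Demo against a temporary directory rather than the real results/ folder.
with tempfile.TemporaryDirectory() as tmp:
    with (Path(tmp) / "selected_gene_names.pkl").open("wb") as fh:
        pickle.dump(["CD300LG", "ANLN"], fh)
    demo = load_artifacts(tmp, ["selected_gene_names", "best_RF_model"])

print(demo)
```

Against the real repository the call would look like `load_artifacts("results", ["X_train_selected", "y_train", "best_RF_model"])`. Loading a pickled model requires a compatible scikit-learn version, and pickle files should only be opened when they come from a trusted source.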
| Output Type | File | 
|---|---|
| Cleaned Expression Data | GSE10810_Expression_Matrix_cleaned.csv | 
| Filtered Genes (IQR-based) | data_filtered_iqr.xls | 
| Normalized Expression | data_normalized_minmax.xls | 
| PCA Coordinates | pca_normalized_coordinates.xls | 
| Selected Genes | selected_genes_top50.csv | 
| Top 20 Marker Genes | top_20_marker_genes.csv | 
| Model Performance Summary | cv_model_performance_summary.csv | 
This project includes interactive HTML versions of all Jupyter notebooks, allowing easy exploration of the analysis workflow and outputs.
- Each HTML report contains formatted text, tables, figures, and code outputs for reproducibility.
 - When viewed directly on GitHub, the `.html` files may appear as raw HTML code.
 - To view formatted reports, open the links below or download the files and open them in your local browser.
 
| Step | Notebook | HTML File | 
|---|---|---|
| 00 | Load Libraries | Notebook_0_Load_Libraries.html | 
| 01 | Data Exploration and Cleaning | Notebook_1_Data_Exploration_and_Cleaning.html | 
| 02 | Preprocessing and Normalization | Notebook_2_Preprocessing_Normalization.html | 
| 03 | Feature Selection | Notebook_3_Feature_Selection.html | 
| 04 | Model Building and Evaluation | Notebook_4_Model_Building_Evaluation.html | 
| 05 | Cross-Validation and Hyperparameter Tuning | Notebook_5_CrossValidation_Hyperparameter_Tuning.html | 
Use these HTML reports to explore the analysis interactively and review detailed results.
Understanding the balance between tumor and normal samples.
Before and after preprocessing, illustrating data normalization effects.
Figure panels: Raw Data, Normalized Data.
Shows pairwise relationships among top genes.
Visualizes clustering patterns before and after normalization.
Figure panels: PCA (Raw Data), PCA (Normalized).
Displays the most informative genes selected for classification.
Highlights gene features contributing most to model prediction.
Evaluates model performance on training and test sets.
Comparative performance of classifiers (Accuracy, F1-score, AUC) with k-fold validation.
Figure: Model Comparison.
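A comparison of this kind (accuracy, F1-score, and AUC under k-fold validation) could be computed along the following lines. This is a sketch on synthetic data with assumed model settings and an assumed `k=5`, not the code that produced the figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC

# Synthetic stand-in for the selected-gene matrix.
X, y = make_classification(
    n_samples=58, n_features=50, n_informative=10, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVM": SVC(kernel="linear", random_state=0),
}

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate returns one array of per-fold scores per metric.
    summary[name] = {m: scores[f"test_{m}"].mean() for m in scoring}

for name, metrics in summary.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```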
> These visualizations summarize the **machine learning workflow**, from raw data exploration to model validation, providing a clear overview of feature importance, performance, and reproducibility.
All analytical steps were performed in six modular Jupyter notebooks, each focusing on a specific phase of the pipeline. They can be run sequentially or individually to inspect intermediate results, figures, and trained models.
| Step | Notebook | Description | 
|---|---|---|
| 00 | Notebook_0_Load_Libraries.ipynb | Import dependencies and initialize the environment | 
| 01 | Notebook_1_Data_Exploration_and_Cleaning.ipynb | Explore raw data, handle missing values, and detect outliers | 
| 02 | Notebook_2_Preprocessing_Normalization.ipynb | Perform feature scaling, normalization, and encoding | 
| 03 | Notebook_3_Feature_Selection.ipynb | Select the top 50 most informative genes using Mutual Information scores | 
| 04 | Notebook_4_Model_Building_Evaluation.ipynb | Train ML classifiers (Random Forest, SVM) and evaluate model performance | 
| 05 | Notebook_5_CrossValidation_Hyperparameter_Tuning.ipynb | Apply k-fold cross-validation and tune hyperparameters for best accuracy | 
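The tuning step in Notebook 5 uses `GridSearchCV` with 5-fold cross-validation. A minimal sketch of that pattern is shown here. The parameter grids, the synthetic data, and the accuracy scoring are assumptions chosen to keep the example small; the grids actually searched are documented in the notebook itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X, y = make_classification(
    n_samples=58, n_features=50, n_informative=10, random_state=0
)

# Hypothetical search spaces, kept deliberately small.
searches = {
    "SVM": (SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "Random Forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

best = {}
for name, (estimator, grid) in searches.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    best[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds the model refit on all supplied data with the best parameters, which is the kind of object that could be saved as `best_SVM_model.pkl` or `best_RF_model.pkl`.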
To export any notebook as HTML for interactive viewing:
!jupyter nbconvert --to html --embed-images "Notebook_4_Model_Building_Evaluation.ipynb" --output "Notebook_4_Model_Building_Evaluation.html"
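To export all six notebooks in one pass, the same command can be generated per notebook from Python. The sketch below only builds and prints the commands. The helper name `nbconvert_command` and the choice to write into `docs/` through `--output-dir` are assumptions based on the project structure above.

```python
import shlex
from pathlib import Path

NOTEBOOKS = [
    "Notebook_0_Load_Libraries.ipynb",
    "Notebook_1_Data_Exploration_and_Cleaning.ipynb",
    "Notebook_2_Preprocessing_Normalization.ipynb",
    "Notebook_3_Feature_Selection.ipynb",
    "Notebook_4_Model_Building_Evaluation.ipynb",
    "Notebook_5_CrossValidation_Hyperparameter_Tuning.ipynb",
]


def nbconvert_command(notebook, output_dir="docs"):
    """Build the nbconvert argument list for one notebook."""
    return [
        "jupyter", "nbconvert",
        "--to", "html",
        "--embed-images",
        str(notebook),
        "--output-dir", str(output_dir),
    ]


commands = [nbconvert_command(Path("scripts") / nb) for nb in NOTEBOOKS]
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the export, each list can be passed to `subprocess.run(cmd, check=True)` from the repository root, provided Jupyter and nbconvert are installed.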
All output tables are stored in the `results/` directory, and visualizations are available in the `figures/` folder; both are also embedded within the HTML reports for easy interpretation.
This project was developed as part of the ABCON 2025 Workshop, during and after the session "Machine Learning in Biomedical Research: From Data to Diagnosis".
We gratefully acknowledge the invaluable guidance and instruction provided by:
- Dr. Eman Badr, Associate Professor, Director of the Computational Biology and Bioinformatics Unit, Zewail City of Science and Technology
 - Ms. Shrooq Badwy, Research and Teaching Assistant, Bioinformatics Center, Helwan University in Cairo
 - Ms. Manar Samir, M.Sc. Candidate, Computational and Bioinformatics Lab, Zewail City of Science and Technology
 
Their session and original analysis notebook provided the foundation for this project. The original single Jupyter Notebook was restructured, expanded, and refined by the participant into a complete multi-step workflow.
It was divided into six modular notebooks, each dedicated to a specific analytical task, from data preprocessing and feature engineering to model training, evaluation, and visualization. Additional figures, metrics, and validation steps were incorporated to enhance the scientific depth, clarity, and reproducibility of the analysis.
The entire workflow was then independently executed, documented, and organized into a reproducible folder structure with scripts, figures, and HTML reports, and finally published on GitHub for open access and future reuse.
All analytical steps, from data preprocessing and feature selection to model training, evaluation, and visualization, were independently executed by:
Mohamed H. Hussein, M.Sc. Candidate in Biochemistry and Molecular Biology, focusing on Molecular Cancer Biology & Bioinformatics, Ain Shams University, Faculty of Science
The original single analysis notebook was restructured, modularized, and extended into a complete workflow composed of 6 Jupyter Notebooks.
Each notebook focuses on a distinct stage of the machine learning analysis pipeline, including:
- Environment setup and library loading: initializing dependencies and preparing the analysis workspace.
 - Data exploration and cleaning: inspecting dataset structure, handling missing values, and detecting outliers.
 - Preprocessing and normalization: scaling, transforming, and encoding features for modeling readiness.
 - Feature selection: identifying informative genes and reducing data dimensionality.
 - Model building and evaluation: training machine learning models (SVM, Random Forest) and assessing their performance.
 - Cross-validation and hyperparameter tuning: optimizing model accuracy and robustness through systematic parameter search.
 
All notebooks collectively form a complete, reproducible workflow from raw data exploration to optimized model deployment.
All outputs, visualizations, and metrics were generated, interpreted, and documented by the author in a fully reproducible folder structure, designed to promote transparency, reproducibility, and learning for future research use.
This project is open-source and provided for educational and academic purposes.
If you reuse, adapt, or build upon this work, please cite:
- The original GEO dataset: GSE10810
 - The ABCON 2025 Workshop session titled "Machine Learning in Biomedical Research: From Data to Diagnosis"
 - The author and repository to acknowledge the analysis contributions:
 
Hussein, Mohamed H. (2025). Machine Learning-Based Analysis of Gene Expression Profiles in Breast Cancer [Data analysis workflow]. GitHub repository. https://github.com/Mohamed-H-Hussein/Machine-Learning-Based-Analysis-of-Gene-Expression-Profiles-in-Breast-Cancer
Proper citation supports transparency, credit to contributors, and reproducible science.
This repository is licensed under the MIT License.
See the full license details: https://opensource.org/licenses/MIT
© 2025 Mohamed H. Hussein. The software is provided "as is" without warranty of any kind.









