
Commit 32e81bd

Initial commit
0 parents  commit 32e81bd

File tree

14 files changed, +1013 -0 lines changed


.flake8

Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
#########################
# Flake8 Configuration  #
# (.flake8)             #
#########################
[flake8]
ignore =
    # asserts are ok when testing.
    S101
    # pickle
    S301
    # pickle
    S403
    S404
    S603
    # Line break before binary operator (flake8 is wrong)
    W503
    # Ignore the spaces black puts before columns.
    E203
    # allow path extensions for testing.
    E402
    DAR101
    DAR201
    # flake and pylance disagree on linebreaks in strings.
    N400
exclude =
    .tox,
    .git,
    __pycache__,
    docs/source/conf.py,
    build,
    dist,
    tests/fixtures/*,
    *.pyc,
    *.bib,
    *.egg-info,
    .cache,
    .eggs,
    data.
max-line-length = 120
max-complexity = 20
import-order-style = pycharm
application-import-names =
    seleqt
    tests

.github/workflows/test.yml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
name: Tests

on: [ push, pull_request ]

jobs:
  tests:
    name: Tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        python-version: [3.11.0]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Test with pytest
        run:
          nox -s test
  lint:
    name: Lint
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.11.0]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Run flake8
        run: nox -s lint
      - name: Run mypy
        run: nox -s typing

.gitignore

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
.vscode/
.pytest_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

README.md

Lines changed: 126 additions & 0 deletions
@@ -0,0 +1,126 @@
# Dimensionality Reduction Exercise

In this exercise, we will take a closer look at the mechanics of Principal Component Analysis (PCA). We will explore how PCA can reduce the complexity of our data and understand the practical benefits of this dimensionality reduction technique. Our first goal is to project our high-dimensional data onto a more compact feature space. We will then visualize how much of the original information is retained even with this reduced set of features. This insight will serve as a basis for preprocessing the data that was used in our previous Support Vector Classification (SVC) exercise. We will observe the impact of this dimensionality reduction on our subsequent SVC training.

### Task 1: Principal Component Analysis

In this task, we will implement all the necessary steps to perform a PCA and visualize how much of the original information content of an image remains after the image features are projected into a lower-dimensional space. To achieve this, we treat each row of the input image as an individual data sample, with the features given by the RGB values in each column. We then apply PCA to these samples to obtain the principal components, project each sample onto the first _k_ principal components, and then back into the original space. The example image we use has _531 rows x 800 columns x 3 color values_, resulting in 531 samples with 2400 features each.

Navigate to `src/ex1_pca.py` and have a look at the `__main__` function (a minimal sketch of these first steps follows the list below):

1. Create an empty directory called ``output`` (Hint: `os.makedirs`).
2. Load the `statue.jpg` image from `data/images/` using ``imageio.imread``, plot it and save it into the ``output`` directory as ``original.png`` using `imsave` from `skimage.io`. (Hint: For `skimage.io` to correctly interpret your image, you should cast it to the uint8 dtype, for example by using ``your_array.astype(np.uint8)``.)

> Note: You can also use ``plt.imshow()`` followed by ``plt.savefig(your_path)`` as a simple way to save the image. Do not forget to ``plt.close()`` your plot afterwards, because we will export a fair amount of images in this exercise.

3. Reshape the image array into a 2D array of shape $(n, m)$, where $n$ = `num_rows` and $m$ = `num_columns * num_channels`, such that each row of this new array represents all pixel values of the corresponding image row.
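
A minimal sketch of these first steps, assuming the paths given above (variable names are illustrative; the actual template may organize this differently):

```python
import os

import imageio
import numpy as np
from skimage.io import imsave

# Step 1: create the output directory.
os.makedirs("output", exist_ok=True)

# Step 2: load the image and save the original as uint8.
img = np.asarray(imageio.imread("data/images/statue.jpg"))  # shape (531, 800, 3)
imsave("output/original.png", img.astype(np.uint8))

# Step 3: one sample per image row -> 531 samples with 800 * 3 = 2400 features.
rows, cols, channels = img.shape
data_2d = img.reshape(rows, cols * channels).astype(np.float64)
```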

Now we will implement the functions that perform a PCA transform and an inverse transform on our 2D array. First, implement the function `pca_transform` (a minimal sketch follows the list):

4. Compute the mean vector over the features of the input matrix. The resulting mean vector should have the shape $(1, m)$.
5. Center the data by subtracting the mean from the 2D image array.
6. Compute the covariance matrix of the centered data. (Hint: `numpy.cov`; set `rowvar=False` in order to compute the covariances over the features.)
7. Perform the eigendecomposition of the covariance matrix. (Hint: `numpy.linalg.eigh`)
8. Sort the eigenvalues in descending order and the eigenvectors by their descending eigenvalues.
9. Return the sorted eigenvalues, eigenvectors, centered data and the mean vector.
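
One way `pca_transform` could be written, sketched under the assumption that it receives the 2D data array and returns exactly the four values named in step 9 (the template's actual signature may differ):

```python
import numpy as np

def pca_transform(data: np.ndarray):
    """Return sorted eigenvalues, eigenvectors, centered data and the mean vector."""
    mean = data.mean(axis=0, keepdims=True)      # step 4: shape (1, m)
    centered = data - mean                       # step 5: center the data
    cov = np.cov(centered, rowvar=False)         # step 6: covariances over features
    eig_vals, eig_vecs = np.linalg.eigh(cov)     # step 7: eigh returns ascending eigenvalues
    order = np.argsort(eig_vals)[::-1]           # step 8: descending order
    return eig_vals[order], eig_vecs[:, order], centered, mean
```

Note that `numpy.linalg.eigh` returns the eigenvectors as columns, so the sorting reorders columns rather than rows.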

Next, implement the function `pca_inverse_transform`, which reconstructs the data using the top `n_comp` principal components, following these steps (a sketch follows the list):

10. Select the first `n_comp` components from the given eigenvectors.
11. Project the centered data onto the space defined by the selected eigenvectors by multiplying both matrices, giving us the reduced data.
12. Reconstruct the data by projecting it back to the original space, i.e. by multiplying the reduced data with the transposed selected eigenvectors. Don't forget to add the mean vector afterwards.
13. Return the reconstructed data.
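
A matching sketch of `pca_inverse_transform`, reusing the outputs of `pca_transform` above (again, the signature in the actual template may differ):

```python
import numpy as np

def pca_inverse_transform(centered: np.ndarray, eig_vecs: np.ndarray,
                          mean: np.ndarray, n_comp: int) -> np.ndarray:
    components = eig_vecs[:, :n_comp]     # step 10: first n_comp eigenvectors
    reduced = centered @ components       # step 11: project onto the reduced space
    return reduced @ components.T + mean  # steps 12-13: reconstruct and re-add the mean
```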

Now, before returning to the `__main__` function, we also want to calculate the explained variance associated with our principal components. For that, implement the `expl_var` function following these steps (a sketch follows the list):

14. Calculate the total variance by summing up all the eigenvalues.
15. Compute the cumulative explained variance by summing the first `n_comp` eigenvalues.
16. Determine the cumulative explained variance ratio by dividing the cumulative explained variance by the total variance. Return the result.
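
A small sketch of `expl_var`, assuming it receives the sorted eigenvalues and the number of components:

```python
import numpy as np

def expl_var(eig_vals: np.ndarray, n_comp: int) -> float:
    total_var = np.sum(eig_vals)         # step 14: total variance
    cum_var = np.sum(eig_vals[:n_comp])  # step 15: variance of the first n_comp components
    return cum_var / total_var           # step 16: cumulative explained variance ratio
```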

Go back to the `__main__` function and implement the following TODOs (a condensed sketch of this loop follows the list):

17. Loop through a range of all possible values for the number of components. It is sufficient to use a step size of 10 to speed up the process. To monitor the progress of the loop, you can create a progress bar using [the very handy Python package tqdm](https://github.com/tqdm/tqdm).

17.1. Perform PCA using the previously implemented `pca_transform` function.
17.2. Apply the `pca_inverse_transform` function to project the image to a lower-dimensional space using the current number of components and reconstruct the image from this reduced representation.
17.3. Bring the resulting array back into the original image shape and save it in the ``output`` folder as an image called ``pca_k.png``, where _k_ is replaced with the number of components used to create the image.

> Note: You should again cast the image back to the uint8 dtype.

17.4. Compute the cumulative explained variance ratio for the current number of components using the `expl_var` function and store it in a list for later plotting.
17.5. We would also like to quantify how closely our created image resembles the original one. Use ``skimage.metrics.structural_similarity`` to compute a perceptual similarity score (SSIM) between the original and the reconstructed image and also store it in another list for later plotting. As we deal with RGB images, you have to pass `channel_axis=2` to the SSIM function.

18. Plot the cumulative explained variances of each principal component against the number of components.
19. Plot the SSIM values against the number of components. If you like a small matplotlib challenge, you can also try to add this curve with a second scale to the first plot (you can find an example of how to do this [in the matplotlib gallery](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/two_scales.html)).
20. Look through the images you generated and find the one with the smallest _k_ that you would deem indistinguishable from the original. Compare this to both the explained variance and SSIM curves.
21. Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
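
A condensed sketch of this loop, reusing the names from the sketches above (file names and plot details are illustrative; here we loop up to `min(n_samples, n_features)`, beyond which no additional informative components exist):

```python
import matplotlib.pyplot as plt
import numpy as np
from skimage.io import imsave
from skimage.metrics import structural_similarity
from tqdm import tqdm

ks = range(1, min(data_2d.shape) + 1, 10)  # step 17: candidate component counts, step size 10
variances, ssims = [], []
for k in tqdm(ks):
    # Step 17.1 (the decomposition itself does not depend on k).
    eig_vals, eig_vecs, centered, mean = pca_transform(data_2d)
    # Steps 17.2-17.3: reduce, reconstruct, reshape and save as uint8.
    recon = pca_inverse_transform(centered, eig_vecs, mean, k).reshape(img.shape)
    recon_u8 = np.clip(recon, 0, 255).astype(np.uint8)
    imsave(f"output/pca_{k}.png", recon_u8)
    # Steps 17.4-17.5: explained variance ratio and SSIM.
    variances.append(expl_var(eig_vals, k))
    ssims.append(structural_similarity(img, recon_u8, channel_axis=2))

# Steps 18-19: plot both curves against the number of components.
plt.plot(list(ks), variances, label="cumulative explained variance ratio")
plt.plot(list(ks), ssims, label="SSIM")
plt.xlabel("number of components")
plt.legend()
plt.savefig("output/curves.png")
plt.close()
```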

### Task 2: PCA as Pre-processing

We have seen that a significantly reduced feature dimensionality is often sufficient to represent our data effectively, especially in the case of image data. Building upon this insight, we will now revisit our Support Vector Classification (SVC) task from Day 06, but this time preprocess our data with a PCA. Again, we will use the [Labeled Faces in the Wild Dataset](http://vis-www.cs.umass.edu/lfw/).

We start in the `__main__` function (a rough sketch of these first steps follows the list below).

1. Load the dataset from ``sklearn.datasets.fetch_lfw_people`` in the same way as for Task 2 of Day 06 and get access to the data.
2. Split the data 80:20 into training and test data. Use `random_state=42` in the split function.
3. Use the `StandardScaler` from `sklearn.preprocessing`: fit it on the train set and scale both the train and the test set.
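
A rough sketch of these first steps; the exact `fetch_lfw_people` keyword values are an assumption here and should match your Day 06 solution:

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: load the dataset (keyword values assumed, reuse your Day 06 settings).
lfw = fetch_lfw_people(min_faces_per_person=70)

# Step 2: 80:20 split with a fixed random state.
x_train, x_test, y_train, y_test = train_test_split(
    lfw.data, lfw.target, test_size=0.2, random_state=42
)

# Step 3: fit the scaler on the training set only, then scale both sets.
scaler = StandardScaler().fit(x_train)
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)
```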

Our goal now is to determine the minimum number of principal components needed to capture at least 90% of the variance in the data. First, implement the `explained_var` function (a sketch follows the list):

4. Create an ``sklearn.decomposition.PCA`` instance and fit it to the data samples. Set `random_state=42` and `whiten=True` to normalize the components to unit variance.
5. Plot the cumulative explained variance ratios of each principal component against the number of components using the ``explained_variance_ratio_`` property of the ``PCA`` instance. Note that you have to sum up these ratios to get cumulative values (e.g. using ``np.cumsum``).
6. Return the array of cumulative explained variance ratios.

7. Return to the `__main__` function and use the `explained_var` function to calculate the minimum number of components needed to capture 90% of the variance. Print this number.
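
A possible sketch of `explained_var` and of step 7, continuing with the names from the sketch above (the output path is illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

def explained_var(data: np.ndarray) -> np.ndarray:
    pca = PCA(random_state=42, whiten=True).fit(data)     # step 4
    cum_ratio = np.cumsum(pca.explained_variance_ratio_)  # step 5
    plt.plot(range(1, len(cum_ratio) + 1), cum_ratio)
    plt.xlabel("number of components")
    plt.ylabel("cumulative explained variance ratio")
    plt.savefig("output/lfw_explained_variance.png")
    plt.close()
    return cum_ratio                                      # step 6

# Step 7: smallest number of components reaching 90% of the variance.
cum_ratio = explained_var(x_train_s)
n_90 = int(np.argmax(cum_ratio >= 0.9)) + 1
print(f"Components needed for 90% of the variance: {n_90}")
```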

Implement the `pca_train` function to train a model on the preprocessed data (a compressed sketch of steps 8-16 follows the list):

8. Create a ``PCA`` instance and fit it to the data samples, extracting the given number of components. Set `random_state=42` and `whiten=True`.
9. Project the input data onto the orthonormal basis using `PCA.transform`, resulting in a new dataset in which each sample is represented by the given number of top principal components.
10. Call the `train_fun` function, which is passed as an argument, to train a model on the transformed PCA features.
11. The function should return a tuple containing two elements: the PCA decomposition object and the trained model.

12. Import or paste your `cv_svm` function from Task 2 of Day 06 above the code. Use it together with the computed number of required components to call `pca_train` in the `__main__` function. This will allow us to train the model with the reduced feature set. Use the `time` function from the `time` module to measure and print the duration of this process for evaluation.

13. To evaluate the model on the test set, we need to perform the same transform on the test data as we did on the training data. Use the `PCA.transform` method of your PCA decomposition object to do this.
14. Now we can compute and print the accuracy of our trained model on the test set.

15. In order to compare this model with the one without the PCA preprocessing, apply the function `cv_svm` to the original training set and measure the time.

16. Compute and print the accuracy of this trained model on the test set.

17. (Optional) You can use the `plot_image_matrix` function from `src/util_pca.py` to plot the top 12 eigenfaces.

18. (Optional) Furthermore, you can use the `plot_roc` function from `src/util_pca.py` to plot the ROC curves of both models.

19. Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.
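
A compressed sketch of `pca_train` and of the comparison in steps 12-16; `cv_svm` is assumed to come from your Day 06 solution and to return a fitted classifier with a `predict` method:

```python
import time

from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score

def pca_train(data, labels, n_comp, train_fun):
    pca = PCA(n_components=n_comp, random_state=42, whiten=True).fit(data)  # step 8
    reduced = pca.transform(data)       # step 9: top n_comp components per sample
    model = train_fun(reduced, labels)  # step 10: e.g. cv_svm
    return pca, model                   # step 11

# Steps 12-14: train on PCA features, time it, evaluate on the transformed test set.
start = time.time()
pca, model = pca_train(x_train_s, y_train, n_90, cv_svm)
print(f"Training with PCA took {time.time() - start:.1f} s")
print("Accuracy with PCA:", accuracy_score(y_test, model.predict(pca.transform(x_test_s))))

# Steps 15-16: the same training without the PCA preprocessing.
start = time.time()
model_full = cv_svm(x_train_s, y_train)
print(f"Training without PCA took {time.time() - start:.1f} s")
print("Accuracy without PCA:", accuracy_score(y_test, model_full.predict(x_test_s)))
```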


#### (Optional) Grid Search and Nested CV for the Best Number of Principal Components

We have seen how PCA can improve both the runtime and the results of our training. You can practice your coding skills by implementing nested cross-validation. While we already have the inner cross-validation for hyperparameter tuning, we can also employ an outer cross-validation to determine the optimal number of principal components to include in our analysis.

Implement the `gs_pca` function, which uses a grid-search approach to determine the most suitable number of PCA components for feature dimensionality reduction before the training (a sketch follows the list):

20. Define the outer k-fold cross-validation strategy with 5 folds using `KFold` from `sklearn.model_selection`.
21. Next, initialize the variables that keep track of the best mean accuracy score and the corresponding number of PCA components found so far.
22. Iterate through the specified list of PCA component values.

22.1. Create an outer 5-fold cross-validation loop, iterating through the 5 splits while obtaining the training and testing indices for each split.
22.1.1. Generate the current training and testing sets from the given data based on these indices.
22.1.2. Scale the generated data by fitting the `StandardScaler` on the training set and transforming both the training and test sets.
22.1.3. Instantiate a PCA object with the same parameters as before and transform the training data.
22.1.4. Now is the time to call our function `cv_svm` and perform the inner cross-validation to tune hyperparameters. To save you time, we have determined that the following parameters consistently yield the best results: C=10 and kernel='rbf'. Therefore, you can skip the inner cross-validation step and instead create and train your classifier with these predefined parameters.
22.1.5. Predict the labels on the test data and compute the accuracy score for each fold.

22.2. Calculate the mean accuracy score across the folds.
22.3. If the mean accuracy score for the current number of PCA components is higher than the best score seen so far, update the best score and the best number of components.
23. The function should return the number of PCA components that yielded the highest mean accuracy score during the grid search. This represents the optimal number of components for feature dimensionality reduction.
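
A possible sketch of `gs_pca`, following steps 20-23 and using the fixed parameters from step 22.1.4 instead of the inner cross-validation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def gs_pca(data: np.ndarray, labels: np.ndarray, component_list) -> int:
    kfold = KFold(n_splits=5)            # step 20: outer 5-fold CV
    best_score, best_n = -1.0, None      # step 21
    for n_comp in component_list:        # step 22
        fold_scores = []
        for train_idx, test_idx in kfold.split(data):                               # step 22.1
            x_tr, x_te = data[train_idx], data[test_idx]                             # step 22.1.1
            y_tr, y_te = labels[train_idx], labels[test_idx]
            scaler = StandardScaler().fit(x_tr)                                      # step 22.1.2
            x_tr, x_te = scaler.transform(x_tr), scaler.transform(x_te)
            pca = PCA(n_components=n_comp, random_state=42, whiten=True).fit(x_tr)   # step 22.1.3
            clf = SVC(C=10, kernel="rbf").fit(pca.transform(x_tr), y_tr)             # step 22.1.4
            fold_scores.append(clf.score(pca.transform(x_te), y_te))                 # step 22.1.5
        mean_score = float(np.mean(fold_scores))   # step 22.2
        if mean_score > best_score:                # step 22.3
            best_score, best_n = mean_score, n_comp
    return best_n                                  # step 23
```

Because the scaling happens inside each fold, `gs_pca` would be called on the unscaled training data, e.g. with the candidate list `[n_90 - 10, n_90 - 5, n_90, n_90 + 5, n_90 + 10]` from step 24.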

Go back to the `__main__` function.

24. Generate the list of parameters for the grid search, consisting of the following numbers: `[c-10, c-5, c, c+5, c+10]`, where $c$ is the number determined in step 7.
25. Use the `gs_pca` function to determine the best number of components and print it.
26. Repeat steps 12-14 with this best number and compare the new accuracy.
27. Test your code with the test framework of VS Code or by typing `nox -r -s test` in your terminal.

data/images/statue.jpg

35.7 KB