Commit e1c75bb

committed
deploy: 77bd09b
1 parent f032fa8 commit e1c75bb

31 files changed, with 2860 additions and 141 deletions
7.8 KB
Binary file not shown.
Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"\n# Benchmark of Radius Clustering using multiple datasets and comparison with custom MDS\n\nThis example demonstrates how to implement a custom solver for the minimum dominating set (MDS) problem\nand use it within the Radius Clustering framework.\nIt also compares a naive solver built on the\n`NetworkX` library with the solvers shipped with Radius Clustering.\n\nThe example includes:\n    1. Defining the custom MDS solver.\n    2. Defining datasets to test the clustering.\n    3. Applying Radius clustering on the datasets using the custom MDS solver.\n    4. Checking that this solution works.\n    5. Establishing a benchmark procedure to compare the Radius clustering with a naive implementation using `NetworkX`.\n    6. Comparing the results in terms of:\n        - Execution time\n        - Number of clusters found\n    7. Visualizing the benchmark results.\n    8. Visualizing the clustering results.\n\nThis example is useful for understanding how to implement a custom MDS solver\nand how to perform an advanced usage of the package.\n"
8+
]
9+
},
10+
{
11+
"cell_type": "code",
12+
"execution_count": null,
13+
"metadata": {
14+
"collapsed": false
15+
},
16+
"outputs": [],
17+
"source": [
18+
"# Author: Haenn Quentin\n# SPDX-License-Identifier: MIT"
19+
]
20+
},
21+
{
22+
"cell_type": "markdown",
23+
"metadata": {},
24+
"source": [
25+
"## Import necessary libraries\n\nSince this example is a benchmark, we need to import the necessary libraries\nto perform the benchmark, including `NetworkX` for the naive implementation,\n`matplotlib` for visualization, and `sklearn` for the datasets.\n\n"
26+
]
27+
},
28+
{
29+
"cell_type": "code",
30+
"execution_count": null,
31+
"metadata": {
32+
"collapsed": false
33+
},
34+
"outputs": [],
35+
"source": [
36+
"import networkx as nx\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport time\nimport warnings\n\nfrom sklearn.datasets import fetch_openml\nfrom radius_clustering import RadiusClustering\nfrom sklearn.metrics import pairwise_distances_argmin\n\nwarnings.filterwarnings(\"ignore\", category=RuntimeWarning, module=\"sklearn\")"
37+
]
38+
},
39+
{
40+
"cell_type": "markdown",
41+
"metadata": {},
42+
"source": [
43+
"## Define a custom MDS solver\n\nWe define a custom MDS solver that uses the `NetworkX` library to compute the MDS.\nNote the signature of the function is identical to the one used in the `RadiusClustering` class.\n\n"
44+
]
45+
},
46+
{
47+
"cell_type": "code",
48+
"execution_count": null,
49+
"metadata": {
50+
"collapsed": false
51+
},
52+
"outputs": [],
53+
"source": [
54+
"def custom_solver(n: int, edges: np.ndarray, nb_edges: int, random_state=None):\n    \"\"\"\n    Custom MDS solver using NetworkX to compute the MDS problem.\n\n    Parameters:\n    -----------\n    n : int\n        The number of points in the dataset.\n    edges : np.ndarray\n        The edges of the graph, flattened into a 1D array.\n    nb_edges : int\n        The number of edges in the graph.\n    random_state : int | None\n        The random state to use for reproducibility.\n\n    Returns:\n    --------\n    centers : list\n        A sorted list of the centers of the clusters.\n    mds_exec_time : float\n        The execution time of the MDS algorithm in seconds.\n    \"\"\"\n    G = nx.Graph()\n    G.add_nodes_from(range(n))  # keep isolated points in the graph\n    G.add_edges_from(np.asarray(edges).reshape(-1, 2))  # flat or (m, 2) input\n    start_time = time.time()\n    centers = list(nx.algorithms.dominating.dominating_set(G))\n    mds_exec_time = time.time() - start_time\n\n    centers = sorted(centers)\n\n    return centers, mds_exec_time"
55+
]
56+
},
57+
{
58+
"cell_type": "markdown",
59+
"metadata": {},
60+
"source": [
61+
"## Define datasets to test the clustering\n\nWe will use 8 datasets fetched from OpenML to test the clustering:\nIris, Wine, Glass, Ionosphere,\nBreast Cancer (WDBC), Synthetic Control,\nVehicle and Yeast.\n\nThese are common machine-learning datasets, small enough for the benchmark to run quickly.\n\nStructure of the variable `DATASETS`:\n- The key is the name of the dataset.\n- The value is a tuple containing:\n    - The dataset fetched from OpenML.\n    - The radius to use for the Radius clustering (determined in the literature; see the references on the home page).\n\n\n"
62+
]
63+
},
64+
{
65+
"cell_type": "code",
66+
"execution_count": null,
67+
"metadata": {
68+
"collapsed": false
69+
},
70+
"outputs": [],
71+
"source": [
72+
"DATASETS = {\n    \"iris\": (fetch_openml(name=\"iris\", version=1, as_frame=False), 1.43),\n    \"wine\": (fetch_openml(name=\"wine\", version=1, as_frame=False), 232.09),\n    \"glass\": (fetch_openml(name=\"glass\", version=1, as_frame=False), 3.94),\n    \"ionosphere\": (fetch_openml(name=\"ionosphere\", version=1, as_frame=False), 5.46),\n    \"breast_cancer\": (fetch_openml(name=\"wdbc\", version=1, as_frame=False), 1197.42),\n    \"synthetic\": (fetch_openml(name=\"synthetic_control\", version=1, as_frame=False), 70.12),\n    \"vehicle\": (fetch_openml(name=\"vehicle\", version=1, as_frame=False), 155.05),\n    \"yeast\": (fetch_openml(name=\"yeast\", version=1, as_frame=False), 0.4235),\n}"
73+
]
74+
},
75+
{
76+
"cell_type": "markdown",
77+
"metadata": {},
78+
"source": [
79+
"## Define the benchmark procedure\n\nWe define a function to perform the benchmark on the datasets.\nThe procedure is as follows:\n1. Create an instance of `RadiusClustering` for each solver.\n2. For each instance, fit the algorithm on each dataset.\n3. Store the execution time and the number of clusters found for each dataset.\n4. Return the results as a dictionary.\n\n"
80+
]
81+
},
82+
{
83+
"cell_type": "code",
84+
"execution_count": null,
85+
"metadata": {
86+
"collapsed": false
87+
},
88+
"outputs": [],
89+
"source": [
90+
"def benchmark_radius_clustering():\n    results = {}\n    exact = RadiusClustering(manner=\"exact\", radius=1.43)\n    approx = RadiusClustering(manner=\"approx\", radius=1.43)\n    custom = RadiusClustering(\n        manner=\"custom\", radius=1.43\n    )\n    custom.set_solver(custom_solver)  # Set the custom solver\n    algorithms = [exact, approx, custom]\n    # Loop through each algorithm and dataset\n    for algo in algorithms:\n        algo_results = {}\n        time_algo = []\n        clusters_algo = []\n        # Loop through each dataset\n        for name, (dataset, radius) in DATASETS.items():\n            X = dataset.data\n            # set the radius for the dataset considered\n            algo.radius = radius\n            # Fit the algorithm\n            t0 = time.time()\n            algo.fit(X)\n            t_algo = time.time() - t0\n\n            # Store the results\n            time_algo.append(t_algo)\n            clusters_algo.append(len(algo.centers_))\n        algo_results[\"time\"] = time_algo\n        algo_results[\"clusters\"] = clusters_algo\n        results[algo.manner] = algo_results\n\n    return results"
91+
]
92+
},
93+
{
94+
"cell_type": "markdown",
95+
"metadata": {},
96+
"source": [
97+
"## Run the benchmark and plot the results\nWe run the benchmark and plot the results for each dataset.\n\n"
98+
]
99+
},
100+
{
101+
"cell_type": "code",
102+
"execution_count": null,
103+
"metadata": {
104+
"collapsed": false
105+
},
106+
"outputs": [],
107+
"source": [
108+
"results = benchmark_radius_clustering()\n\n# Plot the results\nfig, axs = plt.subplot_mosaic(\n    [\n        [\"time\", \"time\", \"time\", \"time\"],\n        [\"iris\", \"wine\", \"breast_cancer\", \"vehicle\"],\n        [\"glass\", \"ionosphere\", \"synthetic\", \"yeast\"],\n    ],\n    layout=\"constrained\",\n    figsize=(12, 8),\n)\nfig.suptitle(\"Benchmark of Radius Clustering Solvers\", fontsize=16)\n\naxs['time'].set_yscale('log')  # Use logarithmic scale for better visibility\nfor algo, algo_results in results.items():\n    # Plot execution time\n    axs['time'].plot(\n        list(DATASETS),\n        algo_results[\"time\"],\n        marker='o',\n        label=algo,\n    )\n\n# Plot the number of clusters found for each dataset\nfor i, (name, (dataset, _)) in enumerate(DATASETS.items()):\n    axs[name].bar(\n        list(results),\n        [results[algo][\"clusters\"][i] for algo in results],\n        label=name,\n    )\n    axs[name].axhline(\n        y=len(set(dataset.target)),  # Number of unique classes in the dataset\n        label=\"True number of clusters\",\n        color='r',\n        linestyle='--',\n    )\n    axs[name].set_title(name)\n    axs[name].set_xlabel(\"Algorithms\")\n\naxs[\"iris\"].set_ylabel(\"Number of clusters\")\naxs[\"glass\"].set_ylabel(\"Number of clusters\")\n\naxs['time'].set_title(\"Execution Time (log scale)\")\naxs['time'].set_xlabel(\"Datasets\")\naxs['time'].set_ylabel(\"Time (seconds)\")\naxs['time'].legend(title=\"Algorithms\")\nplt.tight_layout()\nplt.show()"
109+
]
110+
},
111+
{
112+
"cell_type": "markdown",
113+
"metadata": {},
114+
"source": [
115+
"## Conclusion\n\nIn this example, we implemented a custom MDS solver based on `NetworkX` and plugged it into Radius Clustering.\nWe benchmarked it against the built-in exact and approximate solvers on eight OpenML datasets,\ncomparing the execution time and the number of clusters found by each solver.\nThe plots show how each solver's cluster count relates to the number of classes in each dataset.\n\n"
116+
]
117+
}
118+
],
119+
"metadata": {
120+
"kernelspec": {
121+
"display_name": "Python 3",
122+
"language": "python",
123+
"name": "python3"
124+
},
125+
"language_info": {
126+
"codemirror_mode": {
127+
"name": "ipython",
128+
"version": 3
129+
},
130+
"file_extension": ".py",
131+
"mimetype": "text/x-python",
132+
"name": "python",
133+
"nbconvert_exporter": "python",
134+
"pygments_lexer": "ipython3",
135+
"version": "3.12.3"
136+
}
137+
},
138+
"nbformat": 4,
139+
"nbformat_minor": 0
140+
}
9.92 KB
Binary file not shown.
17.7 KB
Binary file not shown.
0 Bytes
Binary file not shown.
Lines changed: 230 additions & 0 deletions
@@ -0,0 +1,230 @@
1+
"""
2+
=====================================================================================
3+
Benchmark of Radius Clustering using multiple datasets and comparison with custom MDS
4+
=====================================================================================
5+
6+
This example demonstrates how to implement a custom solver for the minimum dominating set (MDS) problem
7+
and use it within the Radius Clustering framework.
8+
It also compares a naive solver built on the
9+
`NetworkX` library with the solvers shipped with Radius Clustering.
10+
11+
The example includes:
12+
    1. Defining the custom MDS solver.
13+
    2. Defining datasets to test the clustering.
14+
    3. Applying Radius clustering on the datasets using the custom MDS solver.
15+
    4. Checking that this solution works.
16+
    5. Establishing a benchmark procedure to compare the Radius clustering with a naive implementation using `NetworkX`.
17+
    6. Comparing the results in terms of:
18+
        - Execution time
19+
        - Number of clusters found
20+
    7. Visualizing the benchmark results.
21+
    8. Visualizing the clustering results.
22+
23+
This example is useful for understanding how to implement a custom MDS solver
24+
and how to perform an advanced usage of the package.
25+
"""
26+
# Author: Haenn Quentin
27+
# SPDX-License-Identifier: MIT
28+
29+
# %%
30+
# Import necessary libraries
31+
# --------------------------
32+
#
33+
# Since this example is a benchmark, we need to import the necessary libraries
34+
# to perform the benchmark, including `NetworkX` for the naive implementation,
35+
# `matplotlib` for visualization, and `sklearn` for the datasets.
36+
37+
38+
import networkx as nx
39+
import numpy as np
40+
import matplotlib.pyplot as plt
41+
import time
42+
import warnings
43+
44+
from sklearn.datasets import fetch_openml
45+
from radius_clustering import RadiusClustering
46+
from sklearn.metrics import pairwise_distances_argmin
47+
48+
warnings.filterwarnings("ignore", category=RuntimeWarning, module="sklearn")
49+
# %%
50+
# Define a custom MDS solver
51+
# --------------------------
52+
#
53+
# We define a custom MDS solver that uses the `NetworkX` library to compute the MDS.
54+
# Note the signature of the function is identical to the one used in the `RadiusClustering` class.
55+
56+
57+
def custom_solver(n: int, edges: np.ndarray, nb_edges: int, random_state=None):
58+
    """
59+
    Custom MDS solver using NetworkX to compute the MDS problem.
60+
61+
    Parameters:
62+
    -----------
63+
    n : int
64+
        The number of points in the dataset.
65+
    edges : np.ndarray
66+
        The edges of the graph, flattened into a 1D array.
67+
    nb_edges : int
68+
        The number of edges in the graph.
69+
    random_state : int | None
70+
        The random state to use for reproducibility.
71+
72+
    Returns:
73+
    --------
74+
    centers : list
75+
        A sorted list of the centers of the clusters.
76+
    mds_exec_time : float
77+
        The execution time of the MDS algorithm in seconds.
78+
    """
79+
    G = nx.Graph()
80+
    G.add_nodes_from(range(n))  # keep isolated points in the graph
81+
    G.add_edges_from(np.asarray(edges).reshape(-1, 2))  # flat or (m, 2) input
82+
    start_time = time.time()
83+
    centers = list(nx.algorithms.dominating.dominating_set(G))
84+
    mds_exec_time = time.time() - start_time
85+
86+
    centers = sorted(centers)
87+
88+
    return centers, mds_exec_time
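To make the solver contract concrete, here is a minimal sketch of a solver with the same signature that does not depend on NetworkX. It is a plain greedy heuristic, not the algorithm used by the package: `greedy_solver` is a hypothetical name, and the assumption that `edges` arrives as a flat sequence of node-index pairs is taken from the docstring of `custom_solver`.

```python
import time


def greedy_solver(n, edges, nb_edges, random_state=None):
    """Hypothetical greedy dominating-set solver (illustration only)."""
    # Assumption: `edges` is flat, [u0, v0, u1, v1, ...].
    # `nb_edges` and `random_state` are unused; they are kept to match the signature.
    flat = [int(x) for x in edges]
    # Every node dominates itself, so each closed neighbourhood starts with the node.
    neighbours = {i: {i} for i in range(n)}
    for u, v in zip(flat[0::2], flat[1::2]):
        neighbours[u].add(v)
        neighbours[v].add(u)

    start = time.perf_counter()
    uncovered = set(range(n))
    centers = []
    while uncovered:
        # Pick the node covering the most uncovered points; ties go to the lowest index.
        best = max(range(n), key=lambda i: (len(neighbours[i] & uncovered), -i))
        centers.append(best)
        uncovered -= neighbours[best]
    exec_time = time.perf_counter() - start
    return sorted(centers), exec_time


# Path graph 0-1-2-3-4 plus an isolated point 5.
centers, elapsed = greedy_solver(6, [0, 1, 1, 2, 2, 3, 3, 4], 4)
print(centers)  # [1, 3, 5]
```

The isolated point 5 ends up as its own center, which is why the node set has to be built from `n` and not only from the edge list.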
89+
90+
91+
# %%
92+
# Define datasets to test the clustering
93+
# --------------------------------------
94+
#
95+
# We will use 8 datasets fetched from OpenML to test the clustering:
96+
# Iris, Wine, Glass, Ionosphere,
97+
# Breast Cancer (WDBC), Synthetic Control,
98+
# Vehicle and Yeast.
99+
#
100+
# These are common machine-learning datasets, small enough for the benchmark to run quickly.
101+
# Structure of the variable `DATASETS`:
102+
# - The key is the name of the dataset.
103+
# - The value is a tuple containing:
104+
# - The dataset fetched from OpenML.
105+
# - The radius to use for the Radius clustering. (determined in literature, see references on home page)
106+
#
107+
108+
109+
DATASETS = {
110+
    "iris": (fetch_openml(name="iris", version=1, as_frame=False), 1.43),
111+
    "wine": (fetch_openml(name="wine", version=1, as_frame=False), 232.09),
112+
    "glass": (fetch_openml(name="glass", version=1, as_frame=False), 3.94),
113+
    "ionosphere": (fetch_openml(name="ionosphere", version=1, as_frame=False), 5.46),
114+
    "breast_cancer": (fetch_openml(name="wdbc", version=1, as_frame=False), 1197.42),
115+
    "synthetic": (fetch_openml(name="synthetic_control", version=1, as_frame=False), 70.12),
116+
    "vehicle": (fetch_openml(name="vehicle", version=1, as_frame=False), 155.05),
117+
    "yeast": (fetch_openml(name="yeast", version=1, as_frame=False), 0.4235),
118+
}
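The radius stored with each dataset presumably determines the graph handed to the MDS solver, with two points linked when their distance does not exceed the radius. The sketch below illustrates that relationship in pure Python. `build_edges` is a hypothetical helper, the Euclidean metric is an assumption, and this is not the package's own implementation.

```python
from itertools import combinations
from math import dist


def build_edges(points, radius):
    """Return index pairs (i, j), i < j, of points at most `radius` apart."""
    return [
        (i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
        if dist(p, q) <= radius
    ]


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
small = build_edges(points, radius=1.5)
large = build_edges(points, radius=5.0)
print(small)  # [(0, 1)]
print(large)  # [(0, 1), (0, 2), (1, 2)]
```

A larger radius gives a denser graph, which typically needs fewer centers to dominate it. This is one reason why each dataset carries its own radius.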
119+
120+
# %%
121+
# Define the benchmark procedure
122+
# --------------------------------------
123+
#
124+
# We define a function to perform the benchmark on the datasets.
125+
# The procedure is as follows:
126+
# 1. Create an instance of `RadiusClustering` for each solver.
127+
# 2. For each instance, fit the algorithm on each dataset.
128+
# 3. Store the execution time and the number of clusters found for each dataset.
129+
# 4. Return the results as a dictionary.
130+
131+
132+
def benchmark_radius_clustering():
133+
    results = {}
134+
    exact = RadiusClustering(manner="exact", radius=1.43)
135+
    approx = RadiusClustering(manner="approx", radius=1.43)
136+
    custom = RadiusClustering(
137+
        manner="custom", radius=1.43
138+
    )
139+
    custom.set_solver(custom_solver)  # Set the custom solver
140+
    algorithms = [exact, approx, custom]
141+
    # Loop through each algorithm and dataset
142+
    for algo in algorithms:
143+
        algo_results = {}
144+
        time_algo = []
145+
        clusters_algo = []
146+
        # Loop through each dataset
147+
        for name, (dataset, radius) in DATASETS.items():
148+
            X = dataset.data
149+
            # set the radius for the dataset considered
150+
            algo.radius = radius
151+
            # Fit the algorithm
152+
            t0 = time.time()
153+
            algo.fit(X)
154+
            t_algo = time.time() - t0
155+
156+
            # Store the results
157+
            time_algo.append(t_algo)
158+
            clusters_algo.append(len(algo.centers_))
159+
        algo_results["time"] = time_algo
160+
        algo_results["clusters"] = clusters_algo
161+
        results[algo.manner] = algo_results
162+
163+
    return results
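The benchmark times a single `fit` call with `time.time()`. A slightly more robust variant, sketched here, uses `time.perf_counter()` and keeps the best of several runs. `time_call` and its `repeats` parameter are hypothetical names introduced for illustration and are not part of the package.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times; return (last result, best time in s)."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best


total, seconds = time_call(sum, range(100_000), repeats=5)
print(total, f"{seconds:.6f} s")
```

Inside the benchmark loop, `algo.fit(X)` could be wrapped as `time_call(algo.fit, X)`, at the cost of fitting each model several times.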
164+
165+
166+
# %%
167+
# Run the benchmark and plot the results
168+
# --------------------------------------
169+
# We run the benchmark and plot the results for each dataset.
170+
171+
172+
results = benchmark_radius_clustering()
173+
174+
# Plot the results
175+
fig, axs = plt.subplot_mosaic(
176+
    [
177+
        ["time", "time", "time", "time"],
178+
        ["iris", "wine", "breast_cancer", "vehicle"],
179+
        ["glass", "ionosphere", "synthetic", "yeast"],
180+
    ],
181+
    layout="constrained",
182+
    figsize=(12, 8),
183+
)
184+
fig.suptitle("Benchmark of Radius Clustering Solvers", fontsize=16)
185+
186+
axs['time'].set_yscale('log')  # Use logarithmic scale for better visibility
187+
for algo, algo_results in results.items():
188+
    # Plot execution time
189+
    axs['time'].plot(
190+
        list(DATASETS),
191+
        algo_results["time"],
192+
        marker='o',
193+
        label=algo,
194+
    )
195+
196+
# Plot the number of clusters found for each dataset
197+
for i, (name, (dataset, _)) in enumerate(DATASETS.items()):
198+
    axs[name].bar(
199+
        list(results),
200+
        [results[algo]["clusters"][i] for algo in results],
201+
        label=name,
202+
    )
203+
    axs[name].axhline(
204+
        y=len(set(dataset.target)),  # Number of unique classes in the dataset
205+
        label="True number of clusters",
206+
        color='r',
207+
        linestyle='--',
208+
    )
209+
    axs[name].set_title(name)
210+
    axs[name].set_xlabel("Algorithms")
211+
212+
axs["iris"].set_ylabel("Number of clusters")
213+
axs["glass"].set_ylabel("Number of clusters")
214+
215+
axs['time'].set_title("Execution Time (log scale)")
216+
axs['time'].set_xlabel("Datasets")
217+
axs['time'].set_ylabel("Time (seconds)")
218+
axs['time'].legend(title="Algorithms")
219+
plt.tight_layout()
220+
plt.show()
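When a figure is inconvenient (for example in a CI log), the same `results` dictionary can be printed as text. The sketch below assumes the structure returned by `benchmark_radius_clustering`, namely one entry per solver with parallel `"time"` and `"clusters"` lists. `format_results` is a hypothetical helper, and the numbers in the usage example are placeholders, not measured results.

```python
def format_results(results, dataset_names):
    """Render {solver: {"time": [...], "clusters": [...]}} as a plain-text table."""
    header = f"{'dataset':<12}" + "".join(f"{solver:>20}" for solver in results)
    lines = [header]
    for i, name in enumerate(dataset_names):
        # One cell per solver: "<clusters> / <seconds>s", 20 characters wide.
        cells = "".join(
            f"{res['clusters'][i]:>8d} / {res['time'][i]:>8.4f}s"
            for res in results.values()
        )
        lines.append(f"{name:<12}{cells}")
    return "\n".join(lines)


# Placeholder values only, NOT measured results.
fake = {
    "exact": {"time": [0.5, 1.25], "clusters": [3, 4]},
    "custom": {"time": [0.25, 0.75], "clusters": [5, 6]},
}
table = format_results(fake, ["iris", "wine"])
print(table)
```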
221+
222+
223+
# %%
224+
# Conclusion
225+
# ----------
226+
#
227+
# In this example, we implemented a custom MDS solver based on `NetworkX` and plugged it into Radius Clustering.
228+
# We benchmarked it against the built-in exact and approximate solvers on eight OpenML datasets,
229+
# comparing the execution time and the number of clusters found by each solver.
230+
# The plots show how each solver's cluster count relates to the number of classes in each dataset.
