Description
Hi, thanks for the interesting work!
I'm reading the code and there is a detail I couldn't understand.
When selecting data from each cluster, the corresponding code is:

```python
from collections import Counter

import numpy as np

size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
# randomly select cluster indices from the K clusters, with p as the weights
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)
# drop rows that were already selected in earlier rounds
remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()
new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])
```

If I understand correctly, in this iteration the chosen size is `(len(dataset) * portion) / K / round`. The code then selects cluster indices with the given weights, `Counter` counts how many samples are chosen from each cluster, and the subsequent for loop picks real samples from the K clusters. This results in `(len(dataset) * portion) / K / round` samples in total. But in the paper, the size for each iteration should be `(len(dataset) * portion) / round`, without the extra division by K.
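To make the counting concrete, here is a minimal sketch with made-up values for `len(dataset)`, `portion`, `K`, and `round` (and uniform weights standing in for `exp_reward_diff`), showing that the per-cluster counts always sum back to `int(size)`:

```python
import numpy as np
from collections import Counter

# Hypothetical values, purely to illustrate the arithmetic.
n, portion, K, rounds = 10_000, 0.2, 5, 4

size = (n * portion) / K / rounds        # 100.0, not 500.0
weights = np.ones(K) / K                 # uniform stand-in for exp_reward_diff

select_new_iter = np.random.choice(K, size=int(size), p=weights, replace=True)
selected_clusters_size = Counter(select_new_iter)

# The per-cluster counts sum to int(size) = (n * portion) / K / rounds.
assert sum(selected_clusters_size.values()) == int(size)
print(sum(selected_clusters_size.values()))  # 100
```

Since the for loop can only shrink each cluster's count via the `min(...)`, the total number of samples added per iteration is bounded by `int(size)`, i.e. `(len(dataset) * portion) / K / round`.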
It would be great if you could help me understand this detail.