Why choosing with size (len(dataset) * portion) / K / round #8

@JessePrince

Hi, thanks for the interesting work!

I'm reading the code and there is a detail I couldn't understand.

When selecting data from each cluster, the corresponding code is

size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
  
# random select from K clusters, with p as the weight.
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)

remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()

new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])

If I understand correctly, the size chosen for this iteration is (len(dataset) * portion) / K / round. The code draws int(size) cluster IDs from the K clusters, weighted by exp_reward_diff, and Counter counts how many times each cluster was drawn. The subsequent for loop then picks that many real samples from each of the K clusters. This results in at most (len(dataset) * portion) / K / round samples in total for the iteration (fewer if a cluster runs out of remaining samples). But in the paper, the size for each iteration should be $b_{it} = \frac{b}{N}$, so I guess there should be no division by K?
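To make the question concrete, here is a minimal sketch of the counting step only. It is not the repository's code: the numbers, the random weights, and the helper name draws_per_iteration are made up for illustration, and n_rounds stands in for round in the snippet above.

```python
from collections import Counter

import numpy as np

# Hypothetical values, chosen only so the arithmetic is easy to follow.
dataset_len = 8000   # stands in for len(dataset)
portion = 0.25       # fraction of the dataset to select overall
K = 10               # number of clusters
n_rounds = 5         # "round" in the quoted code

rng = np.random.default_rng(0)
weights = rng.random(K)
weights /= weights.sum()  # sampling weights must sum to 1


def draws_per_iteration(size):
    """Draw int(size) cluster IDs and return the total of the per-cluster counts."""
    chosen = rng.choice(K, size=int(size), p=weights, replace=True)
    counts = Counter(chosen.tolist())
    return sum(counts.values())


budget = dataset_len * portion                        # b = 2000
with_k = draws_per_iteration(budget / K / n_rounds)   # as in the quoted code
without_k = draws_per_iteration(budget / n_rounds)    # b / N as in the paper

print(with_k, without_k)
```

With these assumed values, the per-cluster counts sum to 40 with the division by K and to 400 without it, so the quoted formula seems to give a per-iteration total that is K times smaller than $\frac{b}{N}$.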

It would be great if you can help me understand this detail.
