Description
Hi, thanks for the interesting work!
I'm reading the code and there is a detail I couldn't understand.
When selecting data from each cluster, the corresponding code is:

```python
from collections import Counter

import numpy as np

size = (len(dataset) * portion) / K / round
exp_reward_diff = merged_df["exp_reward_diff"]
# randomly select cluster indices from the K clusters, with p as the weights
select_new_iter = np.random.choice(
    K, size=int(size), p=exp_reward_diff, replace=True
)
# Count how many times a cluster is chosen
selected_clusters_size = Counter(select_new_iter)
# drop rows that were already selected in earlier rounds
remaining_dataset = dataset.select(set(range(len(dataset))) - set(selected_indices))
remaining_dataset_df = remaining_dataset.to_pandas()
new_indices = []
for i in range(K):
    # get current indices in the remaining dataset
    indices = remaining_dataset_df[remaining_dataset_df["cluster"] == i]["index"]
    # adjust size if the selected size exceeds the remaining size
    size = min(selected_clusters_size[i], len(indices))
    # pick real samples from each cluster
    indices = np.random.choice(indices, size=size, replace=False)
    new_indices.extend(indices)
new_indices = np.array(new_indices)
# update the selected samples
new_indices = np.concatenate([selected_indices, new_indices])
```

If I understand correctly, in this iteration the chosen size is `(len(dataset) * portion) / K / round`. The code then selects cluster indices with the given weights, `Counter` counts how many samples are chosen from each cluster, and the subsequent for loop picks real samples from the K clusters. This results in `(len(dataset) * portion) / K / round` samples in total. But in the paper, the size for each iteration should be `(len(dataset) * portion) / round`, without the extra division by K.
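To make the counting concrete, here is a minimal sketch with made-up values for `len(dataset)`, `portion`, `K`, and `round` (and uniform weights standing in for `exp_reward_diff`), showing that the per-cluster counts always sum back to `int(size)`:

```python
import numpy as np
from collections import Counter

# Hypothetical values, purely to illustrate the arithmetic.
n, portion, K, rounds = 10_000, 0.2, 5, 4

size = (n * portion) / K / rounds        # 100.0, not 500.0
weights = np.ones(K) / K                 # uniform stand-in for exp_reward_diff

select_new_iter = np.random.choice(K, size=int(size), p=weights, replace=True)
selected_clusters_size = Counter(select_new_iter)

# The per-cluster counts sum to int(size) = (n * portion) / K / rounds.
assert sum(selected_clusters_size.values()) == int(size)
print(sum(selected_clusters_size.values()))  # 100
```

Since the for loop can only shrink each cluster's count via the `min(...)`, the total number of samples added per iteration is bounded by `int(size)`, i.e. `(len(dataset) * portion) / K / round`.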
It would be great if you could help me understand this detail.