
Conversation

@razdoburdin
Contributor

The current version of xgboost uses a fixed block_size = 256 for hist building.

This PR makes this value an adaptive function of the model parameters and the CPU cache size. The change matters mostly for ColsWiseBuildHistKernel and demonstrates up to a 2x speed-up on the epsilon dataset.

@razdoburdin razdoburdin marked this pull request as draft November 12, 2025 17:18
@trivialfis
Member

Thank you for the optimizations! The code looks reasonable, but please add comments when the PR is ready for review. (and ping me).

std::size_t occupied_space = (hist_fit_to_l1 ? hist_size : 0) + offsets_size + idx_bin_size;
space_in_l1_for_rows = usable_l1_size > occupied_space ? usable_l1_size - occupied_space : 0;
}
std::size_t block_size = std::max<std::size_t>(1, space_in_l1_for_rows / l1_row_foot_print);

Previously block_size was always 256 rows, which is quite large. Now it can be as small as 1 row when no more rows fit into L1. Won't this change hurt performance in the case where there is not enough space for rows in L1?
Should it be max(256, space_in_l1_for_rows / l1_row_foot_print)?
Or maybe the L2 size should be used to calculate block_size?

Contributor Author

I think cacheline_size / (2 * sizeof(float)) = 8 would be the best value in this case. Using L2 would result in a huge block_size (~1e4-1e5) and could underutilize the CPU cores (blocks are processed in parallel, and if blocks are very big, some cores would be left without work).

@razdoburdin razdoburdin marked this pull request as ready for review November 13, 2025 16:20
@razdoburdin
Contributor Author

Hi @trivialfis, this PR is ready for review.

Member

@trivialfis left a comment


Would you like to explain the cache info in code comments? Also, how and why the construction of the hist space depends on the cache size?

@razdoburdin
Contributor Author

Would you like to explain the cache info in code comments? Also, how and why the construction of the hist space depends on the cache size?

done

GetCacheInfo(cache_num++, &type, &level, &sets, &line_size, &partitions, &ways);
if (!trust_cpuid) return trust_cpuid;

if (type == kCpuidTypeNull) break;
Member

Is the cache_sizes[idx] valid if we break here?

Contributor Author

In this case we use default values from SetDefaultCaches

size_t hist_size = 2 * sizeof(double) * nbins;
const bool hist_fit_to_l2 = 0.8 * cache_manager_.L2Size() > hist_size;

bool read_by_column = !hist_fit_to_l2 && gidx.IsDense();
Member

Could you please add a comment on how this decision is made?

Contributor Author

done

std::size_t n_bins = gidx.cut.Ptrs().back();
std::size_t n_columns = gidx.cut.Ptrs().size() - 1;
bool any_missing = !gidx.IsDense();
std::size_t hist_size = 2 * sizeof(double) * n_bins;
Member

Consider using sizeof(GradientPair) and sizeof(GradientPairPrecise) instead of sizeof(float) * 2 (for all sizeof calls in this PR).

Contributor Author

done

*/

/* First step: determine whether one histogram column fits into L1.
* The maximum number of elements in a column is 2^8, 2^16, or 2^32,
Member

Could you please elaborate on what it means to be the maximum number of elements in a (histogram) column? I thought that's the number of histogram bins?

Contributor Author

You are right, "bins" is the correct term. I have fixed the description.

Member

Thank you for updating the comments. It's still not clear to me what it means to have "maximum number of bins" in a column. So, what happens if I specify the training parameter max_bin=53?
