Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 7 additions & 2 deletions fiftyone/brain/internal/core/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -234,9 +234,14 @@ def filter_ids(

if patches_field is None:
if samples._is_patches:
sample_ids = np.array(samples.values("sample_id"))
sample_ids = np.array(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:
https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in order to replace values(...) with values(..., _enforce_natural_order=False) somewhere, we need to verify that all usages of that method would continue to work as expected if sample_ids are unordered. Have you investigated that for filter_ids()? I cannot recall offhand whether or not that is true. The usage of sample_ids and label_ids in fiftyone-brain can be subtle though, so I would definitely tread carefully here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:

https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

self.use_view(samples)

It's called in use_view which is always called to initialize the VisualizationResult.

If you look at my PR description, I include the pymongo logs showing that enforce_natural_order is invoked

Copy link
Contributor Author

@kaixi-wang kaixi-wang Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the fiftyone server routes call it on the full dataset so this change does have significant impact on app performance https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/server/routes/embeddings.py#L65

Looking at the code for this function, any subsequent reliance on index position seems to be derived from this list so it would be safe. However. I did not thoroughly investigate everywhere it is used. I'll double check that in case the callers make some unfortunate assumptions.

Copy link
Contributor

@brimoor brimoor Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm you're right that use_view() always gets called when VisualizationResults are loaded, which causes filter_ids() and thus values("id") to get called.

I did some related work in #209 and #238 to prevent unnecessary values("id") calls in SimilarityIndex when the user is working with a "full index". Looks like there's a similar project to be done to avoid values("id") calls when a user is working with a "full" VisualizationResults instance.

In other words, I think the way we want VisualizationResults to work is that:

  • dataset.values("id") should not be called at all, which would be even better than dataset.values("id", _enforce_natural_order=False)
  • view.values("id") only gets called when the view is non-trivial

Copy link
Contributor Author

@kaixi-wang kaixi-wang Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. That was my initial intuition too, but seemed like a larger change and I would need to figure out if not filtering out samples that are no longer in the dataset upon initialization of the results would be problematic.

The current change would have significant,immediate benefit without that risk (provided I verify that it wouldn't break any order assumptions in where it's used)

However I'll take a look at your prs for fixing the issue with similarity and see if I can apply them to VisualizationResults as well

samples.values("sample_id", _enforce_natural_order=False),
copy=False,
)
else:
sample_ids = np.array(samples.values("id"))
sample_ids = np.array(
samples.values("id", _enforce_natural_order=False), copy=False
)

if index_sample_ids is None:
return sample_ids, None, None, None
Expand Down