Optimize filter_ids #257

kaixi-wang · 2025-07-23T04:44:06Z

Optimize call to get sample_ids by forcing values() to use the index if possible and also avoid copying into a new array.

import fiftyone as fo
ds=fo.load_dataset('places')
res=ds.load_brain_results('viz')
len(res.sample_ids). # 2168412

# (note all caching is disabled for these runs)
# current time:
7min 12s ± 8.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# after this change:
53 s ± 2.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you enable pymongo.command debug logs, you can see the significant different in query execution times:

# current:
{
 'message': 'Command succeeded',
 'durationMS': 50653.55,
 'reply': '{"cursor": {"nextBatch": [{"_id": {"$oid": "67b92f9a0eff73f0a1abc480"}, "value0": "67b92f9a0eff73f0a1abc480"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc484"}, "value0": "67b92f9a0eff73f0a1abc484"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc486"}, "value0": "67b92f9a0eff73f0a1abc486"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc487"}, "value0": "67b92f9a0eff73f0a1abc487"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc488"}, "value0": "67b92f9a0eff73f0a1abc488"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48a"}, "value0": "67b92f9a0eff73f0a1abc48a"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc489"}, "value0": "67b92f9a0eff73f0a1abc489"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48b"}, "value0": "67b92f9a0eff73f0a1abc48b"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc485"}, "value0": "67b92f9a0eff73f0a1abc485"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48c"}, "value0": "67b92f9a0eff73f0a1abc48c"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48d"}, "value0": "67b92f9a0eff73f0a1abc48d"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc...',
 'commandName': 'getMore',
 'databaseName': 'kacey-ephem',
 'driverConnectionId': 3,
 'serverHost': 'mongodb.fiftyone.ai',
 'serverPort': 27017}


# using index
{
 'message': 'Command succeeded',
 'durationMS': 2951.7090000000003,
 'reply': '{"cursor": {"nextBatch": [{"_id": {"$oid": "67b92f300eff73f0a1a6b7ef"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f0"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f1"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f2"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f3"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f4"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f5"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f6"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f7"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f8"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f9"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fa"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fb"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fc"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fd"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fe"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7ff"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b800"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b801"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b802"}}, {"_id": {"$oid": "67b92f300eff73f0...',
 'commandName': 'getMore',
 'databaseName': 'kacey-ephem',
 'driverConnectionId': 3,
 'serverHost': 'mongodb.fiftyone.ai',
 'serverPort': 27017}

brimoor · 2025-07-23T13:46:55Z

fiftyone/brain/internal/core/utils.py

    if patches_field is None:
        if samples._is_patches:
-            sample_ids = np.array(samples.values("sample_id"))
+            sample_ids = np.array(


values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:
https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

fiftyone-brain/fiftyone/brain/similarity.py

Lines 649 to 653 in 1d68783

if sample_ids is not None and not self.has_view:

keep_inds = None

good_inds = None

else:

sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(

fiftyone-brain/fiftyone/brain/visualization.py

Line 528 in 1d68783

sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

Also, in order to replace values(...) with values(..., _enforce_natural_order=False) somewhere, we need to verify that all usages of that method would continue to work as expected if sample_ids are unordered. Have you investigated that for filter_ids()? I cannot recall offhand whether or not that is true. The usage of sample_ids and label_ids in fiftyone-brain can be subtle though, so I would definitely tread carefully here.

values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:

https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

fiftyone-brain/fiftyone/brain/similarity.py

Lines 649 to 653 in 1d68783

if sample_ids is not None and not self.has_view:

keep_inds = None

good_inds = None

else:

sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(

fiftyone-brain/fiftyone/brain/visualization.py

Line 528 in 1d68783

sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

fiftyone-brain/fiftyone/brain/visualization.py

Line 389 in 1d68783

self.use_view(samples)

It's called in use_view which is always called to initialize the VisualizationResult.

If you look at my PR description, I include the pymongo logs showing that enforce_natural_order is invoked

Also the fiftyone server routes call it on the full dataset so this change does have significant impact on app performance https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/server/routes/embeddings.py#L65

Looking at the code for this function, any subsequent reliance on index position seems to be derived from this list so it would be safe. However. I did not thoroughly investigate everywhere it is used. I'll double check that in case the callers make some unfortunate assumptions.

Hmm you're right that use_view() always gets called when VisualizationResults are loaded, which causes filter_ids() and thus values("id") to get called.

I did some related work in #209 and #238 to prevent unnecessary values("id") calls in SimilarityIndex when the user is working with a "full index". Looks like there's a similar project to be done to avoid values("id") calls when a user is working with a "full" VisualizationResults instance.

In other words, I think the way we want VisualizationResults to work is that:

dataset.values("id") should not be called at all, which would be even better than dataset.values("id", _enforce_natural_order=False)

view.values("id") only gets called when the view is non-trivial

Agree. That was my initial intuition too, but seemed like a larger change and I would need to figure out if not filtering out samples that are no longer in the dataset upon initialization of the results would be problematic.

The current change would have significant,immediate benefit without that risk (provided I verify that it wouldn't break any order assumptions in where it's used)

However I'll take a look at your prs for fixing the issue with similarity and see if I can apply them to VisualizationResults as well

use index to retrieve ids

626a5c7

brimoor reviewed Jul 23, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimize filter_ids #257

Optimize filter_ids #257

Uh oh!

kaixi-wang commented Jul 23, 2025 •

edited

Loading

Uh oh!

brimoor Jul 23, 2025

Uh oh!

brimoor Jul 23, 2025

Uh oh!

kaixi-wang Jul 23, 2025

Uh oh!

kaixi-wang Jul 23, 2025 •

edited

Loading

Uh oh!

brimoor Jul 23, 2025 •

edited

Loading

Uh oh!

kaixi-wang Jul 23, 2025 •

edited

Loading

Uh oh!

Uh oh!

	if sample_ids is not None and not self.has_view:
	keep_inds = None
	good_inds = None
	else:
	sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(

Optimize filter_ids #257

Are you sure you want to change the base?

Optimize filter_ids #257

Uh oh!

Conversation

kaixi-wang commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

brimoor Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

brimoor Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

kaixi-wang Jul 23, 2025

Choose a reason for hiding this comment

Uh oh!

kaixi-wang Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

brimoor Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kaixi-wang Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

kaixi-wang commented Jul 23, 2025 •

edited

Loading

kaixi-wang Jul 23, 2025 •

edited

Loading

brimoor Jul 23, 2025 •

edited

Loading

kaixi-wang Jul 23, 2025 •

edited

Loading