Skip to content

Conversation

kaixi-wang
Copy link
Contributor

@kaixi-wang kaixi-wang commented Jul 23, 2025

Optimize call to get sample_ids by forcing values() to use the index if possible and also avoid copying into a new array.

import fiftyone as fo
ds=fo.load_dataset('places')
res=ds.load_brain_results('viz')
len(res.sample_ids). # 2168412

# (note all caching is disabled for these runs)
# current time:
7min 12s ± 8.56 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

# after this change:
53 s ± 2.87 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

If you enable pymongo.command debug logs, you can see the significant different in query execution times:

# current:
{
 'message': 'Command succeeded',
 'durationMS': 50653.55,
 'reply': '{"cursor": {"nextBatch": [{"_id": {"$oid": "67b92f9a0eff73f0a1abc480"}, "value0": "67b92f9a0eff73f0a1abc480"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc484"}, "value0": "67b92f9a0eff73f0a1abc484"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc486"}, "value0": "67b92f9a0eff73f0a1abc486"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc487"}, "value0": "67b92f9a0eff73f0a1abc487"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc488"}, "value0": "67b92f9a0eff73f0a1abc488"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48a"}, "value0": "67b92f9a0eff73f0a1abc48a"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc489"}, "value0": "67b92f9a0eff73f0a1abc489"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48b"}, "value0": "67b92f9a0eff73f0a1abc48b"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc485"}, "value0": "67b92f9a0eff73f0a1abc485"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48c"}, "value0": "67b92f9a0eff73f0a1abc48c"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc48d"}, "value0": "67b92f9a0eff73f0a1abc48d"}, {"_id": {"$oid": "67b92f9a0eff73f0a1abc...',
 'commandName': 'getMore',
 'databaseName': 'kacey-ephem',
 'driverConnectionId': 3,
 'serverHost': 'mongodb.fiftyone.ai',
 'serverPort': 27017}


# using index
{
 'message': 'Command succeeded',
 'durationMS': 2951.7090000000003,
 'reply': '{"cursor": {"nextBatch": [{"_id": {"$oid": "67b92f300eff73f0a1a6b7ef"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f0"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f1"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f2"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f3"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f4"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f5"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f6"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f7"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f8"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7f9"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fa"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fb"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fc"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fd"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7fe"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b7ff"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b800"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b801"}}, {"_id": {"$oid": "67b92f300eff73f0a1a6b802"}}, {"_id": {"$oid": "67b92f300eff73f0...',
 'commandName': 'getMore',
 'databaseName': 'kacey-ephem',
 'driverConnectionId': 3,
 'serverHost': 'mongodb.fiftyone.ai',
 'serverPort': 27017}

if patches_field is None:
if samples._is_patches:
sample_ids = np.array(samples.values("sample_id"))
sample_ids = np.array(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:
https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in order to replace values(...) with values(..., _enforce_natural_order=False) somewhere, we need to verify that all usages of that method would continue to work as expected if sample_ids are unordered. Have you investigated that for filter_ids()? I cannot recall offhand whether or not that is true. The usage of sample_ids and label_ids in fiftyone-brain can be subtle though, so I would definitely tread carefully here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

values(..., _enforce_natural_order=False) is only able to use an optimized pathway when called on a full dataset:

https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648

However, filter_ids() is only called in the Brain when we're trying to filter some data down to a particular view:

So it doesn't appear to me that _enforce_natural_order=False would ever actually get used?

self.use_view(samples)

It's called in use_view which is always called to initialize the VisualizationResult.

If you look at my PR description, I include the pymongo logs showing that enforce_natural_order is invoked

Copy link
Contributor Author

@kaixi-wang kaixi-wang Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the fiftyone server routes call it on the full dataset so this change does have significant impact on app performance https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/server/routes/embeddings.py#L65

Looking at the code for this function, any subsequent reliance on index position seems to be derived from this list so it would be safe. However. I did not thoroughly investigate everywhere it is used. I'll double check that in case the callers make some unfortunate assumptions.

Copy link
Contributor

@brimoor brimoor Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm you're right that use_view() always gets called when VisualizationResults are loaded, which causes filter_ids() and thus values("id") to get called.

I did some related work in #209 and #238 to prevent unnecessary values("id") calls in SimilarityIndex when the user is working with a "full index". Looks like there's a similar project to be done to avoid values("id") calls when a user is working with a "full" VisualizationResults instance.

In other words, I think the way we want VisualizationResults to work is that:

  • dataset.values("id") should not be called at all, which would be even better than dataset.values("id", _enforce_natural_order=False)
  • view.values("id") only gets called when the view is non-trivial

Copy link
Contributor Author

@kaixi-wang kaixi-wang Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. That was my initial intuition too, but seemed like a larger change and I would need to figure out if not filtering out samples that are no longer in the dataset upon initialization of the results would be problematic.

The current change would have significant,immediate benefit without that risk (provided I verify that it wouldn't break any order assumptions in where it's used)

However I'll take a look at your prs for fixing the issue with similarity and see if I can apply them to VisualizationResults as well

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants