-
Notifications
You must be signed in to change notification settings - Fork 11
Optimize filter_ids #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Optimize filter_ids #257
Conversation
if patches_field is None: | ||
if samples._is_patches: | ||
sample_ids = np.array(samples.values("sample_id")) | ||
sample_ids = np.array( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values(..., _enforce_natural_order=False)
is only able to use an optimized pathway when called on a full dataset:
https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648
However, filter_ids()
is only called in the Brain when we're trying to filter some data down to a particular view:
fiftyone-brain/fiftyone/brain/similarity.py
Lines 649 to 653 in 1d68783
if sample_ids is not None and not self.has_view: keep_inds = None good_inds = None else: sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids( sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(
So it doesn't appear to me that _enforce_natural_order=False
would ever actually get used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in order to replace values(...)
with values(..., _enforce_natural_order=False)
somewhere, we need to verify that all usages of that method would continue to work as expected if sample_ids
are unordered. Have you investigated that for filter_ids()
? I cannot recall offhand whether or not that is true. The usage of sample_ids
and label_ids
in fiftyone-brain
can be subtle though, so I would definitely tread carefully here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values(..., _enforce_natural_order=False)
is only able to use an optimized pathway when called on a full dataset:However,
filter_ids()
is only called in the Brain when we're trying to filter some data down to a particular view:
fiftyone-brain/fiftyone/brain/similarity.py
Lines 649 to 653 in 1d68783
if sample_ids is not None and not self.has_view: keep_inds = None good_inds = None else: sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids(
sample_ids, label_ids, keep_inds, good_inds = fbu.filter_ids( So it doesn't appear to me that
_enforce_natural_order=False
would ever actually get used?
self.use_view(samples) |
It's called in use_view which is always called to initialize the VisualizationResult.
If you look at my PR description, I include the pymongo logs showing that enforce_natural_order is invoked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the fiftyone server routes call it on the full dataset so this change does have significant impact on app performance https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/server/routes/embeddings.py#L65
Looking at the code for this function, any subsequent reliance on index position seems to be derived from this list so it would be safe. However. I did not thoroughly investigate everywhere it is used. I'll double check that in case the callers make some unfortunate assumptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm you're right that use_view()
always gets called when VisualizationResults
are loaded, which causes filter_ids()
and thus values("id")
to get called.
I did some related work in #209 and #238 to prevent unnecessary values("id")
calls in SimilarityIndex
when the user is working with a "full index". Looks like there's a similar project to be done to avoid values("id")
calls when a user is working with a "full" VisualizationResults
instance.
In other words, I think the way we want VisualizationResults
to work is that:
dataset.values("id")
should not be called at all, which would be even better thandataset.values("id", _enforce_natural_order=False)
view.values("id")
only gets called when the view is non-trivial
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. That was my initial intuition too, but seemed like a larger change and I would need to figure out if not filtering out samples that are no longer in the dataset upon initialization of the results would be problematic.
The current change would have significant,immediate benefit without that risk (provided I verify that it wouldn't break any order assumptions in where it's used)
However I'll take a look at your prs for fixing the issue with similarity and see if I can apply them to VisualizationResults as well
Optimize call to get sample_ids by forcing
values()
to use the index if possible and also avoid copying into a new array.If you enable pymongo.command debug logs, you can see the significant different in query execution times: