-
Notifications
You must be signed in to change notification settings - Fork 11
Optimize filter_ids #257
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
kaixi-wang
wants to merge
1
commit into
develop
Choose a base branch
from
kacey/optimize-filter-ids
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Optimize filter_ids #257
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values(..., _enforce_natural_order=False)
is only able to use an optimized pathway when called on a full dataset:https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/core/collections.py#L9648
However,
filter_ids()
is only called in the Brain when we're trying to filter some data down to a particular view:fiftyone-brain/fiftyone/brain/similarity.py
Lines 649 to 653 in 1d68783
fiftyone-brain/fiftyone/brain/visualization.py
Line 528 in 1d68783
So it doesn't appear to me that
_enforce_natural_order=False
would ever actually get used?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in order to replace
values(...)
withvalues(..., _enforce_natural_order=False)
somewhere, we need to verify that all usages of that method would continue to work as expected ifsample_ids
are unordered. Have you investigated that forfilter_ids()
? I cannot recall offhand whether or not that is true. The usage ofsample_ids
andlabel_ids
infiftyone-brain
can be subtle though, so I would definitely tread carefully here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fiftyone-brain/fiftyone/brain/visualization.py
Line 389 in 1d68783
It's called in use_view which is always called to initialize the VisualizationResult.
If you look at my PR description, I include the pymongo logs showing that enforce_natural_order is invoked
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the fiftyone server routes call it on the full dataset so this change does have significant impact on app performance https://github.com/voxel51/fiftyone/blob/173aa7dd1b05b4c530d240d542f3b6367371f631/fiftyone/server/routes/embeddings.py#L65
Looking at the code for this function, any subsequent reliance on index position seems to be derived from this list so it would be safe. However. I did not thoroughly investigate everywhere it is used. I'll double check that in case the callers make some unfortunate assumptions.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm you're right that
use_view()
always gets called whenVisualizationResults
are loaded, which causesfilter_ids()
and thusvalues("id")
to get called.I did some related work in #209 and #238 to prevent unnecessary
values("id")
calls inSimilarityIndex
when the user is working with a "full index". Looks like there's a similar project to be done to avoidvalues("id")
calls when a user is working with a "full"VisualizationResults
instance.In other words, I think the way we want
VisualizationResults
to work is that:dataset.values("id")
should not be called at all, which would be even better thandataset.values("id", _enforce_natural_order=False)
view.values("id")
only gets called when the view is non-trivialUh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. That was my initial intuition too, but seemed like a larger change and I would need to figure out if not filtering out samples that are no longer in the dataset upon initialization of the results would be problematic.
The current change would have significant,immediate benefit without that risk (provided I verify that it wouldn't break any order assumptions in where it's used)
However I'll take a look at your prs for fixing the issue with similarity and see if I can apply them to VisualizationResults as well