-
Couldn't load subscription status.
- Fork 17
Crawler only and prioritization #207
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
rules/options.go
Outdated
|
|
||
| // HighPriority places priority on fields | ||
| // that are associated with a rule during | ||
| // a crawler run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to mention that normal watcher processing is also done for these rules.
Do we want to consider an uint here, and let the use specific a specific priority order? 1-n, 1 being the highest? Would allow them to "group" priority 1 rules or which ever way they want. getPrioritizedPrefixes could just do a value sort. Rules that overlap, highest priority wins.
rules/engine.go
Outdated
| // This is a map; used to ensure there are no duplicates | ||
| for prefix := range prefixes { | ||
| prefixSlice = append(prefixSlice, prefix) | ||
| for prefix := range e.ruleMgr.watcherPrefixes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather have a method for access to this, so all of the internals is kept within the rule mgr.
rules/rule_manager.go
Outdated
| add = false | ||
| // use the highest priority of any | ||
| // overlapping prefixes | ||
| if priorityCheck > priority { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, this looks like it's in the wrong spot, with add=false right above it, the check outside the loop, if add, will cause the priority to never get used?
Maybe add a test just for this case, overlap and higher priority later in the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh oops, yeah
v3enginetest/main.go
Outdated
| check(err) | ||
|
|
||
| // Verify the crawler rule ran | ||
| tenSecCtx2, cancel := context.WithTimeout(context.Background(), 15*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15 vs tenSec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, don't think there's a way for the callback to know the source watcher vs crawler. I don't know if we hide that in the rule callback attributes or not.
Or if you could check the task context for the info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check to the callback
rules/rule_manager.go
Outdated
| rulesBySlashCount: map[int]map[DynamicRule]int{}, | ||
| prefixes: map[string]string{}, | ||
| prefixes: map[string]uint{}, | ||
| watcherPrefixes: map[string]uint{}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to store these here, would be a dup list in most cases?
Could we instead use a
prefixes: map[string]struct{crawlerOnly: bool, priority: uint} and can generate any combination from that?
rules/rule_manager.go
Outdated
| func (rm *ruleManager) getStaticRules(key string, value *string) (map[staticRule]int, []staticRule) { | ||
| slashCount := strings.Count(key, "/") | ||
| out := make(map[staticRule]int) | ||
| prioritized := make(map[staticRule]uint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are really "prioritized" there just input to be sorted later, maybe we need a diff name here.
rules/rule_manager.go
Outdated
| } | ||
| } | ||
| return out | ||
| return out, sortRulesByPriority(prioritized) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Humm, I worry about the overhead of doing this for every call when the result is always the same. Every field value, which can be large, 20 million each crawl bootstrap.
Maybe we need some debug logging or enable Go profiling to see how much of a hit this is going to be.
Issue https://github.ibm.com/alchemy-containers/armada-ballast/issues/3593
ENSURE THE FOLLOWING ARE MET:
Below is a brief summary of PR requirements (full list).
By opening this PR for review, the author has agreed that these criteria must be met.
By approving this PR, the reviewers have also agreed these criteria have been met and it is ready to be merged.