-
Notifications
You must be signed in to change notification settings - Fork 34
Pr pypi stabs #769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Pr pypi stabs #769
Conversation
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
pkg/archive/zip.go
Outdated
for _, mf := range mr.File { | ||
s.(ZipEntryStabilizer).Stabilize(mf) | ||
} | ||
} | ||
newArchiveHash := getArchiveHash(&mr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat! let's separate this out into a different PR. it's a nice feature to add but we can probably review it separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I'm moved it to a separate PR.
@@ -575,7 +576,14 @@ var runBenchmark = &cobra.Command{ | |||
} | |||
return | |||
} | |||
bar := pb.New(set.Count) | |||
|
|||
// Added since the count was always 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm I think this just means our other benchmark generation logic neglected to set Count and we should fix it there. this can also be moved to a separate change.
pkg/archive/pypi.go
Outdated
entries []RecordDistEntry | ||
} | ||
|
||
func ParseMetadataDistInfo(r io.Reader) (*MetadataDistInfo, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's not perfect but net/mail.ReadMessage
should be able to parse out the headers here. It will always produce []string so if we want to assert uniqueness, we'd need to do that separately. Looks like writing it out would also need to be custom (but shouldn't be too tricky).
} | ||
println("Processing file:", zf.Name) | ||
|
||
zf.SetContent([]byte("This needed to change (metadata)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do we want here? I remember this was a source of skew (maybe tools aren't generating these anymore?) but we should reconcile the impl and the stabilizer name: either call this "RemoveMetadataJSON" or keep the file around and apply some stabilizer to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just renamed the stabilizer. Maybe we can think something to stabilize the file instead of removing. I can investigate on how to stabilize instead of removing it
} | ||
|
||
// Parse the METADATA file | ||
manifest, err := ParseMetadataDistInfo(r) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm so actually per my comment above, you may not even need the structured golang object: You could just extract the 3-4 keys below that we special-case, and then unpack the rest naively from the mail.Header map.
if mayContainGeneratedDocstring(string(originalContent)) { | ||
println("File likely contains auto-generated docstrings, applying stabilizer:", zf.Name) | ||
|
||
cleanedContent, err := RemovePythonComments(originalContent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, we could be a bit more conservative and just remove the aspects of the generated docstrings that we see create issues. if it's isolated to the args and returns sections, that would really reduce the blast radius of this stabilizer (and, consequently, its risk).
return | ||
} | ||
// Replace all \r\n with \n | ||
normalized := bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to try the git version of this in the build script before we try to hack it in the stabilizer: git config --global core.autocrlf true
Let's move this to a separate change until we can see if we need it.
// println("Archive SHA256 (base64):", originalArchiveHash) | ||
|
||
// Define the pattern to find | ||
patternToFind := regexp.MustCompile(`(?m)^TYPE_CHECKING = False\nif TYPE_CHECKING:\n(\s*)from typing import Tuple, Union\n(\s*)VERSION_TUPLE = Tuple\[Union\[int, str\], \.\.\.\]$`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be curious as to where this is used and what set of issues this is meant to address. If there's a specific library generating this, we should document the version that produces these differing formats.
pkg/archive/pypi.go
Outdated
Name: "version-file", | ||
Func: func(zr *MutableZipReader) { | ||
|
||
// Compute a hash of the entire archive content for debugging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented code. here and below
// originalArchiveHash := getArchiveHash(zr) | ||
for _, zf := range zr.File { | ||
// Only process METADATA files | ||
if !strings.HasSuffix(zf.Name, "_version.py") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this is a safe change. I'd be curious to know how effective/necessary this is to improve outcomes. in addition to being quite blunt, we also over-match paths here that might have text before the underscore e.g. "foo_version.py"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. This was very experimental. We definitely need a different approach here
Adding current iteration of python stabilizers for review!