
Conversation

gtystahl

Adding the current iteration of the Python stabilizers for review!


google-cla bot commented Sep 10, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@msuozzo msuozzo self-requested a review September 10, 2025 19:39
for _, mf := range mr.File {
s.(ZipEntryStabilizer).Stabilize(mf)
}
}
newArchiveHash := getArchiveHash(&mr)
Member

neat! let's separate this out into a different PR. it's a nice feature to add but we can probably review it separately


Sounds good! I've moved it to a separate PR.

@@ -575,7 +576,14 @@ var runBenchmark = &cobra.Command{
}
return
}
bar := pb.New(set.Count)

// Added since the count was always 0
Member

hmm I think this just means our other benchmark generation logic neglected to set Count and we should fix it there. this can also be moved to a separate change.

entries []RecordDistEntry
}

func ParseMetadataDistInfo(r io.Reader) (*MetadataDistInfo, error) {
Member

So it's not perfect, but net/mail.ReadMessage should be able to parse out the headers here. It will always produce []string, so if we want to assert uniqueness, we'd need to do that separately. Looks like writing it out would also need to be custom (but shouldn't be too tricky).

}
println("Processing file:", zf.Name)

zf.SetContent([]byte("This needed to change (metadata)"))
Member

what do we want here? I remember this was a source of skew (maybe tools aren't generating these anymore?) but we should reconcile the impl and the stabilizer name: either call this "RemoveMetadataJSON" or keep the file around and apply some stabilizer to it.


I just renamed the stabilizer. Maybe we can think of something to stabilize the file instead of removing it; I can investigate how to do that.

}

// Parse the METADATA file
manifest, err := ParseMetadataDistInfo(r)
Member

hmm so actually per my comment above, you may not even need the structured golang object: You could just extract the 3-4 keys below that we special-case, and then unpack the rest naively from the mail.Header map.

if mayContainGeneratedDocstring(string(originalContent)) {
println("File likely contains auto-generated docstrings, applying stabilizer:", zf.Name)

cleanedContent, err := RemovePythonComments(originalContent)
Member

alternatively, we could be a bit more conservative and just remove the aspects of the generated docstrings that we see create issues. if it's isolated to the args and returns sections, that would really reduce the blast radius of this stabilizer (and, consequently, its risk).
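One way to sketch that narrower approach, assuming the skew is isolated to Google-style "Args:"/"Returns:" docstring sections. The regex and helper name here are hypothetical, not the PR's implementation, and RE2 has no backreferences, so the indentation of section bodies is approximated rather than matched exactly:

```go
package main

import (
	"fmt"
	"regexp"
)

// sectionRe matches an "Args:" or "Returns:" section header plus its
// indented body lines, stopping at blank lines or a line whose first
// non-whitespace character is a quote (the docstring terminator).
var sectionRe = regexp.MustCompile(`(?m)^[ \t]*(?:Args|Returns):\n(?:[ \t]+[^"\s].*\n)*`)

// stripDocstringSections removes only those sections rather than all
// comments, reducing the stabilizer's blast radius.
func stripDocstringSections(src []byte) []byte {
	return sectionRe.ReplaceAll(src, nil)
}

func main() {
	doc := "    \"\"\"Do a thing.\n\n" +
		"    Args:\n        x: input value.\n\n" +
		"    Returns:\n        The result.\n" +
		"    \"\"\"\n"
	fmt.Printf("%s", stripDocstringSections([]byte(doc)))
}
```

The summary line and the docstring delimiters survive; only the section bodies are dropped.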

return
}
// Replace all \r\n with \n
normalized := bytes.ReplaceAll(data, []byte("\r\n"), []byte("\n"))
Member

I'd like to try the git version of this in the build script before we try to hack it in the stabilizer: git config --global core.autocrlf true

Let's move this to a separate change until we can see if we need it.
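For reference, the git-side normalization suggested above is a one-line config change; this assumes the build script obtains sources via git, and note that on Linux build hosts `input` is the usual alternative value:

```shell
# Normalize line endings in git itself rather than in the stabilizer:
# core.autocrlf=true converts CRLF to LF on commit and LF to CRLF on
# checkout ("input" skips the checkout conversion, keeping LF on disk).
git config --global core.autocrlf true
```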

// println("Archive SHA256 (base64):", originalArchiveHash)

// Define the pattern to find
patternToFind := regexp.MustCompile(`(?m)^TYPE_CHECKING = False\nif TYPE_CHECKING:\n(\s*)from typing import Tuple, Union\n(\s*)VERSION_TUPLE = Tuple\[Union\[int, str\], \.\.\.\]$`)
Member

I'd be curious as to where this is used and what set of issues this is meant to address. If there's a specific library generating this, we should document the version that produces these differing formats.

Name: "version-file",
Func: func(zr *MutableZipReader) {

// Compute a hash of the entire archive content for debugging
Member

commented code. here and below

// originalArchiveHash := getArchiveHash(zr)
for _, zf := range zr.File {
// Only process METADATA files
if !strings.HasSuffix(zf.Name, "_version.py") {
Member

not sure this is a safe change. I'd be curious to know how effective/necessary this is to improve outcomes. in addition to being quite blunt, we also over-match paths here that might have text before the underscore e.g. "foo_version.py"


You're right. This was very experimental. We definitely need a different approach here.
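If a narrower check turns out to be worthwhile, one hypothetical refinement (the helper name is illustrative) is to match the archive entry's base name exactly rather than a suffix, so "foo_version.py" no longer over-matches:

```go
package main

import (
	"fmt"
	"path"
)

// isVersionFile matches entries literally named "_version.py", so
// "pkg/_version.py" matches but "pkg/foo_version.py" does not.
// Zip entry names use forward slashes, hence "path" not "filepath".
func isVersionFile(name string) bool {
	return path.Base(name) == "_version.py"
}

func main() {
	fmt.Println(isVersionFile("pkg/_version.py"))    // true
	fmt.Println(isVersionFile("pkg/foo_version.py")) // false
}
```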
