Add support for SHA-1 arbitrarily-large objects (AKA Git objects) #1473

@Ericson2314

So first of all I think your current design of focusing on Blake3 and its tree hashing for verified streaming of large objects is very good. Unquestionably on technical grounds, this is the way forward.

However, I also think that there is a lot of existing git-content-addressed data out there, and that content-addressing works best and most simply when the same content-addressing format works end-to-end. I am pretty convinced that the best way for IPFS-family stuff to get adoption is to work with this data and its current addressing scheme.

Concretely, any "linear" hash function can also be thought of as a tree hash, just one that uses really shitty unbalanced binary trees. So the same techniques by which the intermediate steps of Blake3 hashing can be viewed as a Merkle DAG apply to SHA-1's intermediate steps too.
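To make that concrete, here is a rough sketch of one way to read it, not a proposal for the actual wire format. Each SHA-1 chaining value is treated as the "hash" of a node `(previous chaining value, 64-byte block)`, so the final digest is the root of a maximally unbalanced tree, and a receiver who only trusts the digest can check blocks one at a time, walking from the last block back to the initial value. The names `compress`, `pad`, `chain`, `nodes_backwards`, `verify_backwards` and `git_blob` are made up for illustration, and the back-to-front verification order is my assumption about how this would be used.

```python
import hashlib
import struct

# SHA-1 initial chaining value (the "leaf" at the bottom of the unbalanced tree).
IV = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)

def _rotl(x, n):
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def compress(state, block):
    """SHA-1 compression function: (chaining value, 64-byte block) -> chaining value."""
    assert len(block) == 64
    w = list(struct.unpack(">16I", block))
    for i in range(16, 80):
        w.append(_rotl(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1))
    a, b, c, d, e = state
    for i in range(80):
        if i < 20:
            f, k = (b & c) | (~b & d), 0x5A827999
        elif i < 40:
            f, k = b ^ c ^ d, 0x6ED9EBA1
        elif i < 60:
            f, k = (b & c) | (b & d) | (c & d), 0x8F1BBCDC
        else:
            f, k = b ^ c ^ d, 0xCA62C1D6
        t = (_rotl(a, 5) + f + e + k + w[i]) & 0xFFFFFFFF
        a, b, c, d, e = t, a, _rotl(b, 30), c, d
    return tuple((s + v) & 0xFFFFFFFF for s, v in zip(state, (a, b, c, d, e)))

def pad(total_len):
    """Standard SHA-1 padding for a message of total_len bytes."""
    return b"\x80" + b"\x00" * ((55 - total_len) % 64) + struct.pack(">Q", total_len * 8)

def chain(data):
    """All chaining values: chain(data)[0] is IV, chain(data)[-1] is the digest."""
    padded = data + pad(len(data))
    states = [IV]
    for off in range(0, len(padded), 64):
        states.append(compress(states[-1], padded[off:off + 64]))
    return states

def digest(state):
    return struct.pack(">5I", *state).hex()

def nodes_backwards(data):
    """Sender side: yield (previous chaining value, block), last block first."""
    padded = data + pad(len(data))
    states = chain(data)
    for i in range(len(states) - 1, 0, -1):
        yield states[i - 1], padded[(i - 1) * 64:i * 64]

def verify_backwards(root, nodes):
    """Receiver side: trust only `root`, verify each node before using it."""
    expected, blocks = root, []
    for prev_state, block in nodes:
        if compress(prev_state, block) != expected:
            raise ValueError("block does not hash to the expected chaining value")
        blocks.append(block)
        expected = prev_state  # now verified, so it can authenticate the next node
    if expected != IV:
        raise ValueError("chain does not end at the SHA-1 initial value")
    return b"".join(reversed(blocks))  # the padded message

def git_blob(content):
    """Git hashes a blob as SHA-1 over b'blob <size>\\0' followed by the content."""
    return b"blob %d\x00" % len(content) + content

# The last chaining value is exactly the ordinary Git object id.
obj = git_blob(b"hello world\n" * 50)
assert digest(chain(obj)[-1]) == hashlib.sha1(obj).hexdigest()
```

Every node here is a 20-byte chaining value plus a 64-byte block, so no single transfer has to carry the whole object, at the cost of a chain as deep as the object is long. A real design would presumably batch many blocks per node; this sketch keeps one block per node only to show the structure.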

(For some background, I have worked on https://www.softwareheritage.org/2022/02/10/building-bridge-to-the-software-heritage-archive/ and https://github.com/ipfs/devgrants/blob/master/open-grants/open-proposal-nix-ipfs.md. The former is completely done; the latter was also completed, but is only more recently getting upstreamed, see https://github.com/NixOS/rfcs/blob/master/rfcs/0133-git-hashing.md and NixOS/nix#8919. The stumbling blocks have always been (1) pipeline latency with vanilla bitswap, and (2) the MTU inducing a max object size. (2) is the more fundamental issue. I have brought up what I am proposing here before, in protocol/beyond-bitswap#30, but lack the ability to make it happen on my own. I have also co-mentored the GSoC project for https://github.com/theupdateframework/tap19-ipfs-poc.)

I get that what I am asking for might sound like "hi, I see you support IPv6, can you also please support IPv4", but I maintain it is not that bad, because SHA-1 cannot zombie onward in perpetuity the way IPv4 can. Likewise, I am not asking for SHA-256, precisely because SHA-256, being much healthier than SHA-1, does have that "zombie onward" potential.

If you are willing to do this, as a token of my gratitude I would gladly do what I can to help convince Nix, Software Heritage, The Update Framework, and even Git to support Blake3 hashing for content-addressing source code. Again, I totally believe that proper balanced tree hashing is the right way forward on technical grounds. I just think people need to see how nice end-to-end content-addressing is in order to overcome all the technical debt to get us there.

Status

✅ Done
