Commit 77dd841

docs: explain why CIDs are not file hashes
- add "Why the hashes differ" section explaining root block hashing
- document when CID hash equals file hash (raw codec, single block)
- list factors affecting CID: chunk size, DAG layout, codec, version, hash
- explain flexibility as feature with tradeoffs for different use cases
- add DAG Explorer link alongside CID Inspector in quickstart
- link quickstart caveat to detailed explanation in concepts
- add public gateways link to public-utilities page

1 parent a05e982 commit 77dd841

2 files changed: +45 -2 lines changed

docs/concepts/content-addressing.md

Lines changed: 41 additions & 0 deletions
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match
 
 As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.
 
+### Why the hashes differ
+
+The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.
+
+When you add a file to IPFS, the data goes through several transformations:
+
+1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256KiB-1MiB each)
+2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
+3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure
+
+The root block contains links to all the other blocks, and it's this root block that gets hashed to produce the Multihash in your CID.
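
The effect described in the added paragraph above can be sketched with a toy two-level hash tree. This is not real UnixFS or dag-pb encoding: the "root block" here is just the concatenated chunk digests, and the name `toy_root_hash` and the chunk sizes are assumptions chosen for illustration.

```python
import hashlib


def toy_root_hash(data: bytes, chunk_size: int) -> str:
    """Hash a stand-in 'root block' built from chunk digests (toy model, not UnixFS)."""
    # Chunking: split the input into fixed-size pieces.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Structuring: each chunk is referenced by its own digest (a stand-in for a link).
    links = [hashlib.sha256(chunk).digest() for chunk in chunks]
    # The hypothetical root block holds only the links, not the file bytes.
    root_block = b"".join(links)
    return hashlib.sha256(root_block).hexdigest()


data = bytes(range(256)) * 4096  # 1 MiB of deterministic sample bytes
file_hash = hashlib.sha256(data).hexdigest()
root_256k = toy_root_hash(data, 256 * 1024)
root_512k = toy_root_hash(data, 512 * 1024)

print("root hash equals file hash:", root_256k == file_hash)
print("same bytes, different chunk size, same root:", root_256k == root_512k)
```

In this toy model both comparisons come out false: the root hash covers the links rather than the file bytes, and changing the chunk size changes the set of links.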
+
+#### When CID hash equals file hash
+
+There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the Multihash is a direct hash of the file contents.
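
For the single-block `raw` case, a CIDv1 can be assembled by hand, which makes the relationship to the file hash visible. The sketch below assumes sha2-256 and the default base32 text encoding. The helper names `raw_cidv1` and `digest_from_cid` are hypothetical, and the code does not model chunking, so it only applies to content that fits in one block.

```python
import base64
import hashlib


def raw_cidv1(data: bytes) -> str:
    """Build a base32 CIDv1 string for a single raw block hashed with sha2-256."""
    digest = hashlib.sha256(data).digest()
    # 0x01 = CID version 1, 0x55 = raw codec,
    # 0x12 = sha2-256 multihash code, 0x20 = digest length (32 bytes)
    cid_bytes = bytes([0x01, 0x55, 0x12, 0x20]) + digest
    body = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + body  # "b" is the multibase prefix for lowercase base32


def digest_from_cid(cid: str) -> bytes:
    """Recover the 32-byte digest from a CID produced by raw_cidv1."""
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding stripped above
    return base64.b32decode(body)[4:]  # skip version, codec, hash code, length


content = b"hello\n"
cid = raw_cidv1(content)
print(cid)
print(digest_from_cid(cid).hex() == hashlib.sha256(content).hexdigest())
```

The digest embedded in the CID is byte-for-byte the sha2-256 of the content, which is why a plain checksum matches in this one case.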
+
+#### Same file, different CIDs
+
+Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:
+
+- **Chunk size**: Different chunking strategies produce different block trees
+- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
+- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
+- **CID version**: [CIDv0](glossary.md#cid-v0) vs [CIDv1](glossary.md#cid-v1) use different formats
+- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
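
To illustrate the CID version item in the list above: the same sha2-256 multihash can be written as a CIDv0 (base58btc, implicit dag-pb) or as a CIDv1 (explicit version and codec, base32 by default). In this sketch the hashed bytes are an arbitrary placeholder, not a real dag-pb root block, and the helper names are hypothetical.

```python
import base64
import hashlib

B58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58btc(data: bytes) -> str:
    """Encode bytes with the Bitcoin base58 alphabet."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number:
        number, remainder = divmod(number, 58)
        encoded = B58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def both_cid_versions(block: bytes) -> tuple[str, str]:
    """Return (CIDv0, CIDv1) strings that share one sha2-256 multihash."""
    # Multihash = hash code (0x12), digest length (0x20), then the digest.
    multihash = bytes([0x12, 0x20]) + hashlib.sha256(block).digest()
    cid_v0 = base58btc(multihash)  # CIDv0 is just the base58btc multihash
    # CIDv1 adds the version (0x01) and the dag-pb codec (0x70) in front.
    v1_bytes = bytes([0x01, 0x70]) + multihash
    cid_v1 = "b" + base64.b32encode(v1_bytes).decode("ascii").lower().rstrip("=")
    return cid_v0, cid_v1


# Placeholder bytes standing in for an encoded root block (assumption).
cid_v0, cid_v1 = both_cid_versions(b"placeholder root block bytes")
print(cid_v0)
print(cid_v1)
```

Both strings point at the same block, yet they differ as text, so comparing CIDs as plain strings can report a mismatch even when the underlying multihash is identical.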
+
+#### Why this flexibility matters
+
+This is a feature, not a limitation. Different structures optimize for different use cases:
+
+- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, trickle DAGs optimize for sequential, append-only data like logs
+- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
+- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
+- **Directory sharding** threshold, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT](glossary.md#hamt-sharding) to seamlessly support huge directories with millions of files. This threshold also affects how much of the DAG needs to be recreated when a single file in the directory is modified
+
+[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.
+
+When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.
+
+To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+
 
 ## CID versions

docs/quickstart/pin-cli.md

Lines changed: 4 additions & 2 deletions
@@ -134,10 +134,10 @@ bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4
 
 Once you have a CID, you can share it with anyone and they can fetch the file using IPFS.
 
-To see what's in the anatomy of a CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To explore the anatomy of the DAG behind a CID, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
 
 :::callout
-**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. Work is underway on [CID profiles](https://github.com/ipfs/specs/pull/499), which aims to address this by defining standard parameter sets to make CIDs reproducible and verifiable.
+**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. See [CIDs are not file hashes](../concepts/content-addressing.md#cids-are-not-file-hashes) for details.
 :::
 
 ## Retrieving with a gateway
@@ -167,6 +167,8 @@ curl https://[BUCKET_NAME].ipfs.filebase.io/ipfs/[CID]
 
 ### Using public gateways
 
+You can also use [public IPFS gateways](../concepts/public-utilities.md#public-ipfs-gateways):
+
 ```shell
 curl https://ipfs.io/ipfs/[CID]
 # or
