- add "Why the hashes differ" section explaining root block hashing
- document when CID hash equals file hash (raw codec, single block)
- list factors affecting CID: chunk size, DAG layout, codec, version, hash
- explain flexibility as feature with tradeoffs for different use cases
- add DAG Explorer link alongside CID Inspector in quickstart
- link quickstart caveat to detailed explanation in concepts
- add public gateways link to public-utilities page
docs/concepts/content-addressing.md (41 additions, 0 deletions)
@@ -87,6 +87,47 @@ shasum: WARNING: 1 computed checksum did NOT match
As we can see, the hash included in the CID does NOT match the hash of the input file `ubuntu-20.04.1-desktop-amd64.iso`.
### Why the hashes differ

The example above shows that the [Multihash](glossary.md#multihash) inside a CID does not match a simple file checksum. This is because the Multihash is the hash of the [root block](glossary.md#root), not a direct hash of the file's bytes.

When you add a file to IPFS, the data goes through several transformations:

1. **Chunking**: Large files are split into smaller [blocks](glossary.md#block) (typically 256KiB-1MiB each)
2. **Structuring**: These blocks are organized into a [DAG](glossary.md#dag) (directed acyclic graph)
3. **Encoding**: A [codec](glossary.md#codec) wraps the data with metadata describing its structure

The root block contains links to all the other blocks, and it's this root block that gets hashed to produce the Multihash in your CID.
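
The three steps above can be sketched with a toy model. This is only an illustration under loud assumptions: the function `toy_root_hash`, the 256-byte chunk size, and the "size header plus concatenated link hashes" root encoding are all made up for this example and are not the real UnixFS/dag-pb encoding. The point is only that hashing a root block of links gives a different digest than hashing the file's bytes.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw sha2-256 digest of some bytes."""
    return hashlib.sha256(data).digest()


def toy_root_hash(data: bytes, chunk_size: int) -> bytes:
    """Hash a toy root block that links to fixed-size leaf blocks.

    Hypothetical encoding for illustration only (NOT dag-pb/UnixFS).
    """
    # 1. Chunking: split the file into fixed-size leaf blocks.
    leaves = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # 2. Structuring: the root refers to each leaf by the leaf's hash.
    links = [sha256(leaf) for leaf in leaves]
    # 3. Encoding: a made-up wrapper (total size + concatenated links).
    root_block = len(data).to_bytes(8, "big") + b"".join(links)
    # The identifier is the hash of the root block, not of the file bytes.
    return sha256(root_block)


data = b"hello ipfs " * 1000
file_hash = sha256(data)                          # what `shasum -a 256` would report
root_hash = toy_root_hash(data, chunk_size=256)   # what a CID-like identifier would carry

print("file hash:", file_hash.hex())
print("root hash:", root_hash.hex())
print("equal:", file_hash == root_hash)
```

The two digests differ because the input to the final hash is the root block (metadata and links), not the original file.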
#### When CID hash equals file hash

There is one case where the Multihash does equal the file's hash: when the CID uses the `raw` [codec](glossary.md#codec) and the file fits in a single block. The `raw` codec stores bytes without any wrapper, so for small files added with `--raw-leaves`, the Multihash is a direct hash of the file contents.
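
As a minimal sketch of this special case, the snippet below builds a CIDv1 string by hand for a small piece of content, assuming it fits in a single block, uses the `raw` codec (multicodec code `0x55`), sha2-256 (multihash code `0x12`), and the base32 multibase prefix `b`. The helper names `raw_cidv1` and `digest_from_cid` are hypothetical and only exist for this example. It then decodes the CID again to show that the embedded digest is exactly the sha2-256 of the content.

```python
import base64
import hashlib

CID_VERSION_1 = 0x01
RAW_CODEC = 0x55   # multicodec code for `raw`
SHA2_256 = 0x12    # multihash code for sha2-256


def raw_cidv1(data: bytes) -> str:
    """Build a base32 CIDv1 (raw codec, sha2-256) for single-block content."""
    digest = hashlib.sha256(data).digest()
    # Multihash = <hash function code><digest length><digest>
    multihash = bytes([SHA2_256, len(digest)]) + digest
    # CIDv1 = <version><codec><multihash>; every code here fits in one varint byte.
    cid_bytes = bytes([CID_VERSION_1, RAW_CODEC]) + multihash
    b32 = base64.b32encode(cid_bytes).decode("ascii").lower().rstrip("=")
    return "b" + b32  # "b" is the multibase prefix for lowercase base32


def digest_from_cid(cid: str) -> bytes:
    """Extract the digest from a CID produced by raw_cidv1."""
    b32 = cid[1:].upper()
    b32 += "=" * (-len(b32) % 8)     # restore the padding that was stripped
    cid_bytes = base64.b32decode(b32)
    return cid_bytes[4:]             # skip version, codec, hash code, length


content = b"hello world\n"
cid = raw_cidv1(content)
print(cid)
print(digest_from_cid(cid).hex() == hashlib.sha256(content).hexdigest())
```

CIDs built this way start with `bafkrei`, which is simply the base32 rendering of the fixed version, codec, hash code, and length bytes.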
#### Same file, different CIDs

Two identical files can produce different CIDs. The CID depends on both the content *and* how that content is structured:

- **Chunk size**: Different chunking strategies produce different block trees
- **DAG layout**: Balanced trees vs. trickle DAGs organize blocks differently
- **Codec**: [UnixFS](glossary.md#unixfs) ([dag-pb](glossary.md#dag-pb)), [dag-cbor](glossary.md#dag-cbor), `raw`, and others each encode data differently
- **CID version**: [CIDv0](glossary.md#cid-v0) vs [CIDv1](glossary.md#cid-v1) use different formats
- **Hash algorithm**: sha2-256, blake3, and others produce different hashes
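
To make two of these factors concrete, here is a small sketch that feeds the same bytes through a toy "hash of leaf hashes" construction with different chunk sizes and hash algorithms. The function `toy_root_digest` and its flat root layout are assumptions made for this example, not how UnixFS builds its DAG, and sha2-512 stands in for "another hash algorithm" because blake3 is not in the Python standard library.

```python
import hashlib


def toy_root_digest(data: bytes, chunk_size: int, algo: str = "sha256") -> bytes:
    """Hash the concatenated leaf digests (a made-up, flat root layout)."""
    leaves = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    links = b"".join(hashlib.new(algo, leaf).digest() for leaf in leaves)
    return hashlib.new(algo, links).digest()


data = bytes(range(256)) * 64  # the same 16 KiB of content for every variant

variants = {
    "sha256, 1024-byte chunks": toy_root_digest(data, 1024, "sha256"),
    "sha256, 4096-byte chunks": toy_root_digest(data, 4096, "sha256"),
    "sha512, 1024-byte chunks": toy_root_digest(data, 1024, "sha512"),
}

for label, digest in variants.items():
    print(f"{label}: {digest.hex()[:16]}...")

# Identical content, yet every parameter combination yields its own root digest.
print(len(set(variants.values())) == len(variants))
```

Changing only the chunk size changes the number of links in the root, and changing only the hash algorithm changes every link, so each variant ends up with a different identifier.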
#### Why this flexibility matters

This is a feature, not a limitation. Different structures optimize for different use cases:

- **DAG layout** trades off seeking against appending: balanced DAGs enable fast random access in large files like videos, trickle DAGs optimize for sequential, append-only data like logs
- **Chunking strategy** balances retrieval overhead against sync efficiency: large chunks mean fewer blocks for bulk downloads, small chunks mean less data to transfer when syncing deltas. Strategies range from simple fixed-size chunking to content-defined algorithms like Rabin or Buzhash that fine-tune deduplication based on dataset characteristics
- **Hash function** varies by system: legacy decisions, regulatory requirements, or interoperability needs may dictate which algorithm to use
- **Directory sharding** threshold, in systems like [UnixFS](glossary.md#unixfs), determines when directories switch from flat listings to [HAMT](glossary.md#hamt-sharding) to seamlessly support huge directories with millions of files. This threshold also affects how much of the DAG needs to be recreated when a single file in the directory is modified
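
The chunking tradeoff in the list above can be illustrated with a deliberately simplified comparison. The sketch below contrasts fixed-size chunking with a toy content-defined chunker that cuts wherever the sum of the last three bytes is divisible by 16. That window-sum rule is an assumption chosen to keep the example short; it is not Rabin or Buzhash, and the function names are hypothetical.

```python
def fixed_chunks(data: bytes, size: int) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 3, modulus: int = 16) -> list:
    """Toy content-defined chunking: cut after byte i when the sum of the
    last `window` bytes is divisible by `modulus`."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if sum(data[i - window + 1:i + 1]) % modulus == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last cut
    return chunks


def shared(a: list, b: list) -> int:
    """Count distinct chunks that appear in both chunk lists."""
    return len(set(a) & set(b))


original = bytes(range(256))
edited = b"\x00" + original  # insert a single byte at the front

fixed_shared = shared(fixed_chunks(original, 16), fixed_chunks(edited, 16))
cdc_shared = shared(content_defined_chunks(original), content_defined_chunks(edited))

print("chunks reused with fixed-size chunking:     ", fixed_shared)
print("chunks reused with content-defined chunking:", cdc_shared)
```

With this input, the one-byte insertion shifts every fixed-size boundary, so no chunk is reused, while the toy content-defined chunker re-synchronizes after the first cut and reuses 15 of its 16 chunks. Real algorithms add minimum and maximum chunk sizes and stronger rolling hashes, which this sketch omits.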
[UnixFS](glossary.md#unixfs) is the default format for files and directories, but you can use other codecs or create custom ones for specialized needs.

When you need reproducible CIDs across different tools, the community documents common parameter sets called [CID profiles](https://github.com/ipfs/specs/pull/499). These define standard combinations of chunking, DAG layout, and codec settings.

To explore how a CID is structured, use the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To see the DAG behind a CID, use the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
Once you have a CID, you can share it with anyone and they can fetch the file using IPFS.
-To see what's in the anatomy of a CID, check out the [CID inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
+To explore the anatomy of a CID, check out the [CID Inspector](https://cid.ipfs.tech/#bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4). To explore the anatomy of the DAG behind a CID, check out the [DAG Explorer](https://explore.ipld.io/#/explore/bafybeicn7i3soqdgr7dwnrwytgq4zxy7a5jpkizrvhm5mv6bgjd32wm3q4).
:::callout
-**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. Work is underway on [CID profiles](https://github.com/ipfs/specs/pull/499), which aims to address this by defining standard parameter sets to make CIDs reproducible and verifiable.
+**Important caveat:** Two identical files can produce different CIDs. The CID reflects the contents *and* how the file is processed: chunk size, DAG layout, hash algorithm, CID version, and other [UnixFS](https://specs.ipfs.tech/unixfs/) parameters. The same file processed with different parameters will produce different CIDs. See [CIDs are not file hashes](../concepts/content-addressing.md#cids-are-not-file-hashes) for details.