Content Identifiers (CID)

Overview

Content Identifiers (CIDs) are foundational to content-addressable storage (CAS) systems. A CID is a self-describing reference that is derived from the content itself rather than from its location, and that is unique in practice. This lets systems retrieve, verify, link, and manage data efficiently and securely, and it underpins immutable, decentralized storage in systems such as IPFS, IPLD, and DefraDB.

Why Content Identifiers Matter

Traditional web addresses (URLs) tell you where data lives—on a specific server, at a specific location.

Content Identifiers tell you what the data is—a unique fingerprint of the content itself.

This fundamental shift enables:

Decentralized architecture: Any node can serve data, not just the original source
Self-verifying data: Content proves its own integrity through cryptographic hashing
Permanent links: References that stay valid when data moves, as long as at least one node still holds the content
Automatic deduplication: The same content always has the same identifier, eliminating redundant storage
True data portability: Content can move freely between platforms while maintaining its identity

This transformation from location-based to content-based addressing is as significant as the shift from IP addresses to domain names—but instead of making locations human-readable, CIDs make content itself addressable.

Content Identifier Basics

A CID uniquely identifies data by combining a cryptographic hash with encoding metadata. This makes a CID:

Deterministic: No randomness—the same input always yields the same CID
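Consistent across locations: The same content, hashed with the same function and labeled with the same codec and CID version, produces the same CID wherever it is stored
Unique: Different content results in different CIDs, barring a hash collision, which is computationally infeasible to find for a secure hash function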
Self-describing: The identifier encodes what the data is and how to verify it

Understanding Cryptographic Hashes

To understand how CIDs achieve these properties, we first need to understand the cryptographic hashes that power them.

A cryptographic hash is a mathematical function that takes input data of any size and transforms it into a fixed-length string of bits, called a hash value or digest. This process is:

Deterministic: The same input always produces the same output
Collision-resistant: It is computationally infeasible to find two different inputs that produce the same output
One-way: It is computationally infeasible to recover the original data from its hash
Sensitive: Even a tiny change in input results in a completely different hash value
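These properties can be observed directly with a standard hash function. The following is a minimal sketch using SHA-256 from Python's standard hashlib module; the helper name sha256_hex and the sample inputs are illustrative choices, not part of any CID library.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

a = sha256_hex(b"hello world")
b = sha256_hex(b"hello world")    # same input again
c = sha256_hex(b"hello world!")   # one extra byte

# Deterministic: identical input gives an identical digest.
print(a == b)
# Sensitive: a one-byte change gives a different digest.
print(a == c)
# Fixed length: 256 bits is 64 hex characters, whatever the input size.
print(len(a), len(sha256_hex(b"x" * 1_000_000)))
```

One-wayness and collision resistance cannot be demonstrated by running code; they are security assumptions about the hash function itself.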

Content-Addressable Storage (CAS) uses these cryptographic fingerprints to store and access data, ensuring integrity and enabling efficient deduplication.

Key CID Properties

Property	Description	Technical Benefit
Immutability	Any change to content changes the CID	Enables trustless verification
Deduplication	Same content anywhere yields the same CID	Identical blocks are stored once, however many times they are referenced
Integrity verification	Retrieved data can be re-hashed and compared against its CID	Cryptographic proof that the data is unmodified (this does not prove who created it)
Versioning	Unique CIDs support tracking content over time	Implicit version control
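The properties in the table can be illustrated with a toy content-addressable store. This is a sketch under simplifying assumptions: the ContentStore class is hypothetical, it keys blocks by a bare SHA-256 hex digest rather than a full CID, and it keeps everything in memory.

```python
import hashlib

class ContentStore:
    """Toy in-memory content-addressable store keyed by SHA-256 hex digest."""

    def __init__(self) -> None:
        self._blocks = {}  # maps hex digest -> bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Deduplication: identical content maps to the same key, stored once.
        self._blocks.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        data = self._blocks[key]
        # Integrity verification: re-hash before trusting the bytes.
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("stored data does not match its identifier")
        return data

    def __len__(self) -> int:
        return len(self._blocks)

store = ContentStore()
v1 = store.put(b"report draft")
dup = store.put(b"report draft")         # same content, same key
v2 = store.put(b"report draft, rev 2")   # edited content, new key

print(v1 == dup, v1 != v2, len(store))
```

Keeping v1 and v2 side by side is what the table calls implicit version control: the old identifier still resolves to the old bytes, and the edit gets an identifier of its own.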

CID Structure

With these fundamentals in place, let's examine how CIDs are actually structured and what each component does.

Visual Overview

A CID consists of multiple components that work together to create a self-describing content identifier:

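bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi
│└────────────────────────────┬───────────────────────────┘
│                             └── Base32 encoding of <version><multicodec><multihash>
└── Multibase prefix (b = base32)

Decoded, the binary CID begins with the bytes 0x01 (CID version 1), 0x70 (dag-pb multicodec), 0x12 (SHA-256 hash function), and 0x20 (a 32-byte digest), followed by the digest itself. Because base32 packs 5 bits into each character, these fields do not fall on character boundaries; the familiar "bafybei" prefix is simply how those four header bytes look once encoded.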

Component Breakdown

Multibase prefix: Indicates which base encoding (for example base32 or base58btc) was used to turn the binary CID into text. This allows the same CID to be represented in different string formats for different use cases
Multicodec: Specifies what format the content uses (raw bytes, JSON, CBOR, etc.). This tells systems how to interpret the data
Multihash: Contains the actual cryptographic fingerprint of your content, along with information about which hash function was used
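To make the role of the multibase prefix concrete, the sketch below takes the example CID used in this section, decodes its base32 form to raw bytes, and re-encodes the same bytes as base16. The function names are illustrative, and only the b (base32) and f (base16) prefixes are handled.

```python
import base64

CID = "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"

def decode_base32_cid(cid: str) -> bytes:
    """Decode a multibase base32 ('b') CID string into its binary form."""
    if not cid.startswith("b"):
        raise ValueError("only the base32 multibase prefix 'b' is handled here")
    body = cid[1:].upper()
    # Multibase base32 omits '=' padding; base64.b32decode requires it.
    body += "=" * (-len(body) % 8)
    return base64.b32decode(body)

def encode_base16_cid(raw: bytes) -> str:
    """Encode binary CID bytes with the base16 multibase prefix 'f'."""
    return "f" + raw.hex()

raw = decode_base32_cid(CID)
as_hex = encode_base16_cid(raw)

# Same identifier, different text form. In hex the header bytes are visible:
# 01 (version), 70 (dag-pb), 12 (sha2-256), 20 (32-byte digest).
print(as_hex[:9])
```

Both strings identify the same content; only the text representation differs.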

Technical Components

Component	Description	Details	Example Values
Multibase prefix	Specifies encoding format	First character(s) of the CID	`b` (base32), `z` (base58btc), `f` (base16)
Multicodec	Identifies content type/format	Varint-encoded codec identifier	`0x70` (dag-pb), `0x71` (dag-cbor), `0x55` (raw)
Multihash	Hash function and digest	Function ID + digest length + digest	`0x12` (sha2-256), `0xb220` (blake2b-256), `0x16` (sha3-256)
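The components in the table can be assembled by hand. The following is a minimal sketch, not a replacement for a real CID library: it builds a CIDv1 string for a single block of bytes using only the Python standard library. The function names are hypothetical, and 0x12 is the multihash code for SHA-256.

```python
import base64
import hashlib

RAW = 0x55        # multicodec: raw bytes
DAG_PB = 0x70     # multicodec: dag-pb
SHA2_256 = 0x12   # multihash function code for SHA-256

def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def multihash_sha256(data: bytes) -> bytes:
    """Function ID + digest length + digest."""
    digest = hashlib.sha256(data).digest()
    return varint(SHA2_256) + varint(len(digest)) + digest

def cid_v1(data: bytes, codec: int = RAW) -> str:
    """Build <version><codec><multihash> and encode it as multibase base32."""
    binary = varint(1) + varint(codec) + multihash_sha256(data)
    text = base64.b32encode(binary).decode("ascii").lower().rstrip("=")
    return "b" + text  # 'b' is the multibase prefix for base32

cid = cid_v1(b"hello world")
print(cid)
```

This hashes the bytes exactly as given. Tools such as IPFS may chunk a file and wrap it in dag-pb nodes before hashing, so the CID they report for the same file can differ from the one computed here.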

CID Versions

Version	Details	Example CID	Binary Structure
CIDv0	Base58btc encoding, supports only dag-pb and SHA-256	`QmYwAPJzv5CZsnA...`	`<multihash>` only
CIDv1	Supports multiple codecs, hash functions, and encodings	`bafybeigdyrzt5sf...`	`<version><codec><multihash>`
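Going the other way, a CIDv1 string can be split back into the fields listed in the Binary Structure column. The sketch below handles only base32-encoded CIDv1 strings and merely detects CIDv0 by its shape (46 characters starting with Qm); decoding base58btc is left out. ParsedCid and the function names are illustrative.

```python
import base64
from typing import NamedTuple, Tuple

class ParsedCid(NamedTuple):
    version: int
    codec: int
    hash_code: int
    digest: bytes

def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned varint at buf[pos:]; return (value, next_pos)."""
    value = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def parse_cid_v1_base32(cid: str) -> ParsedCid:
    if len(cid) == 46 and cid.startswith("Qm"):
        raise ValueError("looks like CIDv0 (base58btc); not handled in this sketch")
    if not cid.startswith("b"):
        raise ValueError("expected multibase prefix 'b' (base32)")
    body = cid[1:].upper()
    raw = base64.b32decode(body + "=" * (-len(body) % 8))
    version, pos = read_varint(raw, 0)
    codec, pos = read_varint(raw, pos)
    hash_code, pos = read_varint(raw, pos)
    length, pos = read_varint(raw, pos)
    digest = raw[pos:pos + length]
    if version != 1 or len(digest) != length:
        raise ValueError("not a well-formed CIDv1")
    return ParsedCid(version, codec, hash_code, digest)

parsed = parse_cid_v1_base32(
    "bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi"
)
print(parsed.version, hex(parsed.codec), hex(parsed.hash_code), len(parsed.digest))
```

A CIDv0 string carries no version, codec, or multibase prefix at all: it is just the base58btc text of a SHA-256 multihash, with dag-pb implied.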
