Skip to main content

Content Identifiers (CID)

Overview

Content Identifiers (CIDs) are foundational in content-addressable storage (CAS) systems, providing a globally unique, self-describing reference to digital content based on its data rather than its location. CIDs enable systems to efficiently and securely retrieve, verify, link, and manage data, enabling immutable and decentralized data storage solutions such as IPFS, IPLD, and DefraDB.

Why Content Identifiers Matter

Traditional web addresses (URLs) tell you where data lives—on a specific server, at a specific location.

Content Identifiers tell you what the data is—a unique fingerprint of the content itself.

This fundamental shift enables:

  • Decentralized architecture: Any node can serve data, not just the original source
  • Self-verifying data: Content proves its own integrity through cryptographic hashing
  • Permanent links: References that never break, even when data moves
  • Automatic deduplication: The same content always has the same identifier, eliminating redundant storage
  • True data portability: Content can move freely between platforms while maintaining its identity

This transformation from location-based to content-based addressing is as significant as the shift from IP addresses to domain names—but instead of making locations human-readable, CIDs make content itself addressable.

Content Identifier Basics

A CID uniquely identifies data by combining a cryptographic hash with encoding metadata. This makes a CID:

  • Deterministic: No randomness—the same input always yields the same CID
  • Consistent across locations: The same content always produces the same CID
  • Unique: Different content results in different CIDs
  • Self-describing: The identifier encodes what the data is and how to verify it

Understanding Cryptographic Hashes

To understand how CIDs achieve these properties, we first need to understand the cryptographic hashes that power them.

A cryptographic hash is a mathematical function that takes input data of any size and transforms it into a fixed-length string of bits, called a hash value or digest. This process is:

  • Deterministic: The same input always produces the same output
  • Collision-resistant: Different inputs produce different outputs
  • One-way: Cannot reverse the hash to get the original data
  • Sensitive: Even a tiny change in input results in a completely different hash value

Content-Addressable Storage (CAS) uses these cryptographic fingerprints to store and access data, ensuring integrity and enabling efficient deduplication.

Key CID Properties

PropertyDescriptionTechnical Benefit
ImmutabilityAny change to content changes the CIDEnables trustless verification
DeduplicationSame content anywhere yields the same CIDReduces storage significantly in typical datasets
Integrity verificationCIDs ensure the authenticity of retrieved dataCryptographic proof of data integrity
VersioningUnique CIDs support tracking content over timeImplicit version control

CID Structure

With these fundamentals in place, let's examine how CIDs are actually structured and what each component does.

Visual Overview

A CID consists of multiple components that work together to create a self-describing content identifier:

bafybeigdyrzt5sfp7udm7hu76uh7y26nf3efuylqabf3oclgtqy55fbzdi
│ └┬┘└───────────────────────────────────────────────────────┘
│ │ │
│ │ └── Base32-encoded multihash
│ └──────────────────────────────── Multicodec (dag-pb)
└──────────────────────────────────── Multibase prefix (b = base32)

Component Breakdown

  • Multibase prefix: Indicates how the CID is encoded (like choosing between binary and text). This allows CIDs to be represented in different formats for different use cases
  • Multicodec: Specifies what format the content uses (raw bytes, JSON, CBOR, etc.). This tells systems how to interpret the data
  • Multihash: Contains the actual cryptographic fingerprint of your content, along with information about which hash function was used

Technical Components

ComponentDescriptionDetailsExample Values
Multibase prefixSpecifies encoding formatFirst character(s) of the CIDb (base32), z (base58btc), f (base16)
MulticodecIdentifies content type/formatVarint-encoded codec identifier0x70 (dag-pb), 0x71 (dag-cbor), 0x55 (raw)
MultihashHash function and digestFunction ID + digest length + digestSHA-256, Blake2b-256, SHA-3

CID Versions

VersionDetailsExample CIDBinary Structure
CIDv0Base58btc encoding, supports only dag-pb and SHA-256QmYwAPJzv5CZsnA...<multihash> only
CIDv1Supports multiple codecs, hash functions, and encodingsbafybeigdyrzt5sf...<version><codec><multihash>