Iroh and DAGs
by Rüdiger Klaehn

Problem
By default, iroh has a very minimal set of primitives for content-addressed data. We offer BLAKE3 hashed blobs of arbitrary size, with a very rich set of capabilities including verified range requests, verified streaming and size proofs. In addition, we offer a single container type: a sequence of BLAKE3 hashes. Using these two primitives, we provide specialized collections of data. A common pattern is to use a hash sequence, with the first child being a blob of self-describing metadata. These primitives are sufficiently versatile to handle many use cases, such as file system sync in sendme.
Focusing on BLAKE3 and the various capabilities that BLAKE3 makes possible is the right choice for iroh. But that does not change the fact that sometimes you have different needs. You might want to use different hash functions. For example, you may have existing data that is hashed with the ubiquitous SHA-2 hash function, or you might need to use a hash function like poseidon that is more suitable to use in zero-knowledge proofs. Also, you might have data that naturally forms a directed acyclic graph (DAG), such as a git commit history, unixfs data, event logs, or even blockchains.
Iroh-blobs does not completely solve these use cases, but it provides a very useful building block for a possible solution. Today we will explore how to build a system that can sync DAGs using iroh-blobs as a building block. We'll use an IPLD graph as an example, but the same principles apply to other DAGs.
Local storage
The default iroh-blobs store does not contain information about the DAG structure. So, we need an additional persistent store for this information:
- a table that maps a generic IPLD hash (hash format and digest) to a BLAKE3 hash. (We must only ever populate this table if we have verified that the IPLD hash corresponds to the BLAKE3 hash. We must not trust external sources for this mapping!)
In addition, since extracting links out of IPLD data blocks can be somewhat expensive, we have an additional table to cache this step:
- a table that contains the IPLD links for an (IPLD format, BLAKE3 hash) tuple.
Using these two tables, we can cheaply traverse the DAG from a given root CID, for example to answer questions of liveness. To look up the links for a CID, you first look up the corresponding BLAKE3 hash for the hash format and digest, then look up the links by IPLD format and BLAKE3 hash.
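As a rough sketch, the two tables and that lookup could be modeled like this (hypothetical names; in-memory maps stand in for the persistent tables, which would live in an embedded database in a real implementation):

```rust
use std::collections::HashMap;

/// The hash part of a CID: hash function code plus digest.
#[derive(Clone, PartialEq, Eq, Hash)]
struct IpldHash {
    code: u64,       // multihash code, e.g. 0x12 for SHA2-256
    digest: Vec<u8>,
}

/// Stand-in for a full CID: IPLD codec ("format") plus hash.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Cid {
    codec: u64,      // e.g. 0x71 for dag-cbor, 0x55 for raw
    hash: IpldHash,
}

/// The two extra tables next to the iroh-blobs store.
#[derive(Default)]
struct DagStore {
    /// IPLD hash -> BLAKE3 hash. Only populated after verifying that the blob
    /// behind the BLAKE3 hash really has this IPLD hash. Never trust a remote
    /// for this mapping.
    hash_map: HashMap<IpldHash, [u8; 32]>,
    /// (IPLD codec, BLAKE3 hash) -> cached links of that block.
    links: HashMap<(u64, [u8; 32]), Vec<Cid>>,
}

impl DagStore {
    /// Look up the links of a block given its CID: first resolve the IPLD hash
    /// to a BLAKE3 hash, then look up the cached links by codec and BLAKE3 hash.
    fn links_for(&self, cid: &Cid) -> Option<&[Cid]> {
        let blake3 = self.hash_map.get(&cid.hash)?;
        self.links.get(&(cid.codec, *blake3)).map(Vec::as_slice)
    }
}
```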
As we will see, such traversals will also play a key role in the sync protocol.
Sync protocol
When syncing DAGs, especially deep DAGs such as the ones formed by event sourcing systems like actyx, IPFS, commit histories, or blockchains, it is essential to minimize the number of roundtrips. A naive protocol might do the DAG traversal by just asking for blobs level by level. Such a scheme could use the existing iroh-blobs protocol, but it would lead to a very large number of roundtrips for deep DAGs. While for some shallow, tree-like DAGs this would not be a big deal, for deep DAGs we would never be able to saturate a good network connection when syncing.
So we need a custom protocol that understands DAGs. Luckily it is very easy to define custom protocols in iroh, either by building a custom iroh node using iroh-net, or by using the new (as of v0.19) feature of custom protocol handlers.
Requirements
One of the requirements for a sync protocol is that the receiver must be able to validate that all data fulfills three criteria. It must
- correspond to the BLAKE3 hash. We can check this incrementally every 16 KiB as the data comes in.
- correspond to the IPLD hash. We can only check this once we have the complete blob, since the IPLD hash abstraction does not support incremental verification, even though some hash functions allow it.
- actually be part of the DAG we have asked for. Every content identifier (CID) we receive must be connected to one of the roots we want to sync.
The first two criteria are simple to implement. We can just use the iroh-blobs sync protocol (which is basically BLAKE3 verified streaming with a chunk group size of 16 KiB) to make sure that the data is correct. And we can just hash with the non-BLAKE3 hash function as soon as we have the complete blob.
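As a minimal sketch of that second check, assuming a SHA2-256 based CID and using the sha2 crate (a real implementation would dispatch on the multihash code in the CID):

```rust
use sha2::{Digest, Sha256};

/// Verify that `data` (a complete blob, already BLAKE3-verified on the wire)
/// also matches the expected non-BLAKE3 digest from the CID.
/// Sketch: only SHA2-256 is handled here.
fn verify_ipld_hash(expected_digest: &[u8], data: &[u8]) -> bool {
    let actual = Sha256::digest(data);
    actual.as_slice() == expected_digest
}
```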
But the third criterion means that the information about the CIDs must only come from the roots we have asked for or from data we have locally. Otherwise the remote could just send us an arbitrary CID referring to a lot of data, like rickroll.mp4, and we would not notice this quickly.
Request
For this experiment, we have decided to go with a simple request/response protocol. This makes it possible to omit any information that we know both sides have, making the protocol very minimal and efficient, similar to BLAKE3 verified streaming.
A request contains a configuration for a deterministic traversal of the DAG. The sender will execute this traversal on its local DAG and send information about the nodes it encounters to the requester. That information is either just the BLAKE3 hash, or the BLAKE3 hash and the complete data inline in bao format.
We don't include the CID, since as we have seen the information about the CID needs to come from the requester.
So how does the receiver know which CID the BLAKE3 hash that it receives corresponds to? It has to execute the same deterministic traversal as the sender locally and zip the sequence of blobs it receives from the sender with the locally generated sequence of CIDs.
Response
The response is a sequence of items, where each item is either a BLAKE3 hash or a BLAKE3 hash and the data for it.
Processing the response means zipping the sequence of response items with the locally generated sequence of CIDs, then validating the content of each blob using the IPLD hash in the CID. At this point it is safe to insert an entry into the IPLD hash to BLAKE3 hash mapping table. The next step will be to extract the links of the block and insert them into the links table.
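A minimal sketch of that processing loop, reusing the hypothetical types from the sketches above, with placeholder helpers (load_local_blob, extract_links) standing in for the real implementations:

```rust
/// One item of the response stream: the sender either inlines the data
/// or sends just the BLAKE3 hash.
enum ResponseItem {
    Hash([u8; 32]),
    HashAndData([u8; 32], Vec<u8>),
}

/// Placeholder: a real implementation would read the blob from the local
/// iroh-blobs store by its BLAKE3 hash.
fn load_local_blob(_hash: &[u8; 32]) -> Result<Vec<u8>, &'static str> {
    Err("blob not in local store")
}

/// Placeholder: a real implementation would decode the block with its codec
/// (dag-cbor, dag-pb, ...) and return the CIDs it links to.
fn extract_links(_codec: u64, _data: &[u8]) -> Result<Vec<Cid>, &'static str> {
    Ok(Vec::new())
}

/// Zip the response items with the locally generated CID sequence and update
/// the two tables. `local_cids` comes from running the same deterministic
/// traversal locally; `verify_ipld_hash` is the check from the sketch above.
fn process_response(
    store: &mut DagStore,
    local_cids: impl Iterator<Item = Cid>,
    items: impl Iterator<Item = ResponseItem>,
) -> Result<(), &'static str> {
    for (cid, item) in local_cids.zip(items) {
        let (blake3, data) = match item {
            ResponseItem::HashAndData(hash, data) => (hash, data),
            // hash-only items refer to data we are expected to have already
            ResponseItem::Hash(hash) => (hash, load_local_blob(&hash)?),
        };
        // 1. the blob must match the IPLD hash from the CID
        if !verify_ipld_hash(&cid.hash.digest, &data) {
            return Err("blob does not match IPLD hash");
        }
        // 2. only now is it safe to record the IPLD hash -> BLAKE3 mapping
        store.hash_map.insert(cid.hash.clone(), blake3);
        // 3. extract and cache the links so later traversals are cheap
        let links = extract_links(cid.codec, &data)?;
        store.links.insert((cid.codec, blake3), links);
    }
    Ok(())
}
```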
At this point we know that the item is part of the DAG we have asked for, we have incrementally verified the BLAKE3 hash as we received the data, and we have verified the IPLD hash after receiving the complete blob.
Depending on the use case, there might be an additional step that checks some application level data inside the just received blob, such as signatures or checksums.
Deterministic traversals
As we have seen, we will need a way to define deterministic traversals. There are some trivial deterministic traversals that are generically useful, for example just a sequence of unrelated CIDs with no DAG traversal at all, which can be used to request a set of individual blocks.
But beyond that, there is a very large variety of possible traversals. You might want to traverse the DAG in different orders, e.g. depth-first, pre-order, left to right. You might want to limit the traversal to a certain DAG depth, or filter out leaves (CIDs with format raw).
But it gets more complex than this. You might want to look into the actual DAG data itself. E.g. only follow DAG-CBOR links that have the path /prev, and not ones that have the path /data, to follow just the stem of a linked-list-like DAG.
Or even more complex, only follow links in blocks where some sort of checksum or signature checks out.
You could try to design a generic language to specify (deterministic) DAG traversals. A protocol that takes this approach is graphsync. But especially for some of the more complex use cases mentioned above, you would end up having to define a Turing-complete language.
So, we have decided to just implement a number of useful traversals of varying levels of complexity, but allow implementing additional deterministic traversals in Rust. Of course this means that sync will have to be tailored to the needs of the application, and different applications will have different ALPNs for their sync protocols.
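To illustrate what implementing a traversal in Rust could look like (a sketch against the hypothetical DagStore from above, not the actual API of the experiment), here is a depth-first, pre-order, left-to-right traversal with an optional depth limit and a NoRaw-style filter:

```rust
/// A deterministic traversal produces a sequence of CIDs. Both sides run the
/// same traversal, so the order must depend only on the request and the DAG.
trait DagTraversal {
    fn run(&self, store: &DagStore) -> Vec<Cid>;
}

const RAW_CODEC: u64 = 0x55;

/// Depth-first, pre-order, left-to-right traversal with an optional depth
/// limit and an option to skip raw leaves (roughly the `NoRaw` filter).
struct Full {
    root: Cid,
    max_depth: Option<usize>,
    skip_raw: bool,
}

impl DagTraversal for Full {
    fn run(&self, store: &DagStore) -> Vec<Cid> {
        let mut out = Vec::new();
        let mut stack = vec![(self.root.clone(), 0usize)];
        while let Some((cid, depth)) = stack.pop() {
            if self.skip_raw && cid.codec == RAW_CODEC {
                continue; // raw blocks are guaranteed leaves, skip if filtered
            }
            out.push(cid.clone());
            if self.max_depth.map_or(false, |max| depth >= max) {
                continue;
            }
            if let Some(links) = store.links_for(&cid) {
                // push children in reverse so the leftmost child is visited first
                for link in links.iter().rev() {
                    stack.push((link.clone(), depth + 1));
                }
            }
        }
        out
    }
}
```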
Trying it out
The demo in iroh-experiments implements a simple unidirectional sync between two nodes. You can think of it as sendme, but for DAGs.
Generate some data
We need a car file. You can just import some directory into ipfs and then export it as a car file. Make sure to use --raw-leaves to have a more interesting DAG structure. You'll need kubo for this.
Let's use the linux kernel sources as an example. Feel free to use any directory you like.
> ipfs add -r --raw-leaves ../linux
> ipfs dag export QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV > linux.car
Import the data
Check out the iroh-dag-sync experiment, change to the iroh-dag-sync directory, and run the import command:
> cargo run --release import linux.car
...
root: QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV
This will create two databases in the current directory. dag.db contains information about the structure of the DAG, and blobs.db (a directory) contains the raw data.
Start a node that makes the data available
> cargo run --release node
I am irgkesdtbih664hq2fjgd6zf7g6mazqkr7deqzplavmwl3vdbboa
Sync with the default traversal
In a different directory, start the sync process, plugging in the CID we got from the import step, and the node we got from the node step:
> mkdir tmp
> cd tmp
> cargo run --release sync --from irgkesdtbih664hq2fjgd6zf7g6mazqkr7deqzplavmwl3vdbboa QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV
This will traverse the entire DAG in depth-first, pre-order, left-to-right order, which may take a while. But it is just a single request/response pair, so we will saturate the network connection, provided local IO on both sides can keep up, even if the connection has high latency.
Export the synced data (optional)
> cargo run --release export QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV --target output.car
Note: exporting without specifying a target just dumps the CIDs to stdout.
Advanced use
The above example syncs a large DAG using a single request/response interaction. It uses the default strategy of traversing the DAG in depth-first, pre-order, left-to-right order, which does the job if you want to sync the entire DAG.
But what if you want to do a partial sync? You can specify more complex configurations for the existing defined traversals.
Exclude leaf nodes
cargo run --release sync --from bsmlrj4sodhaivs2r7tssw4zeasqqr42lk6xt4e42ikzazkp4huq \
--traversal 'Full(root:"QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV",filter:NoRaw)'
Here we specify the traversal as a RON expression: a full traversal that filters out CIDs with format raw, which are guaranteed to be leaf nodes of the DAG.
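For reference, a RON expression like this maps naturally onto a serde-deserializable enum. The sketch below (using the serde and ron crates) only mirrors the expressions shown in this post; the names are hypothetical, not the exact types used in the experiment:

```rust
use serde::Deserialize;

/// Which CIDs to filter out of the traversal.
#[derive(Debug, Default, Deserialize)]
enum Filter {
    #[default]
    All,
    NoRaw,   // skip CIDs with format raw (guaranteed leaves)
    JustRaw, // only CIDs with format raw
}

/// Traversal configuration as sent in the request. The root is a CID string;
/// `visited` lists subtrees the requester already has.
#[derive(Debug, Deserialize)]
enum Traversal {
    Full {
        root: String,
        #[serde(default)]
        filter: Filter,
        #[serde(default)]
        visited: Vec<String>,
    },
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let t: Traversal = ron::from_str(
        r#"Full(root:"QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV",filter:NoRaw)"#,
    )?;
    println!("{t:?}");
    Ok(())
}
```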
In the case of unixfs data, the vast majority of the bytes of the DAG are in the leaf nodes, so requesting only non-leaf nodes gives us the ability to inspect the directory structure while saving a lot of bandwidth.
We could also, in a second step, request all leaves from multiple senders.
Full traversal, but exclude data for leaf nodes
cargo run --release sync --from bsmlrj4sodhaivs2r7tssw4zeasqqr42lk6xt4e42ikzazkp4huq \
--traversal 'Full(root:"QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV")' --inline NoRaw
In this case we do a full traversal, but only inline data for non-raw blocks that can possibly contain links. For all blocks of type raw we just send the corresponding BLAKE3 hash. The requester can then take that information and try to get the data itself from somewhere else. However, they have to check that the data matches the IPLD hash before updating the mapping table.
Just leaf nodes
cargo run --release sync --from bsmlrj4sodhaivs2r7tssw4zeasqqr42lk6xt4e42ikzazkp4huq \
--traversal 'Full(root:"QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV",filter:JustRaw)'
This traversal will fail unless the requester already has all non-leaf nodes. It can be used as a second step to do a complete sync if we already requested the branch nodes before.
Limiting the traversal to unknown data
cargo run --release sync --from bsmlrj4sodhaivs2r7tssw4zeasqqr42lk6xt4e42ikzazkp4huq \
--traversal 'Full(root:"QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV",visited:["bafkreifm6edrm6jidkqb4ymcbdolkancs3kmboq3eissmfg2ofcwonztgq"])'
Here we already have bafkreifm6edrm6jidkqb4ymcbdolkancs3kmboq3eissmfg2ofcwonztgq and everything below it, and want to sync only the DAG between QmWyLtd4WEJe45UBqCZG94gYY9B8qF3k4DKFX3o2bodHmV (inclusive) and bafkreifm6edrm6jidkqb4ymcbdolkancs3kmboq3eissmfg2ofcwonztgq (exclusive).
Other traversals
As mentioned above, these are just examples to illustrate the concept of deterministic traversals. For a really demanding application, you might have to write your own traversal that is specific to your application's needs and run the protocol under a custom ALPN.
Next steps
There are several parts missing to make this production-ready. The next step would be to write generators for realistic DAGs and perform benchmarks with them. In addition, it needs a concept for tagging and garbage collection, very likely similar to the approach described in this talk.
And last but not least, it would be good to have a concept for multi-party sync: requesting the same data from multiple nodes efficiently, including a stream of cancellations to avoid double downloads while maintaining maximum throughput.
To get started, take a look at our docs, dive directly into the code, or chat with us in our discord channel.