Trying to get error backtraces in Rust libraries right

by dig, b5, ramfox

Error handling in Rust is one of those topics that can spark passionate debates in the community. After wrestling with various approaches in the iroh codebase, the team has developed some insights about the current state of error handling, the tradeoffs involved, and how to get the best of both worlds.

The Great Error Handling Divide

The Rust ecosystem has largely coalesced around two main approaches to error handling:

The anyhow approach: One big generic error type that can wrap anything. It's fast to implement, gives you full backtraces, and lets you attach context easily. Perfect for applications where you mainly care about "something went wrong" and want good debugging information.

The thiserror approach: Carefully crafted enum variants for every possible error case. This gives you precise error types that consumers can match on and handle differently. It's the approach that many library authors (rightfully) prefer because it provides a stable, matchable API.
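To make the contrast concrete, here is a minimal std-only sketch of the two styles. It uses neither crate: `Box<dyn Error>` stands in for an anyhow-style opaque error, and the hand-written `PortError` enum is roughly the kind of type a thiserror derive would generate. The names `parse_port_opaque`, `parse_port_typed`, and `PortError` are hypothetical.

```rust
use std::error::Error;
use std::fmt;
use std::num::ParseIntError;

// Style 1: one opaque error type that can wrap anything.
// `Box<dyn Error>` is only a stand-in for `anyhow::Error` here.
type OpaqueError = Box<dyn Error + Send + Sync + 'static>;

fn parse_port_opaque(raw: &str) -> Result<u16, OpaqueError> {
    // `?` boxes the `ParseIntError`; callers no longer see its concrete type.
    let port: u16 = raw.trim().parse()?;
    if port == 0 {
        return Err("port must be non-zero".into());
    }
    Ok(port)
}

// Style 2: a hand-written enum that callers can match on.
#[derive(Debug, PartialEq)]
enum PortError {
    NotANumber(ParseIntError),
    Zero,
}

impl fmt::Display for PortError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PortError::NotANumber(_) => write!(f, "port is not a number"),
            PortError::Zero => write!(f, "port must be non-zero"),
        }
    }
}

impl Error for PortError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            PortError::NotANumber(err) => Some(err),
            PortError::Zero => None,
        }
    }
}

// This conversion is what lets `?` turn a `ParseIntError` into a `PortError`.
impl From<ParseIntError> for PortError {
    fn from(err: ParseIntError) -> Self {
        PortError::NotANumber(err)
    }
}

fn parse_port_typed(raw: &str) -> Result<u16, PortError> {
    let port: u16 = raw.trim().parse()?;
    if port == 0 {
        return Err(PortError::Zero);
    }
    Ok(port)
}
```

The opaque version is shorter to write, but a caller can only inspect the message or try a downcast. The typed version costs more boilerplate and gives callers variants to match on.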

Both approaches have their merits, but there's an interesting third option that's rarely discussed: the standard library's IO error model.

Can we have both?

The standard library's approach to IO errors is actually quite elegant. Instead of cramming everything into a single error type or creating hundreds of variants, it splits errors into two components:

  • Error kind: The broad category of what went wrong (permission denied, not found, etc.)
  • Error source: Additional context and the original error chain

This lets you match on the high-level error patterns while still preserving detailed information. You can write code that handles "connection refused" generically while still having access to the underlying TCP error details when needed.
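As a small sketch of that split, the code below builds a `std::io::Error` from an `ErrorKind` plus a custom inner error, then matches on the kind and only reaches for the inner error when it wants the details. `TcpDetail`, `connect`, and `describe` are hypothetical names invented for this illustration; `io::Error::new`, `kind`, and `get_ref` are the standard library calls.

```rust
use std::error::Error;
use std::fmt;
use std::io;

// Hypothetical low-level detail carried inside the `io::Error`.
#[derive(Debug)]
struct TcpDetail {
    port: u16,
}

impl fmt::Display for TcpDetail {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "nothing is listening on port {}", self.port)
    }
}

impl Error for TcpDetail {}

// Hypothetical stand-in for a real connection attempt; it always fails.
fn connect(port: u16) -> io::Result<()> {
    Err(io::Error::new(
        io::ErrorKind::ConnectionRefused,
        TcpDetail { port },
    ))
}

fn describe(err: &io::Error) -> String {
    // Match on the broad category first...
    match err.kind() {
        io::ErrorKind::ConnectionRefused => {
            // ...and look at the wrapped error only when the details matter.
            match err
                .get_ref()
                .and_then(|inner| inner.downcast_ref::<TcpDetail>())
            {
                Some(detail) => format!("refused: {detail}"),
                None => "refused".to_string(),
            }
        }
        io::ErrorKind::NotFound => "not found".to_string(),
        _ => "unexpected I/O error".to_string(),
    }
}
```

Callers that only care about the category never need to know that `TcpDetail` exists.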

Surprisingly, this pattern hasn't been adopted widely in other Rust libraries. It strikes a nice balance between the two extremes.

The Backtrace Problem

Here's where things get frustrating: If you want proper error handling with backtraces, you're in for a world of pain due to fundamental limitations in Rust's error handling story.

The core issue is that Rust still hasn't stabilized backtrace propagation on errors. For more context, take a look at this comment, as well as the rest of the thread.

This creates a cascade of problems:

  • anyhow can provide full backtraces because all errors are anyhow errors, and it has an extension trait that can propagate traces through the chain
  • thiserror cannot reliably provide backtraces when errors are nested, because each error type would need to know about the backtrace inside its wrapped errors

The technical limitation comes down to Rust's trait system. For the ? operator to work nicely, you want a blanket From implementation (and with it Into<YourError>) covering all error types. But this conflicts with backtrace handling, because you can only access backtraces on concrete types, not through the Error trait.

This means you get to choose: either nice ergonomics with ? or backtraces. You can't have both without significant workarounds.

To be clear, we are not criticizing the Rust maintainers; this is difficult work. But it does mean that crate authors have to make tough choices when it comes to error handling.

Enter Snafu: The Hybrid Approach

After considerable experimentation and, admittedly, some screaming at the compiler, we found a solution that works for our needs: snafu.

Snafu is essentially thiserror on steroids. It provides:

  • Enum-based error types with derive macros (like thiserror)
  • Rich context attachment and error chaining
  • Automatic backtrace capture when constructing error variants
  • Extension traits that work around Rust's limitations

The key breakthrough was figuring out how to wrap snafu errors within other snafu and non-snafu errors while preserving the full backtrace chain. This required some careful incantations to work around the Into trait conflicts, but the result is that developers can now have an IO error nested three levels deep and still get a complete backtrace.

When using snafu (in conjunction with our n0-snafu crate, more on this below), our test failures now look like this (with RUST_BACKTRACE=1):

Error: 
    0: The relay denied our authentication (not authorized)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
9 frames hidden ⋮                               
10: iroh_relay::server::tests::test_relay_access_control::{{closure}}::hd7e62eebdecb5f10
    at /iroh/iroh-relay/src/server.rs:987
21 frames hidden ⋮                              
32: iroh_relay::server::tests::test_relay_access_control::hf276e536250e2f5f
    at /iroh/iroh-relay/src/server.rs:1016
33: iroh_relay::server::tests::test_relay_access_control::{{closure}}::h72fb6babf688bbfd
    at /iroh/iroh-relay/src/server.rs:948
23 frames hidden ⋮                              

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
8 frames hidden ⋮                               
 9: <iroh_relay::client::ConnectError as core::convert::From<iroh_relay::protos::handshake::Error>>::from::h862bd832592732c4
    at /iroh/iroh-relay/src/client.rs:56
1 frame hidden ⋮                               
11: iroh_relay::client::ClientBuilder::connect::{{closure}}::h5a1014df84d149d0
    at /iroh/iroh-relay/src/client.rs:281
12: iroh_relay::server::tests::test_relay_access_control::{{closure}}::hd7e62eebdecb5f10
    at /iroh/iroh-relay/src/server.rs:986
21 frames hidden ⋮                              
34: iroh_relay::server::tests::test_relay_access_control::hf276e536250e2f5f
    at /iroh/iroh-relay/src/server.rs:1016
35: iroh_relay::server::tests::test_relay_access_control::{{closure}}::h72fb6babf688bbfd
    at /iroh/iroh-relay/src/server.rs:948
23 frames hidden ⋮                              

A lot of credit goes to eyre, on which our error formatting is based!

Our push for concrete errors with backtraces

As part of our push to 1.0, we’re transitioning to structured errors using snafu. We have started this conversion in iroh v0.90.

We’ve learned a lot so far, and there is further to go. We have some established patterns, but still need to ensure that all of our APIs follow those patterns, as well as ensure that any logging or error reporting formats the information in a way that’s easy to understand. Or, at least, as easy to understand as possible.

We also very much missed how ergonomic anyhow is to work with, especially when writing tests. We now have an n0-snafu crate that provides utilities for working with snafu and helps claw back some of this ease of use, especially when writing tests or examples.

Concrete-error writing guidelines

Here are some guidelines we’ve used while writing concrete errors.

Error enums are scoped to functions not modules

During the initial refactor of our errors to use concrete types, we leaned toward the module-level error approach. It did make the conversion simpler at first and was a good stepping stone: we didn’t have to worry as much about enum hierarchy, for example, and instead shoved everything into one enum.

For complex parts of our code, however, this soon became unwieldy.

This was especially apparent in what eventually became the iroh_relay::client::ConnectError enum. So many things can go wrong while connecting to a relay server, even before you attempt to dial it!

We quickly realized that we needed some additional hierarchy: everything that can go wrong before dialing, and the errors that occur while dialing. Hence, we have the DialError enum nested inside the ConnectError enum.
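A minimal std-only sketch of that nesting is shown below. Only the enum names `ConnectError` and `DialError` come from the text above; the variants, the `dial` and `connect` functions, and the host names are hypothetical and much simpler than the real iroh-relay types.

```rust
use std::error::Error;
use std::fmt;

// Errors that can only happen while dialing (hypothetical variants).
#[derive(Debug, PartialEq)]
enum DialError {
    Timeout,
    Refused,
}

// Errors for the whole connect operation: things that go wrong before
// dialing, plus a single variant that nests every dial failure.
#[derive(Debug, PartialEq)]
enum ConnectError {
    InvalidUrl,
    Dial(DialError),
}

impl fmt::Display for DialError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DialError::Timeout => write!(f, "dial timed out"),
            DialError::Refused => write!(f, "dial was refused"),
        }
    }
}

impl fmt::Display for ConnectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConnectError::InvalidUrl => write!(f, "invalid relay URL"),
            ConnectError::Dial(_) => write!(f, "failed to dial the relay"),
        }
    }
}

impl Error for DialError {}

impl Error for ConnectError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            ConnectError::InvalidUrl => None,
            ConnectError::Dial(err) => Some(err),
        }
    }
}

impl From<DialError> for ConnectError {
    fn from(err: DialError) -> Self {
        ConnectError::Dial(err)
    }
}

// Hypothetical dialer: the outcome depends only on the host name.
fn dial(host: &str) -> Result<(), DialError> {
    match host {
        "slow.example" => Err(DialError::Timeout),
        "closed.example" => Err(DialError::Refused),
        _ => Ok(()),
    }
}

fn connect(url: &str) -> Result<(), ConnectError> {
    // Pre-dial failure: handled by a top-level variant.
    let host = url.strip_prefix("https://").ok_or(ConnectError::InvalidUrl)?;
    // Dial failure: `?` converts `DialError` into `ConnectError::Dial`.
    dial(host)?;
    Ok(())
}
```

A caller who only cares whether dialing failed can match on `ConnectError::Dial(_)`, while a caller who wants to retry on timeouts can match one level deeper.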

Lean toward error enum names that are descriptive of the error, when logical

One positive side effect of naming each enum after its function and purpose, rather than having one giant enum for the whole module, was that the names let you understand much more quickly what kinds of things could go wrong in a function or method.

A good example of this is our ticket::ParseError enum. Previously, this was a ticket::Error. We decided ParseError was a more descriptive and logical name: the only kind of errors you can get when working with the ticket are issues that can occur when parsing the ticket: maybe it’s the wrong “kind” of ticket, maybe there are issues when serializing or deserializing, or verifying the ticket. Calling it a ParseError means that any user who looks at the API can understand the scope of things that can go wrong when using a ticket, before reading any documentation.

This came up mostly when looking at functions and methods that had simple or lower-level functionality. By contrast, the connect function mentioned above had so many possible categories of errors that calling the enum ConnectError was actually the most descriptive and accurate name we could give it.

Errors for public traits should contain a Custom variant, with helpful APIs for creating that variant

It doesn’t necessarily need to be called Custom, but for traits that folks working with iroh can implement themselves, we needed to ensure that they could use the errors associated with that trait for their own purposes.

A great example of this is our Discovery trait, which has a DiscoveryError:

/// Discovery errors
#[common_fields({
    backtrace: Option<snafu::Backtrace>,
    #[snafu(implicit)]
    span_trace: n0_snafu::SpanTrace,
})]
#[allow(missing_docs)]
#[derive(Debug, Snafu)]
#[non_exhaustive]
pub enum DiscoveryError {
    #[snafu(display("No discovery service configured"))]
    NoServiceConfigured {},
    #[snafu(display("Discovery produced no results for {}", node_id.fmt_short()))]
    NoResults { node_id: NodeId },
    #[snafu(display("Service '{provenance}' error"))]
    User {
        provenance: &'static str,
        source: Box<dyn std::error::Error + Send + Sync + 'static>,
    },
}

impl DiscoveryError {
    /// Creates a new user error from an arbitrary error type.
    pub fn from_err<T: std::error::Error + Send + Sync + 'static>(
        provenance: &'static str,
        source: T,
    ) -> Self {
        UserSnafu { provenance }.into_error(Box::new(source))
    }

    /// Creates a new user error from an arbitrary boxed error type.
    pub fn from_err_box(
        provenance: &'static str,
        source: Box<dyn std::error::Error + Send + Sync + 'static>,
    ) -> Self {
        UserSnafu { provenance }.into_error(source)
    }
}

We have some specific errors, NoServiceConfigured and NoResults, that we use in our own discovery implementations, but we also have a User error that allows someone who is implementing their own discovery trait to propagate whatever appropriate errors they need.

We also provide DiscoveryError::from_err and DiscoveryError::from_err_box to easily allow users to create whatever DiscoveryErrors they need.
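To show how such a variant is used from the implementor's side, here is a simplified std-only stand-in. It is not iroh's actual `DiscoveryError` or `Discovery` trait: `LookupError`, the `Lookup` trait, `FileLookup`, and `FileFailure` are hypothetical, and the snafu-generated pieces (backtrace, span trace, context selectors) are left out.

```rust
use std::error::Error;
use std::fmt;

type BoxError = Box<dyn Error + Send + Sync + 'static>;

// Simplified stand-in for an error type attached to a public trait.
#[derive(Debug)]
#[non_exhaustive]
enum LookupError {
    NoResults,
    // Escape hatch for errors defined by third-party implementations.
    User { provenance: &'static str, source: BoxError },
}

impl LookupError {
    // Helper so implementors do not have to box their errors by hand.
    fn from_err<T: Error + Send + Sync + 'static>(provenance: &'static str, source: T) -> Self {
        LookupError::User { provenance, source: Box::new(source) }
    }
}

impl fmt::Display for LookupError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LookupError::NoResults => write!(f, "Lookup produced no results"),
            LookupError::User { provenance, .. } => write!(f, "Service '{provenance}' error"),
        }
    }
}

impl Error for LookupError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            LookupError::NoResults => None,
            LookupError::User { source, .. } => Some(&**source),
        }
    }
}

// Hypothetical public trait that third parties can implement.
trait Lookup {
    fn resolve(&self, name: &str) -> Result<String, LookupError>;
}

// A third-party implementation with its own error type.
#[derive(Debug)]
struct FileFailure {
    path: String,
}

impl fmt::Display for FileFailure {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "could not read {}", self.path)
    }
}

impl Error for FileFailure {}

struct FileLookup;

impl Lookup for FileLookup {
    fn resolve(&self, name: &str) -> Result<String, LookupError> {
        match name {
            "" => Err(LookupError::NoResults),
            "local" => Ok("127.0.0.1".to_string()),
            other => Err(LookupError::from_err(
                "file",
                FileFailure { path: format!("/tmp/{other}") },
            )),
        }
    }
}
```

The trait's consumers still get a single error type to match on, and the implementor's own error remains reachable through `source()` or a downcast.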

The Tradeoffs Are Real

Let's be honest about the costs:

Structured errors require more work upfront. You need to think about error variants, write more boilerplate, and make decisions about error hierarchies.

Generic errors are faster to implement. When you just need to get something working, anyhow is hard to beat for velocity.

Library vs. application needs differ. Libraries benefit more from structured errors because they need stable APIs. Applications often care more about debugging information than precise error matching.

The tooling isn't perfect. Rust's error handling story has fundamental limitations that require workarounds and compromise.

n0-snafu

One of the biggest sources of frustration we faced during the conversion to concrete errors with backtraces was that we missed the ergonomics of anyhow when writing tests and examples. snafu does have its own version of anyhow::anyhow! called snafu::whatever!, but we ran into friction during tests and examples when we wanted to return any combination of whatever errors, anyhow errors, concrete errors we created in iroh, and concrete errors from other libraries we are using.

For that, we wrote n0-snafu, a utility crate that allows for working with snafu (and other types of errors) with ease. It’s not quite as ergonomic as if you were just using anyhow throughout your entire application, but again, we’ve already established that part of the game here is trade-offs.

The benefits of using n0-snafu in combination with snafu were most apparent in tests. By using n0_snafu::Result and n0_snafu::ResultExt, we could gain back some of the ease of use that we had when relying on anyhow. Here is a pared-down example of an actual test in iroh:

#[cfg(test)]
mod tests {
    // allows us to use the `.e()` and `.with_context()` methods:
    use n0_snafu::ResultExt;
    ...

    #[tokio::test]
    async fn endpoint_connect_close() -> n0_snafu::Result {
        ...
        let ep = Endpoint::builder()
            .secret_key(server_secret_key)
            .alpns(vec![TEST_ALPN.to_vec()])
            .relay_mode(RelayMode::Custom(relay_map.clone()))
            .insecure_skip_relay_cert_verify(true)
            .bind()
            // returns an `iroh::BindError`, so it can be propagated
            // with `?` without explicit conversion:
            .await?;

        let server = tokio::spawn(async move {
            info!("accepting connection");
            // returns an `Option`, which needs to be converted to
            // a `Result` using `.e()`:
            let incoming = ep.accept().await.e()?;

            // returns a `quinn::ConnectionError`, which needs to be
            // converted into an `n0_snafu::Error` using `.e()`
            // in order to use `?`:
            let conn = incoming.await.e()?;
            // same as above:
            let mut stream = conn.accept_uni().await.e()?;
            let mut buf = [0u8; 5];
            // `.with_context` allows you to add context to the error when
            // converting to an `n0_snafu::Error`:
            stream
                .read_exact(&mut buf)
                .await
                .with_context(|| format!("could not read from the stream"))?;
            ...
        });
        ...
        // check out `iroh/src/endpoint.rs` for the full test
    }
}
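For readers curious how an extension trait can offer `.e()` on both `Result` and `Option`, here is a hand-rolled std-only sketch of the pattern. It is not the n0-snafu implementation: `TestError`, this `ResultExt`, and `read_port` are hypothetical, and the sketch carries no backtrace or span trace.

```rust
use std::error::Error;
use std::fmt;

// Hypothetical catch-all error for tests; not the real `n0_snafu::Error`.
#[derive(Debug)]
struct TestError {
    context: Option<String>,
    source: Option<Box<dyn Error + Send + Sync + 'static>>,
}

impl fmt::Display for TestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match (&self.context, &self.source) {
            (Some(context), _) => write!(f, "{context}"),
            (None, Some(source)) => write!(f, "{source}"),
            (None, None) => write!(f, "expected a value, found None"),
        }
    }
}

// Hypothetical extension trait with the same method names as in the test above.
trait ResultExt<T> {
    fn e(self) -> Result<T, TestError>;
    fn with_context<F: FnOnce() -> String>(self, f: F) -> Result<T, TestError>;
}

// Any `Result` whose error implements `Error` can be converted explicitly.
impl<T, E: Error + Send + Sync + 'static> ResultExt<T> for Result<T, E> {
    fn e(self) -> Result<T, TestError> {
        self.map_err(|err| TestError { context: None, source: Some(Box::new(err)) })
    }

    fn with_context<F: FnOnce() -> String>(self, f: F) -> Result<T, TestError> {
        self.map_err(|err| TestError { context: Some(f()), source: Some(Box::new(err)) })
    }
}

// `Option` gets the same methods, turning `None` into an error.
impl<T> ResultExt<T> for Option<T> {
    fn e(self) -> Result<T, TestError> {
        self.ok_or(TestError { context: None, source: None })
    }

    fn with_context<F: FnOnce() -> String>(self, f: F) -> Result<T, TestError> {
        self.ok_or_else(|| TestError { context: Some(f()), source: None })
    }
}

fn read_port(input: Option<&str>) -> Result<u16, TestError> {
    // `Option` -> `Result` via `.e()`, then `?`.
    let raw = input.e()?;
    // Foreign error type -> `TestError`, with extra context attached.
    let port = raw
        .parse::<u16>()
        .with_context(|| format!("could not parse {raw:?} as a port"))?;
    Ok(port)
}
```

The explicit `.e()` call is the price of not having a blanket `From` conversion: the conversion happens through a method rather than implicitly inside `?`.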

Looking Forward

There is a lot of pressure from the Rust community to ensure that all libraries return concrete errors. This is not misguided: structured errors do provide real benefits for library APIs and error handling. But the pragmatic reality is that different projects have different needs.

For the iroh project, the hybrid approach is working well:

  • Use structured errors for public APIs where consumers need to handle different cases
  • Preserve rich context and backtraces for debugging
  • Accept that some boilerplate is the cost of precise error handling

The error handling landscape in Rust is still evolving. Until backtrace propagation is stabilized and the ergonomics improve, teams will keep making tradeoffs. The key is being intentional about those tradeoffs rather than letting dogma drive technical decisions.

What matters most is choosing an approach that serves the project's needs—whether that's the simplicity of anyhow, the precision of thiserror, or something in between. The perfect error handling system doesn't exist, but good-enough error handling that ships is infinitely better than perfect error handling that never gets implemented.
