Stop Forwarding Errors, Start Designing Them

It’s 3am. Production is down. You’re staring at a log line that says:

Error: serialization error: expected ',' or '}' at line 3, column 7

You know JSON is broken. But you have zero idea why, where, or who caused it. Was it the config loader? The user API? The webhook consumer?

The error has successfully bubbled up through 20 layers of your stack, preserving its original message perfectly, yet losing every scrap of meaning along the way.

We have a name for this. We call it “Error Handling.” But in reality, it’s just Error Forwarding. We treat errors like hot potatoes—catch them, wrap them (maybe), and throw them up the stack as fast as possible.

You add a println!, restart the service, wait for the bug to reproduce. It’s going to be a long night.

As noted in a detailed analysis of error handling in a large Rust project:

“There’re tons of opinionated articles or libraries promoting their best practices, leading to an epic debate that never ends. We were all starting to notice that there was something wrong with the error handling practices, but pinpointing the exact problems is challenging.”


What’s Wrong with Current Practices

The std::error::Error Trait: A Noble but Flawed Abstraction

The standard Error trait is built around source(): one error optionally points to another. That matches a lot of failures.

But some of the nastiest problems aren’t a single line of causality. Validation can fail in five places at once. A batch operation can partially succeed. Timeouts can come with partial results. Those want something closer to a set or a tree of causes, not a single chain.
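To make that concrete, here is a minimal sketch (ValidationError is a made-up type for illustration) of an error that genuinely has several causes, and what the single-source() contract forces you to do with them:

use std::error::Error;
use std::fmt;

// Hypothetical: a validation pass that can fail in several places at once.
#[derive(Debug)]
struct ValidationError {
    failures: Vec<Box<dyn Error>>,
}

impl fmt::Display for ValidationError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} field(s) failed validation", self.failures.len())
    }
}

impl Error for ValidationError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // source() can surface at most one cause; every other failure in the
        // set is invisible to callers walking the standard error chain.
        self.failures.first().map(|e| &**e)
    }
}

Anything generic that walks source(), such as a logger or anyhow's chain printing, sees one cause and silently drops the rest.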

Backtraces: Expensive Medicine for the Wrong Disease

Rust’s std::backtrace::Backtrace was meant to improve error observability, and it’s better than nothing. But backtraces have serious limitations:

In async code, they can be noisy or misleading. Your backtrace will contain 49 stack frames, of which 12 are calls to GenFuture::poll(). The Async Working Group notes that suspended tasks are invisible to traditional stack traces.

They only show the origin, not the path. A backtrace tells you where the error was created, not the logical path it took through your application. It won’t tell you “this was the request handler for user X, calling service Y, with parameters Z.”

Capturing backtraces is expensive. The standard library documentation acknowledges: “Capturing a backtrace can be a quite expensive runtime operation.”
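To put the cost in context, the usual pattern is to capture the backtrace once, when the error is constructed, and rely on Backtrace::capture() staying disabled unless RUST_BACKTRACE (or RUST_LIB_BACKTRACE) is set. A minimal sketch, with ParseError as an illustrative stand-in type:

use std::backtrace::Backtrace;

#[derive(Debug)]
struct ParseError {
    message: String,
    // Captured once, at construction. Backtrace::capture() returns a cheap,
    // disabled backtrace unless RUST_BACKTRACE / RUST_LIB_BACKTRACE is set;
    // Backtrace::force_capture() would always pay the full cost.
    backtrace: Backtrace,
}

impl ParseError {
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            backtrace: Backtrace::capture(),
        }
    }
}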

The Provide/Request API: Overengineering in Action

The Provider API (RFC 3192) and generic member access (RFC 2895) add dynamic type-based data access to errors:

fn provide<'a>(&'a self, request: &mut Request<'a>) {
    request.provide_ref::<Backtrace>(&self.backtrace);
}

The unstable Provide/Request API represents the latest attempt to make errors more flexible. The idea: errors can dynamically provide typed context (like HTTP status codes or backtraces) that callers can request at runtime. In practice, it introduces new problems:

Unpredictability: Your error might provide an HTTP status code. Or it might not. You won’t know until runtime.

Complexity: The API is subtle enough that LLVM struggles to optimize multiple provide calls.

Most of the time, a boring struct with named fields is still the thing you want.
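For contrast, here is roughly what the two styles look like from the caller's side (StatusCode and ApiError are illustrative stand-ins, and request_ref belongs to the unstable error_generic_member_access feature, so the exact path may change before stabilization): one is a runtime question that may answer None, the other is a field the compiler guarantees.

// Hypothetical status type, standing in for something like http::StatusCode.
#[derive(Clone, Copy, Debug)]
struct StatusCode(u16);

// Provide/Request style (nightly only): ask at runtime, maybe get an answer.
// let status: Option<&StatusCode> = std::error::request_ref::<StatusCode>(&err);

// Boring-struct style: the field either exists or the code doesn't compile.
#[derive(Debug)]
struct ApiError {
    status: StatusCode,
    message: String,
}

fn handle(err: &ApiError) -> StatusCode {
    err.status
}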

thiserror: Categorizing by Origin, Not by Action

thiserror makes it easy to define error enums:

#[derive(Debug, thiserror::Error)]
pub enum DatabaseError {
    #[error("connection failed: {0}")]
    Connection(#[from] ConnectionError),
    #[error("query failed: {0}")]
    Query(#[from] QueryError),
    #[error("serialization failed: {0}")]
    Serde(#[from] serde_json::Error),
}

This looks reasonable. But notice how this common practice categorizes errors: by origin, not by what the caller can do about it.

When you receive a DatabaseError::Query, what should you do? Retry? Report raw SQL to the user? The error doesn’t tell you. It just tells you which dependency failed.

As one blogger aptly put it: “This error type does not tell the caller what problem you are solving but how you solve it.”
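To see the problem from the call site, here is a sketch of what handling the origin-based enum above tends to look like (Row, retry_later, and report_bug are hypothetical stand-ins): every arm encodes a guess about intent, because the error only records which dependency failed.

fn handle(result: Result<Vec<Row>, DatabaseError>) -> Result<Vec<Row>, DatabaseError> {
    match result {
        Ok(rows) => Ok(rows),
        // Transient? Probably? We're guessing from the origin.
        Err(DatabaseError::Connection(e)) => retry_later(e),
        // Our SQL's fault or the user's input? The type can't say.
        Err(DatabaseError::Query(e)) => report_bug(e),
        // Serialization of *what*? A row? A config blob? Also a guess.
        Err(DatabaseError::Serde(e)) => report_bug(e),
    }
}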

anyhow: So Convenient You’ll Forget to Add Context

anyhow takes the opposite approach: type erasure. Just use anyhow::Result<T> everywhere and propagate with ?. No more enum variants, no more #[from] annotations.

The problem is that it’s too convenient.

fn process_request(req: Request) -> anyhow::Result<Response> {
    let user = db.get_user(req.user_id)?;
    let data = fetch_external_api(user.api_key)?;
    let result = compute(data)?;
    Ok(result)
}

Every ? is a missed opportunity to add context. What was the user ID? What API were we calling? What computation failed? The error knows none of this.

The anyhow documentation encourages using .context() to add information. But .context() is optional—the type system doesn’t require it. And “I’ll add context later” is the easiest lie to tell yourself.
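For comparison, here is the same function with context added via the anyhow::Context extension trait (the specific messages are invented for illustration). Nothing enforces this: every .with_context() call is a choice the compiler will never ask you to make.

use anyhow::Context;

fn process_request(req: Request) -> anyhow::Result<Response> {
    let user = db.get_user(req.user_id)
        .with_context(|| format!("loading user {}", req.user_id))?;
    let data = fetch_external_api(user.api_key)
        .with_context(|| format!("calling external API for user {}", req.user_id))?;
    let result = compute(data)
        .context("computing the final result")?;
    Ok(result)
}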


The Problem: Error Handling Without Purpose

Consider this common pattern in Rust codebases:

#[derive(thiserror::Error, Debug)]
pub enum ServiceError {
    #[error("database error: {0}")]
    Database(#[from] sqlx::Error),
    #[error("http error: {0}")]
    Http(#[from] reqwest::Error),
    #[error("serialization error: {0}")]
    Serde(#[from] serde_json::Error),
    // ... ten more variants
}

It looks neat, well-structured, and it compiles. But pause and ask: when a ServiceError::Database comes back, what is the caller actually supposed to do with it?

This is the fundamental disconnect in how we think about error handling. We focus on propagating errors exactly, on making the types line up. But we forget that errors are messages—messages that will eventually be read by either a machine trying to recover, or a human trying to debug.

The “Library vs Application” Myth

You’ve probably heard the conventional wisdom: “Use thiserror for libraries, anyhow for applications.”

It’s a nice, simple rule, just not quite right. As Luca Palmieri notes: “It is not the right framing. You need to reason about intent.”

The real question isn’t whether you’re writing a library or an application. The real question is: what do you expect the caller to do with this error?

Two Audiences, Two Needs

Audience | Goal               | Needs
Machines | Automated recovery | Flat structure, clear error kinds, predictable codes
Humans   | Debugging          | Rich context, call path, business-level information

Most error handling designs optimize for neither. They optimize for the compiler.

For Machines: Flat, Actionable, Kind-Based

When errors need to be handled programmatically, complexity is the enemy. Your retry logic doesn’t want to traverse a nested error chain checking for specific variants. It wants to ask: is_retryable()?

Apache OpenDAL’s error design shows one way to do this:

pub struct Error {
    kind: ErrorKind,
    message: String,
    status: ErrorStatus,
    operation: &'static str,
    context: Vec<(&'static str, String)>,
    source: Option<anyhow::Error>,
}

pub enum ErrorKind {
    NotFound,
    PermissionDenied,
    RateLimited,
    // ... categorized by what the caller CAN DO
}

pub enum ErrorStatus {
    Permanent,  // Don't retry
    Temporary,  // Safe to retry
    Persistent, // Was retried, still failing
}

Then the call site stays straightforward:

match result {
    Err(e) if e.kind() == ErrorKind::RateLimited && e.is_temporary() => {
        sleep(Duration::from_secs(1)).await;
        retry().await
    }
    Err(e) if e.kind() == ErrorKind::NotFound => {
        create_default().await
    }
    Err(e) => return Err(e),
    Ok(v) => v,
}

A few things to note:

ErrorKind is categorized by response, not origin. NotFound means “the thing doesn’t exist, don’t retry.” RateLimited means “slow down and try again.” The caller doesn’t need to know whether it was an S3 404 or a filesystem ENOENT—they need to know what to do about it.

ErrorStatus is explicit. Instead of guessing retryability from error types, it’s a first-class field. Services can mark errors as temporary when they know a retry might help.

One Error type per library. Instead of scattering error enums across modules, a single flat structure keeps things simple. The context field provides all the specificity you need without type proliferation.

No more traversing error chains, no more guessing from error types. Just ask the error directly.
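For completeness, here is a sketch of the accessors that make that call site work, written against the flat struct above. It mirrors the shape of OpenDAL's design rather than copying its exact API, and assumes ErrorKind derives Clone, Copy, PartialEq and ErrorStatus derives PartialEq.

impl Error {
    pub fn kind(&self) -> ErrorKind {
        self.kind
    }

    pub fn is_temporary(&self) -> bool {
        self.status == ErrorStatus::Temporary
    }

    // A retry layer calls this after giving up: "was retried, still failing".
    pub fn set_persistent(mut self) -> Self {
        self.status = ErrorStatus::Persistent;
        self
    }

    // Attach key/value context as the error travels upward, with no new types.
    pub fn with_context(mut self, key: &'static str, value: impl Into<String>) -> Self {
        self.context.push((key, value.into()));
        self
    }
}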

For Humans: Low-Friction Context Capture

The biggest enemy of good error context isn’t capability—it’s friction. If adding context is annoying, developers won’t do it.

The exn library (294 lines of Rust, zero dependencies) demonstrates one approach: errors form a tree of frames, each automatically capturing its source location via #[track_caller]. Unlike linear error chains, trees can represent multiple causes—useful when parallel operations fail or validation produces multiple errors.

The key ingredients:

Automatic location capture. Instead of expensive backtraces, use #[track_caller] to capture file/line/column at zero cost. Every error frame should know where it was created.
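The mechanism is std's #[track_caller] plus std::panic::Location. A minimal sketch of a frame that records where it was raised (the field names are illustrative, not exn's internals):

use std::panic::Location;

#[derive(Debug)]
struct ErrorFrame {
    message: String,
    location: &'static Location<'static>,
}

impl ErrorFrame {
    // #[track_caller] makes Location::caller() report the *caller's*
    // file/line/column, passed as a hidden argument at the call site.
    // No stack walking, so it is essentially free.
    #[track_caller]
    fn new(message: impl Into<String>) -> Self {
        Self {
            message: message.into(),
            location: Location::caller(),
        }
    }
}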

Ergonomic context addition. The API for adding context should be so natural that not adding it feels wrong:

fetch_user(user_id)
    .or_raise(|| AppError(format!("failed to fetch user {user_id}")))?;

Compare this to thiserror, where adding the same context requires defining a new variant and manual wrapping:

#[derive(thiserror::Error, Debug)]
pub enum AppError {
    #[error("failed to fetch user {user_id}: {source}")]
    FetchUser {
        user_id: String,
        #[source]
        source: DbError,
    },
    // ... one variant per call site that needs context
}

fn fetch_user(user_id: &str) -> Result<User, AppError> {
    db.query(user_id).map_err(|e| AppError::FetchUser {
        user_id: user_id.to_string(),
        source: e,
    })
}

Enforce context at module boundaries. This is where exn differs critically from anyhow. With anyhow, every error is erased to anyhow::Error, so you can always use ? and move on—the type system won’t stop you. The context methods exist, but nothing prevents you from ignoring them.

exn takes a different approach: Exn<E> preserves the outermost error type. If your function returns Result<T, Exn<ServiceError>>, you can’t directly ? a Result<U, Exn<DatabaseError>>—the types don’t match. The compiler forces you to call or_raise() and provide a ServiceError, which is exactly the moment you should be adding context about what your module was trying to do.

// This won't compile--type mismatch forces you to add context
pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
    let user = db.query(user_id)?; // Error: expected Exn<ServiceError>, found Exn<DbError>
    Ok(user)
}

// You must provide context at the boundary
pub fn fetch_user(user_id: &str) -> Result<User, Exn<ServiceError>> {
    let user = db.query(user_id)
        .or_raise(|| ServiceError(format!("failed to fetch user {user_id}")))?; // Now it compiles
    Ok(user)
}

The type system becomes your ally: it won’t let you be lazy at module boundaries.

In practice:

pub async fn execute(&self, task: Task) -> Result<Output, Exn<ExecutorError>> {
    let make_error = || ExecutorError(format!("failed to execute task {}", task.id));
    let user = self.fetch_user(task.user_id)
        .await
        .or_raise(make_error)?;
    let result = self.process(user)
        .or_raise(make_error)?;
    Ok(result)
}

Every ? has context. When this fails at 3am, instead of the cryptic serialization error, you see:

failed to execute task 7829, at src/executor.rs:45:12
|
|-> failed to fetch user "John Doe", at src/executor.rs:52:10
|
|-> connection refused, at src/client.rs:89:24

Putting It Together

In real systems, you often need both: machine-readable errors for automated recovery, and human-readable context for debugging. The pattern: use a flat, kind-based error type (like Apache OpenDAL’s) for the structured data, and wrap it in a context-tracking mechanism for propagation.

// Machine-oriented: flat struct with kind and status
pub struct StorageError {
    pub kind: ErrorKind,
    pub status: ErrorStatus,
    pub message: String,
}

// Human-oriented: propagate with context at each layer
pub async fn save_document(doc: Document) -> Result<(), Exn<StorageError>> {
    let data = serialize(&doc)
        .or_raise(|| StorageError::permanent("serialization failed"))?;
    storage.write(&doc.path, data)
        .await
        .or_raise(|| StorageError::temporary("write failed"))?;
    Ok(())
}

At the boundary, walk the error tree to find the structured error:

// Extract a typed error from anywhere in the tree
fn find_error<T: 'static>(exn: &Exn<impl Error>) -> Option<&T> {
    fn walk<T: 'static>(frame: &Frame) -> Option<&T> {
        if let Some(e) = frame.as_any().downcast_ref::<T>() {
            return Some(e);
        }
        frame.children().iter().find_map(walk)
    }
    walk(exn.as_frame())
}

match save_document(doc).await {
    Ok(()) => Ok(()),
    Err(report) => {
        // For humans: log the full context tree
        log::error!("{:?}", report);
        // For machines: find and handle the structured error
        if let Some(err) = find_error::<StorageError>(&report) {
            if err.status == ErrorStatus::Temporary {
                return queue_for_retry(report);
            }
            return Err(map_to_http_status(err.kind));
        }
        Err(StatusCode::INTERNAL_SERVER_ERROR)
    }
}

You do have to walk the tree—but compare that to the Provide/Request API. Here you’re searching for a concrete type, like StorageError: it has named fields, it’s documented, and your IDE can autocomplete it. No guesswork, no runtime surprises—just a well-defined struct you can understand and maintain.


Closing thought

Propagating errors is easy in Rust. Explaining them is the part we tend to postpone.

Next time you return a Result, take 30 seconds to ask: “If this fails in production, what would I wish the log said?” Then make it say that.
