Why Observability Matters for MCP Servers
MCP servers are black boxes. When Claude calls your tools, you have no visibility into what’s happening. A user reports “Claude can’t read my file” – but which file? When did it fail? How often? Without observability, you’re debugging blind.
Consider a simple error. Your read_file tool tries to read a non-existent file. Without observability, you get a single log line:
Error executing tool read_file: ENOENT: no such file or directory
That’s it. No context, no frequency, no pattern detection. You don’t know if this is affecting one user or hundreds.
With observability, that same error becomes actionable intelligence. Sentry captures the full stack trace with the exact file path, user session, and breadcrumbs showing what led to the error. OpenTelemetry creates a distributed trace in Jaeger showing the complete request flow, from Claude’s call through your server to the exact filesystem operation that failed. You see not just that it failed, but when, how often, and what else was happening at the time.
What Are The Challenges of MCP Development?
MCP servers face unique debugging challenges that traditional APIs don’t:
- Indirect Invocation: You don’t control when or how Claude calls your tools. A user might ask Claude to “organize my project files,” triggering a cascade of file operations you never tested together.
- Variable Load Patterns: One moment your server is idle, the next it’s processing hundreds of tool calls as Claude works through a complex task. Without metrics, you won’t know if that timeout was a one-off or happening every time.
- Silent Failures: When a tool fails, Claude might gracefully work around it or hallucinate a response. Users may not even know something went wrong until much later.
- Performance Mysteries: A user complains that “Claude is slow when working with my files.” Is it your read_file tool taking 5 seconds? Network latency? Claude itself? Without tracing, you’re guessing.
Let’s see the difference observability makes with a real example. Here’s what happens when the read_file tool tries to read a non-existent file:
// Your MCP server logs (if you're lucky):
Error executing tool read_file: ENOENT: no such file or directory
That’s it. One line in a log file somewhere. You don’t know:
- What file was it trying to read?
- Who requested it?
- When did this happen?
- Has it happened before?
- What was Claude trying to accomplish?
Here’s what the user sees in Claude:
I encountered an error reading the file. Let me try another approach...
The error is swallowed, the context is lost, and you have no idea this even happened unless someone complains.
With observability added, Sentry gives us a report like this:
Sample report from Sentry
That’s quite a lot more than a single line in a log file. By adding observability, we now have:
- The exact file path that failed: /definitely/not/a/real/file.txt, not just “file not found”
- Error frequency: This happened 2 times in the last 5 hours, not a one-off issue
- Full stack trace: Pinpoints line 15 in fileOperations.ts where fs.stat(path) failed
- User context: Request came from Charles Town, United States, using Node v23.11.0 on macOS
- Breadcrumbs: Shows “Starting read_file” was logged at 04:43:43 PM before the failure
- System state: 476.9 MB memory usage at time of error
- Patterns: Visual timeline shows if this error is increasing, decreasing, or sporadic
This is actionable intelligence. You know it’s a recurring issue (not a fluke), affecting real users (with geographic distribution), with exact reproduction steps (the file path), and full system context (Node version, OS, memory state). You could reproduce this error immediately, understand its impact, and track whether your fix resolves it.
That’s the power of observability versus a single log line!
How Do You Add Observability to MCP Servers?
How do we get this information? We instrumented our MCP server in two ways:
- Errors (via Sentry) tell you what broke. Not just “file not found” but which file, for which user, how many times, and whether it’s getting worse. The example code automatically captures every exception with full context.
- Traces (via OpenTelemetry/Jaeger) tell you how requests flow through your system. You can follow a request from Claude through every function call, seeing where time is spent and where failures occur. A slow response? The trace shows you exactly which operation is the bottleneck.
The naive approach to adding observability would be to sprinkle monitoring code throughout every tool. You’d add Sentry error tracking here, OpenTelemetry spans there, metrics somewhere else. Your business logic would drown in instrumentation code, and you’d inevitably miss spots or handle them inconsistently.
Instead, our architecture uses a single wrapper pattern, withObservability, that automatically instruments any function. Here’s how we transform a tool:
// Before: Just business logic
import { promises as fs } from "fs";

interface ReadFileArgs {
  path: string;
  encoding?: BufferEncoding;
}

async function readFile(args: ReadFileArgs) {
  const stats = await fs.stat(args.path);
  if (!stats.isFile()) {
    throw new Error(`Path ${args.path} is not a file`);
  }
  const content = await fs.readFile(args.path, args.encoding);
  return { path: args.path, content, size: stats.size };
}
// After: Same logic, now with complete observability
export const readFile = withObservability(
"read_file",
async (args: ReadFileArgs) => {
// Exact same business logic, untouched
const stats = await fs.stat(args.path);
if (!stats.isFile()) {
throw new Error(`Path ${args.path} is not a file`);
}
const content = await fs.readFile(args.path, args.encoding);
return { path: args.path, content, size: stats.size };
},
{ tool_type: "file_operation" } // Context for filtering
);
The beauty of this pattern is that your tool doesn’t know it’s being monitored. The wrapper intercepts execution, starts traces, captures errors, records metrics, and cleans up - all transparently. If you add a new observability system tomorrow, you update one wrapper function, not dozens of tools.
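Stripped of any particular backend, the pattern is just a higher-order function. Here’s a minimal, backend-agnostic sketch (the Hooks type and withHooks name are ours, not from the repository):

```typescript
// A backend-agnostic sketch of the wrapper pattern: time the call,
// report success or failure to pluggable hooks, and re-throw errors.
type Hooks = {
  onStart?: (name: string) => void;
  onEnd?: (name: string, ms: number, error?: Error) => void;
};

export function withHooks<T extends any[], R>(
  name: string,
  fn: (...args: T) => Promise<R>,
  hooks: Hooks = {}
) {
  return async (...args: T): Promise<R> => {
    const start = Date.now();
    hooks.onStart?.(name);
    try {
      const result = await fn(...args);
      hooks.onEnd?.(name, Date.now() - start);
      return result;
    } catch (error) {
      hooks.onEnd?.(name, Date.now() - start, error as Error);
      throw error; // never swallow errors
    }
  };
}
```

The real withObservability plugs Sentry and OpenTelemetry calls into exactly these seams.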
How Do You Set Up Sentry for Error Intelligence?
Sentry transforms cryptic errors into actionable intelligence. Let’s walk through how it’s integrated into our MCP server.
First, the initialization in src/observability/sentry.ts:
import * as Sentry from "@sentry/node";

export function initSentry() {
  const dsn = process.env.SENTRY_DSN;
  if (!dsn) {
    console.warn("SENTRY_DSN not configured, skipping Sentry initialization");
    return;
  }
  Sentry.init({
    dsn,
    environment: process.env.SENTRY_ENVIRONMENT || "development",
    tracesSampleRate: 1.0, // Capture 100% of transactions for development
  });
}
The DSN (Data Source Name) is your project’s unique identifier in Sentry. It looks like https://abc123@o456.ingest.sentry.io/789 and tells the SDK where to send errors. Without it, Sentry gracefully skips initialization - your server still works, just without error tracking.
The magic happens inside our withObservability wrapper:
export function withObservability<T extends any[], R>(
name: string,
fn: (...args: T) => Promise<R>,
context?: Record<string, any>
) {
return async (...args: T): Promise<R> => {
const sentryTransaction = startTransaction(name, "function");
try {
// Add breadcrumb for debugging timeline
addBreadcrumb(`Starting ${name}`, "execution", context);
// Execute the actual tool
const result = await fn(...args);
// Mark transaction as successful
sentryTransaction?.setStatus("ok");
return result;
} catch (error) {
// Capture error with full context
captureError(error as Error, {
function: name,
args, // Include the actual arguments that caused the error
...context,
});
sentryTransaction?.setStatus("internal_error");
throw error;
} finally {
sentryTransaction?.finish();
}
};
}
Notice how we capture not just the error, but the entire context - function name, arguments, and any additional metadata. This is what transforms “ENOENT” into “read_file failed for /definitely/not/a/real/file.txt”.
When our read_file tool hits that non-existent file, here’s the cascade of events:
- The fs.stat() call throws an ENOENT error
- Our wrapper catches it and calls captureError()
- Sentry receives the error with full context
- The error is re-thrown (important: we don’t swallow errors)
But Sentry is only half the story. It tells you what broke, but not how the request flowed through your system. For that, we need distributed tracing.
How Can You Use OpenTelemetry and Jaeger to Trace MCP Tool Use?
While Sentry captures errors, OpenTelemetry (OTel) traces show the complete request journey. Our implementation exports traces to Jaeger for visualization.
Here’s how we set it up in src/observability/opentelemetry.ts:
import { NodeSDK } from "@opentelemetry/sdk-node";
import { JaegerExporter } from "@opentelemetry/exporter-jaeger";
import { PrometheusExporter } from "@opentelemetry/exporter-prometheus";
import { Resource } from "@opentelemetry/resources";
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

let sdk: NodeSDK;

export function initOpenTelemetry() {
  const serviceName =
    process.env.OTEL_SERVICE_NAME || "mcp-observability-server";
  const serviceVersion = process.env.OTEL_SERVICE_VERSION || "1.0.0";
  const jaegerEndpoint =
    process.env.OTEL_EXPORTER_JAEGER_ENDPOINT ||
    "http://localhost:14268/api/traces";

  // Configure Jaeger exporter for traces
  const jaegerExporter = new JaegerExporter({
    endpoint: jaegerEndpoint,
  });

  // Configure Prometheus exporter for metrics
  const prometheusExporter = new PrometheusExporter({
    port: 9464,
  });

  // Create and configure the SDK
  sdk = new NodeSDK({
    resource: new Resource({
      [SEMRESATTRS_SERVICE_NAME]: serviceName,
      [SEMRESATTRS_SERVICE_VERSION]: serviceVersion,
    }),
    traceExporter: jaegerExporter,
    metricReader: prometheusExporter,
    instrumentations: [getNodeAutoInstrumentations()],
  });

  sdk.start();
}
The getNodeAutoInstrumentations() automatically instruments Node.js built-ins like fs and http. This means we get spans for filesystem operations and HTTP calls without writing any code.
In our wrapper, OpenTelemetry creates a span that wraps the entire tool execution:
return tracer.startActiveSpan(name, async (span) => {
try {
span.setAttributes({
"function.name": name,
...context, // tool_type, etc.
});
const result = await fn(...args);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
recordException(span, error as Error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
throw error;
} finally {
span.end();
}
});
When read_file fails, the trace in Jaeger tells a different story than Sentry:
Failed read_file tool call trace
You can click into any span for details:
- Tags: function.name: read_file, tool_type: file_operation, error: true
- Logs: Exception event with full stack trace
- Process: Node.js version, OS, service name
Now let’s see what success looks like. When the send_webhook tool successfully posts data:
Successful tool call from send_webhook
This trace reveals performance insights, such as how long the send_webhook call took (437.53 ms).
The performance bottleneck is clearly the external service, not our code. Without tracing, you might waste time optimizing the wrong thing.
The true power emerges when you connect Sentry and Jaeger. Notice in our Sentry error report, there’s a trace ID: 77c89a4bc8d544f4ecd71872c4f7b506. Click it, and you jump directly to the trace for that exact error.
Now you have both perspectives:
- Sentry: What broke, how often, who’s affected
- Jaeger: How the request flowed, where time was spent, what succeeded before the failure
This bi-directional integration happens automatically because both systems receive the same trace context from our wrapper.
Running the Stack Locally
To see this in action, spin up Jaeger with Docker:
docker run -d --name jaeger \
-p 16686:16686 \
-p 14268:14268 \
jaegertracing/all-in-one:latest
Configure your .env:
SENTRY_DSN=https://your-key@sentry.io/your-project
OTEL_SERVICE_NAME=my-mcp-server
OTEL_EXPORTER_JAEGER_ENDPOINT=http://localhost:14268/api/traces
Now every tool call generates rich telemetry. An error triggers alerts in Sentry while successful calls build performance baselines in Jaeger. You’re no longer flying blind - every interaction with Claude leaves a trail of actionable data.
What Are The Next Steps For MCP Server Observability?
We’ve covered error tracking and distributed tracing, but this is just the foundation. Here are the next steps for MCP observability:
Structured Logging with Winston
Our current implementation uses basic log statements. The next evolution is structured logging with a library like Winston:
import winston from "winston";
const logger = winston.createLogger({
level: process.env.LOG_LEVEL || "info",
format: winston.format.json(),
defaultMeta: { service: "mcp-server" },
transports: [
new winston.transports.File({ filename: "error.log", level: "error" }),
new winston.transports.File({ filename: "combined.log" }),
new winston.transports.Console({
format: winston.format.simple(),
}),
],
});
// Now instead of console.log, you get rich structured logs:
logger.info("Tool executed", {
tool: "read_file",
duration: 45,
path: "/config/settings.json",
correlationId: "abc-123",
userId: "user@example.com",
});
Structured logs are queryable. You can search for all logs from a specific user, all failures of a particular tool, or trace a request across multiple services using correlation IDs. When combined with log aggregation services like Datadog or CloudWatch, you can build dashboards, set up alerts, and debug complex issues that span multiple tool invocations.
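For example, with jq against the JSON-lines output (the field names follow the logger.info call above):

```shell
# All read_file executions slower than 100 ms
jq 'select(.tool == "read_file" and .duration > 100)' combined.log

# Everything a single user did, in order
jq 'select(.userId == "user@example.com")' combined.log
```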
AI-Specific Metrics and Evals
Traditional observability tells you if a tool worked, but not if it was useful. Tools like Braintrust add an AI evaluation layer. Imagine tracking:
- Tool choice patterns - which combinations of tools does Claude use for different tasks?
- Success scoring - did the tool output actually help complete the user’s request?
- A/B testing - compare different tool implementations to see which performs better
- Cost tracking - monitor token usage and API costs per tool invocation
This transforms observability from “did it work?” to “did it work well?”
Performance Profiling
Our current setup tracks timing, but not resource usage. The next level adds:
- CPU and memory profiling per tool execution
- Detecting memory leaks in long-running MCP servers
- Identifying N+1 query patterns when tools make database calls
- Heat maps showing which code paths are hot spots
The Node.js --inspect flag combined with tools like Clinic.js can reveal performance issues that traces alone might miss.
Client Attribution
Right now, all requests look the same. But adding client attribution would let you:
- Track which Claude Projects or conversations generate the most load
- Identify power users who might need dedicated resources
- Debug issues specific to certain client configurations
- Build usage analytics to understand how your tools are actually used
This requires passing client context through the MCP protocol, perhaps as custom headers or attributes.
MCP Inspector: The Gateway Drug
If all this feels overwhelming, start with MCP Inspector. It’s the “Chrome DevTools” for MCP servers - no configuration, instant visibility. Launch it against your server with a single command:
npx @modelcontextprotocol/inspector node path/to/your-server.js
Open the URL it prints (e.g. http://localhost:5173) and watch every request and response in real time. It won’t give you historical data or alerting, but it’s perfect for development and debugging. Think of it as training wheels - once you see the value of observability, you’ll naturally want the full stack.
Observability: The Foundation of Reliable AI Systems
Observability allows you to build AI systems you can trust in production. MCP servers are the bridge between AI and your systems. They’re too critical to run blind. With proper observability, you can:
- Fix issues before users report them
- Optimize the right bottlenecks
- Understand how Claude actually uses your tools
- Build confidence that your AI integrations work reliably
The pattern we’ve explored, wrapping tools with automatic instrumentation, scales well. Start with one wrapper function. Add Sentry for errors. Layer in OpenTelemetry for tracing. You can find the code in this repository.
Start simple. Add the wrapper. Deploy Sentry. Run Jaeger locally. Watch as mysteries become insights, and insights become improvements. Your future self - the one debugging production issues - will thank you.