We know our developer productivity metrics are gamed, but we don't know what to measure instead

May 7, 2026

Key Takeaways

Most developer productivity metrics are flawed because they measure visible output (commits, story points, lines of code) instead of the judgment and decision-making that create real engineering value

Engineering is fundamentally cognitive work—strong developers create impact through tradeoffs, debugging, architecture decisions, mentorship, and preventing problems before they happen

The best indicators of engineering effectiveness are observable behaviors: structured thinking, explanation quality, tool fluency, error recovery, and sound technical judgment under ambiguity

Direct observation consistently outperforms dashboard metrics—watching how engineers debug, review code, communicate, and solve problems reveals far more than quantitative KPIs

The solution to gamed engineering metrics isn’t better metrics—it’s shifting from measuring proxies to evaluating real work, thought processes, and decision-making in realistic scenarios

You know your metrics are being gamed. Lines of code? Everyone writes bloated functions. Commits per day? Devs split work into artificially small commits. Story points? Your team inflates estimates in sprint planning. You've seen the Goodhart's Law meme enough times to be numb to it. The real problem isn't that metrics get gamed; it's that you're measuring the wrong thing entirely. You're quantifying output when what actually matters is judgment.

The metric theatre that we're all performing

Here's what most engineering teams measure today:

Velocity: Story points completed per sprint
Throughput: PRs merged, tickets closed, features shipped
Code quality proxies: Test coverage, lint warnings, code review turnaround
Activity signals: Commits, lines changed, time in IDE

Every single one rewards motion, not progress. Your senior engineer who spent three days preventing a catastrophic architecture decision shows up as "low productivity" because they wrote 47 lines of code. Meanwhile, the junior who copy-pasted 2,000 lines from Stack Overflow looks like your star performer.

The uncomfortable truth: these metrics were never designed to measure good engineering. They were designed to give managers something to put in spreadsheets.
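
To make the mis-ranking above concrete, here is a minimal sketch of what an activity-based score does with those two engineers. Everything in it is hypothetical: the Contribution type, the activity_score function, and the prevented_major_incident field are illustrative names, not part of any real tool. The field is there only to show information that a commit log never records.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    engineer: str
    lines_changed: int
    # Hypothetical field: no commit log or ticket tracker records this.
    prevented_major_incident: bool


def activity_score(c: Contribution) -> int:
    """Naive activity metric: more lines changed means a higher score."""
    return c.lines_changed


def rank_by_activity(contributions: list[Contribution]) -> list[str]:
    """Return engineer names ordered from highest to lowest activity score."""
    ordered = sorted(contributions, key=activity_score, reverse=True)
    return [c.engineer for c in ordered]


# The two engineers from the example above.
contributions = [
    Contribution("senior", lines_changed=47, prevented_major_incident=True),
    Contribution("junior", lines_changed=2000, prevented_major_incident=False),
]

ranking = rank_by_activity(contributions)
print(ranking)  # ['junior', 'senior']
```

The score never reads prevented_major_incident, so the most valuable contribution in the list has no way to affect the ranking.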

Why everything breaks down when you try to quantify thinking

The core issue is a category error. Developer productivity isn't assembly-line work; it's decision-making work.

Consider what actually makes an engineer valuable:

Choosing the right approach before writing code
Knowing when not to build something
Asking about constraints nobody mentioned
Explaining trade-offs to non-technical stakeholders
Debugging production issues by reasoning about system behavior
Preventing future technical debt through upfront design

None of this shows up in your Jira dashboard. None of it increases your GitHub contribution graph. A developer who says "we shouldn't build this feature because it'll create 18 months of maintenance burden" is performing an act of extraordinary value—and your metrics will score it as zero.

The observation problem: you can't measure what you can't see

Traditional productivity metrics fail because they measure artifacts (code, commits, tickets) rather than the cognitive work that produces good artifacts. It's like judging a surgeon's performance by counting incisions instead of tracking patient outcomes.

Here's what a productive day for a senior engineer can actually look like:

Morning: Spent 90 minutes reading through a pull request, asked three clarifying questions about edge cases that would've caused production bugs

Midday: Pair-programmed with junior dev, taught them why their approach would scale poorly, walked through better patterns

Afternoon: Debugged flaky test, discovered root cause was race condition in CI pipeline, documented findings

Evening: Reviewed architecture proposal, identified missing observability concerns, suggested incremental rollout strategy

Total lines of code written: 12. Total commits: 0. Total story points: 0. Total value created: immeasurable.

What actually predicts engineering success

Watch hundreds of engineers work, through direct observation rather than metrics, and patterns emerge. The strongest engineers consistently demonstrate:

Structured thinking under ambiguity: They ask about requirements, constraints, and edge cases before touching code
Explanation quality: They can walk you through their decisions and trade-offs clearly
Tool fluency: They know when to use AI, when to read docs, when to grep the codebase
Error recovery: They don't panic when something breaks; they methodically narrow the problem space
Taste: They make good judgment calls about complexity, maintainability, and user impact

None of these are quantifiable in a dashboard. All of them are observable if you watch someone work.

The measurement you're actually looking for

Stop trying to metric-ify everything. Start observing behavior directly.

Instead of: Tracking commits per week

Do this: Watch how a developer approaches an unfamiliar bug. Do they read error messages carefully? Do they form hypotheses? Do they verify assumptions?

Instead of: Measuring code review speed

Do this: Read what developers write in code reviews. Are they catching logic errors? Suggesting better patterns? Teaching junior engineers?

Instead of: Counting completed tickets

Do this: Look at production incidents. Who prevents fires? Who debugs them effectively? Who writes the postmortem that actually prevents recurrence?

Instead of: Surveying for "engagement"

Do this: Notice who asks the right questions in architecture meetings. Who spots the constraints everyone else missed? Who simplifies complexity instead of adding to it?

This doesn't scale the way a Jira plugin scales. It requires judgment from technical leaders. That's the point—developer productivity is a judgment problem, not a counting problem.

The real answer nobody wants to hear

You can't measure developer productivity with a number because productivity isn't a quantity; it's a quality. The best proxy you have is technical leaders who've done the work themselves, spending time observing how people work rather than what they produce.

Your metrics aren't broken because you chose the wrong KPIs. They're broken because you're trying to reduce human judgment to dashboards. The solution isn't better metrics. It's better observation.

When pilots are evaluated, they're put in flight simulators where instructors watch them fly. The instructors don't count how many buttons they press. They don't measure joystick velocity. They watch decision-making under realistic conditions.

Your hiring process should work the same way. Your performance evaluation should work the same way. If you're not watching people work—actually observing their thought process, decisions, and trade-offs in realistic scenarios—you're just counting proxies and wondering why they correlate poorly with actual engineering ability.

The answer to gamed metrics isn't different metrics. It's direct observation of the work itself.