Welcome to the Museum of Meaningless Metrics

I recently encountered a comic depicting the “Museum of Meaningless Metrics”.

Its exhibits included:

Lines of Code
Story Points
Pull Requests
Tokens Spent

The newest and proudest exhibit was a towering token counter, presumably recording enough artificial intelligence activity to heat a modest apartment block.

The comic is funny because it captures a recurring organisational habit: when something important is difficult to measure, we measure something nearby instead.

Productivity is difficult to measure. Value is difficult to measure. Quality, good judgement, avoided mistakes and long-term maintainability are all difficult to measure.

Counting things, however, is wonderfully easy.

And once a number appears on a dashboard, it acquires an air of authority. It becomes a target, a performance indicator and eventually a presentation slide with an upward-pointing green arrow.

The problem is not that these metrics contain no information. Most of them do. The problem begins when they are promoted from operational signals into proof of productive work.

Exhibit One: Lines of Code

To a non-programmer, lines of code can look like a straightforward measure of output.

One developer writes 500 lines. Another writes 5,000. Clearly, the second developer has done ten times as much work.

That conclusion is roughly as reliable as judging an author by the weight of the manuscript.

A large codebase may genuinely represent a substantial amount of work. A complex application cannot always be produced in 12 elegant lines and a motivational comment. But the number of lines tells us very little about whether the work was necessary, correct or well designed.

Those 5,000 lines could represent:

a major new system;
duplicated logic;
an unnecessarily complicated solution;
generated boilerplate;
poorly understood requirements;
several fixes for problems introduced by earlier fixes.

Meanwhile, an experienced developer might spend a day studying the problem and submit a 20-line change that eliminates the need for hundreds of additional lines.

That does not mean fewer lines are always better. Code can be compressed until it resembles an ancient curse. Readability, maintainability and testing matter more than winning a code-golf tournament.

The meaningful questions are not “How many lines were written?” but:

Does the software solve the intended problem?
Is it reliable?
Can somebody else understand it?
Can it be changed safely later?
Did it reduce or introduce complexity?

Lines of code can help estimate the size of a codebase. They are poor evidence of an individual developer’s productivity.
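
To make the point concrete, here is a minimal Python sketch. Both implementations of `total_even` are invented for this illustration, and `count_lines` is a deliberately crude stand-in for a lines-of-code metric. The two versions behave identically, yet one would score several times higher on a dashboard.

```python
# Two hypothetical implementations of the same behaviour, held as source
# strings so their lines can be counted. Both are invented for illustration.
VERBOSE_SOURCE = '''
def total_even(numbers):
    result = 0
    index = 0
    while index < len(numbers):
        value = numbers[index]
        if value % 2 == 0:
            result = result + value
        index = index + 1
    return result
'''

CONCISE_SOURCE = '''
def total_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)
'''


def count_lines(source: str) -> int:
    """Count non-blank lines, a crude stand-in for a lines-of-code metric."""
    return sum(1 for line in source.splitlines() if line.strip())


def load_function(source: str):
    """Execute the source in a fresh namespace and return total_even."""
    namespace = {}
    exec(source, namespace)
    return namespace["total_even"]


verbose = load_function(VERBOSE_SOURCE)
concise = load_function(CONCISE_SOURCE)
sample = [1, 2, 3, 4, 10]

# Line counts differ widely; behaviour does not.
print(count_lines(VERBOSE_SOURCE), count_lines(CONCISE_SOURCE))
print(verbose(sample) == concise(sample))
```

Neither version is automatically the better one. The sketch only shows that the line count and the behaviour delivered are separate things.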

Exhibit Two: Story Points

I have participated in sprint-estimation exercises and have also been asked to assign story points.

In theory, story points provide a relative indication of a task’s size. They can incorporate effort, complexity, uncertainty and risk. A task assigned eight points is not necessarily expected to take eight hours or eight days. It is simply considered substantially larger or less predictable than a task assigned two points.

The Fibonacci-like sequence (1, 2, 3, 5, 8, 13 and so on) is intended to discourage fake precision. As tasks become larger, our confidence in estimating them usually decreases. The widening gaps between the available numbers force teams to acknowledge that uncertainty.

That is the theory.

In practice, the exercise can sometimes feel like a group of people cautiously plucking numbers from the sky while discreetly checking what everyone else has plucked.

Developers and requestors may view the same task very differently. One sees a minor form change. The other sees legacy code, undocumented dependencies, six integration points and a database that becomes emotionally unstable after 4 p.m.

Ideally, disagreement exposes hidden assumptions. Someone who estimates a task as two points may know something the person estimating eight does not. The discussion is supposed to produce shared understanding.

But the process breaks down when:

people anchor their estimates to the first number mentioned;
participants follow the majority to avoid defending a different view;
developers agree on the answer beforehand;
management treats points as hours in disguise;
velocity becomes a performance target;
teams inflate estimates to make their output appear larger.

At that stage, story points stop being a conversation aid and become ceremonial arithmetic.

The points themselves were never meant to be objective units. An eight-point task for one team cannot reliably be compared with an eight-point task for another. Even within the same team, the meaning may drift as the people, technology and work change.

Story points can help a stable team forecast its own capacity. They become dangerous when outsiders mistake them for a universal measurement of productivity.

Exhibit Three: Pull Requests

Pull requests are an important part of modern software development. They allow proposed changes to be reviewed, discussed, tested and improved before being merged into the main codebase.

That does not make the number of pull requests a meaningful productivity score.

A developer who submits many pull requests may be working in a disciplined way, breaking changes into small and reviewable units. Another developer might submit fewer pull requests because the work involves longer-term research, architecture or debugging.

Or the high count could come from:

repeated corrections;
failed implementations;
unnecessary fragmentation;
automated dependency updates;
trivial formatting changes;
fixes for problems introduced by earlier pull requests.

Conversely, a low number could represent careful, high-impact work, or a single 18,000-line pull request that causes every reviewer to suddenly develop an urgent dental appointment.

The count does not tell us whether the changes were useful, safe or even necessary.

Better questions include:

How quickly are useful changes delivered?
How often do changes introduce defects?
Are reviews substantive or merely ceremonial?
Are pull requests reasonably sized?
Does the team learn from incidents and review feedback?
Are customers or internal users seeing meaningful improvements?

Pull requests are a workflow mechanism. Counting them is like measuring a restaurant by the number of plates that passed through the kitchen.

Interesting, perhaps. Conclusive, no.

Exhibit Four: Tokens Spent

Now we arrive at the museum’s newest attraction: tokens spent.

As organisations adopt generative AI, token counts have become the latest impressively large numbers available for dashboards.

Someone used ten million tokens.

Excellent.

What happened?

Did those tokens produce a working application, complete useful research, resolve customer cases or automate a tedious process?

Or did an AI agent repeatedly inspect the same repository, misunderstand the task, rewrite its own plan six times and confidently announce that it had completed work for which it did not even possess the required credentials?

The token count cannot tell us.

Tokens are units of text processed or generated by an AI model. They matter for practical reasons:

expenditure;
capacity planning;
latency;
model selection;
efficiency optimisation;
detecting unexpectedly wasteful workflows.

But tokens spent are an input cost, not an output measure.

Using more tokens does not necessarily mean that more reasoning occurred. It may mean that the task was complex, the context was large or extensive validation was performed. It may also mean that the prompt was inefficient, the agent became stuck in a loop or the system generated vast quantities of polished nonsense.

Similarly, minimising tokens is not automatically desirable. Spending additional tokens to verify an important result may be entirely justified. A cheap but incorrect answer is not efficient. It is merely an error delivered at a discount.

The more meaningful measures depend on the objective:

Was the task actually completed?
Was the result accurate?
How much human correction was required?
How much time or money was saved?
Was the outcome independently verified?
Did the system avoid inventing actions, identifiers or results?
What was the cost per successful outcome?
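
The last question can be sketched as simple arithmetic. The Python below is an illustration under stated assumptions: the `Run` record, the `cost_per_success` helper, the price of 5.0 per million tokens and all of the usage figures are invented, not taken from any real system or price list.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One AI task attempt. Hypothetical fields, not a real logging schema."""
    tokens: int
    succeeded: bool  # assumed to mean the result was independently verified


def cost_per_success(runs, price_per_million_tokens):
    """Total spend divided by the number of successful runs.

    Returns None when nothing succeeded, because the ratio is undefined.
    """
    total_tokens = sum(run.tokens for run in runs)
    successes = sum(1 for run in runs if run.succeeded)
    if successes == 0:
        return None
    total_cost = total_tokens / 1_000_000 * price_per_million_tokens
    return total_cost / successes


# Invented figures. Workflow A burns 10 million tokens and every run succeeds.
workflow_a = [Run(tokens=2_500_000, succeeded=True) for _ in range(4)]
# Workflow B burns only 4 million tokens, but just one of ten runs succeeds.
workflow_b = [Run(tokens=400_000, succeeded=(i == 0)) for i in range(10)]

PRICE = 5.0  # assumed price per million tokens, in arbitrary currency units

print(cost_per_success(workflow_a, PRICE))
print(cost_per_success(workflow_b, PRICE))
```

With these made-up numbers, the workflow that spends more tokens is the cheaper one per successful outcome. Real figures could just as easily point the other way, which is why the raw token total settles nothing.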

“Tokens spent” might earn bragging rights among people with generous compute budgets. It does not, by itself, demonstrate useful work.

When the Metric Becomes the Mission

The deeper issue behind all four exhibits is not measurement itself. Measurement is necessary.

The trouble begins when a proxy becomes a target.

When developers are rewarded for lines of code, codebases grow.

When teams are rewarded for completing story points, estimates mysteriously expand.

When pull-request counts become visible performance measures, work gets divided into more pull requests.

When token usage is presented as proof of AI adoption, systems find remarkably creative ways to consume tokens.

People adapt to the measurement system placed around them. This is not necessarily dishonesty. It is often rational behaviour. If an organisation repeatedly signals that a number matters, employees will naturally optimise for that number.

Unfortunately, the number may become healthier while the actual outcome deteriorates.
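
The story-point version of this effect can be sketched in a few lines of Python. Everything here is hypothetical: the team, the estimates and the `inflate` helper exist only to illustrate the mechanism. The team finishes the same five tasks every sprint, but each estimate drifts one step up the scale once velocity becomes a target.

```python
# The estimation scale described earlier in the article.
SCALE = [1, 2, 3, 5, 8, 13]


def inflate(points: int) -> int:
    """Move an estimate one step up the scale, capped at the top value."""
    position = SCALE.index(points)
    return SCALE[min(position + 1, len(SCALE) - 1)]


# Invented starting estimates for five tasks of unchanged real size.
honest_estimates = [2, 3, 3, 5, 8]

sprints = [honest_estimates]
for _ in range(2):
    # Same work each sprint; only the numbers attached to it grow.
    sprints.append([inflate(p) for p in sprints[-1]])

velocities = [sum(sprint) for sprint in sprints]
tasks_done = [len(sprint) for sprint in sprints]

print(velocities)  # reported velocity per sprint
print(tasks_done)  # tasks actually completed per sprint
```

Reported velocity more than doubles across the three sprints while the number of completed tasks stays at five. The chart improves; the delivery does not.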

What Should We Measure Instead?

There is no single perfect replacement metric. That is precisely why organisations keep returning to convenient counts.

Useful measurement normally requires a combination of quantitative data and human judgement.

Depending on the work, better indicators might include:

successful outcomes delivered;
user or customer impact;
reliability and defect rates;
lead time from request to usable result;
frequency and severity of production incidents;
rework required;
maintainability;
cost per successful task;
quality of documentation and knowledge transfer;
whether the original problem was actually solved.

Even these can be manipulated or misunderstood. No dashboard eliminates the need for context.

A metric should begin a conversation, not end one.

The Museum Is Still Expanding

Lines of code, story points, pull requests and tokens spent are not entirely useless. Each can answer a narrow operational question.

How large is this repository?

How does this team perceive the relative size of upcoming work?

How are changes flowing through the review process?

How much AI capacity are we consuming?

Those are legitimate questions.

The absurdity begins when the answers are repurposed to answer much larger questions:

Who worked hardest?

Which team is most productive?

Who created the most value?

Is our AI programme successful?

Numbers cannot answer those questions simply because they are available, countable and aesthetically compatible with a bar chart.

So perhaps the comic’s museum label is slightly unfair.

These are not always meaningless metrics.

They are metrics from which we frequently demand meaning they were never capable of providing.

And judging by the rate at which dashboards are being built, the museum gift shop should be extremely profitable.
