The Vanity Metric Paradox
In my time as a software engineer I’ve seen teams and groups succeed and fail, grow or be re-organized. One of the things that kept teams durable and stable over time, and helped them execute and deliver, was having a clear set of metrics to optimize.
Optimizing metrics is easy. No, really, it is. I believe that having clear, well-defined metrics is the single largest driver of progress within a company. When a team owns its metrics, it can make those metrics go “up-and-to-the-right”. The metrics bring clarity like nothing else. It’s a lot easier for us humans to think about the specific tasks of increasing certain types of engagement or decreasing negative events than to think about the abstract goal of “making it all better”. Without a way of measuring what’s better, we are lost. This is not something I figured out; it is all generally well understood, and there is plenty of literature and resources on the topic.
Over a long period of time, a team with metrics will make sizable progress towards those metrics. Rigorous analytics and experimentation can ensure that the product moves in the direction that the metrics M imply.
However, there is a failure mode of this kind of optimization. It is the “vanity metric” paradox.
“Any metric sufficiently optimized becomes a vanity metric”
I’ll explain why this is generally true, using the concept of “skin in the game” as defined by Nassim Taleb to illustrate it. Skin in the game is about being exposed to the downside of what you’re responsible for: not just the positive reward if you succeed, but also the negative consequences if you fail. For example, if you’re a software engineer, you are liable to get fired if you don’t deliver quality code. If you’re in a leadership position, you are responsible if your team fails.
The premise is that the lack of skin in the game corrupts, because it incentivizes taking too many risks: “red, I win; black, you lose”, especially in situations where the black is subtle and rare. In those cases the “you lose” is big and ruinous.
I’m also going to refer to a statement by Edward Tufte: “people and institutions cannot keep their own score accurately”. The way I understand it is that when there is a well-defined score, say a water pollution level, the people measuring the water should not be the ones responsible for keeping it clean, because they will fudge the numbers.
These two concepts have an interesting interplay, especially as applied to software companies. In a software company, over time, a bunch of smart engineers, product managers, designers, VPs and CEOs look at a product and think about the important ways to measure success. They look at how people use the product and how they extract value from it, and determine metrics which align users extracting value with the company succeeding. The reasoning is that if more users extract value, there’s more business to be done, and subsequently there will be more profits.
For simplicity, let’s assume we are talking about internal metrics, without an incentive to fudge the metric externally to get more funding or boost the share price.
The company has skin in those metrics. If the company regresses on these metrics, it might go bankrupt. At this time, each team in the company which owns a metric also has skin in that metric. If the team lets that metric deteriorate, it will be in trouble. At this early stage, there is alignment between the company as a whole and the individual teams within it. Their skin is in the same metrics. Let’s see how this alignment might break over time.
As a hypothetical scenario, let’s imagine Unscrew Inc, a company which specializes in corkscrews. Geraldine, the CEO, might insist that the company measure how many corkscrews it sells, and Preston, the CTO, might insist on a metric of how long it takes someone to open a bottle.
For months the engineering team is hard at work, collaborating with the sales team, on a new version of the corkscrew. Instead of having two handles on the side, they would simplify the design, creating a T-shaped corkscrew that provides more leverage to the person opening the bottle. Let’s imagine a customer, Ricardo, who likes to drink wine a couple of times a week with his family. He’d like a simple, cheap and easy-to-use corkscrew. The new T-shaped corkscrew fits all of these parameters, is cheaper than the complicated dual-handle corkscrew he used before, and he buys one.
Unscrew Inc sees the sale, which helps its sales metric grow, and the engineering team is happy because they’ve tested the T-shaped corkscrew: it took subjects about 8 seconds to open a bottle, compared to 13 with the dual-handle model. Great success! Preston is proud and gives a tech talk on their innovation at ScrewConf. Geraldine is glad, as the simpler design works well for sales, which are slightly up. She gives Preston and herself bonuses for a job well done.
Let’s pause here. So far Unscrew Inc has been optimizing two metrics: sales and ease of use. These two metrics make sense, as they align customer value with success for the company. You can say that Unscrew Inc has skin in the game with these metrics; that is, if these metrics drop, the company is likely to be hurt. So far, the bonuses seem justified. Good job, Geraldine and Preston!
Now let’s skip one year ahead. Preston’s team has started using a new type of plastic for the handle. The new plastic is more pliable, which allows for finger indents and makes the corkscrew easier for customers to hold. Users now take 7.5 seconds on average instead of 8. Preston publishes a blog post describing the 3D printing they used, twirls his vintage mustache and high-fives his team-mates. The new corkscrew is cheaper thanks to the plastic, so for the first quarter after its introduction, sales are up too. Geraldine gives a company-wide presentation, describing the bright future of the company and assuring everyone that the grass will only get greener. More bonuses, corporate parties and balloons.
Some quarters later, sales keep trending up until they eventually flat-line. It turns out that the new plastic corkscrews are not robust and break more often. Loyal customers came back to buy Unscrew’s corkscrews the first couple of times, but they no longer trust the brand. Meanwhile Preston’s team has been heads-down in R&D and developed a hollow screw, which is cheaper and lighter, and takes 7 seconds to open a bottle of wine. The new screw is even more likely to break than the previous one, but this is rarely seen in the lab, as the lab doesn’t test longevity, just ease of opening.
At this point, customers such as Ricardo don’t care if they’ll save another half second every time they open a bottle. They are frustrated that the damn screw keeps breaking, and they go with a competitor brand for their future corkscrew needs.
It takes some time for other customers to come to the same conclusion. Meanwhile they keep buying more and more of the flimsy corkscrews, increasing sales. But eventually they are fed up as well and no longer care about Unscrew. They feel screwed, having wasted too much time and money on corkscrews which break too often.
The customers feel betrayed. The gossip spreads, and suddenly no one wants to buy these corkscrews any more. Unscrew’s sales and stock price plummet, and Geraldine has to lay off 40% of the staff after pressure from the board. Preston is fired too, and he goes to work for a screwdriver company as VP of handle ergonomics, capitalizing on the expertise he developed during the last year. Geraldine is under a lot of pressure and worried that she might be replaced by another CEO.
So… what happened? The metrics got over-optimized, and some other necessary metrics were missing. Ease of opening was important to improve when they were building the clumsy dual-handle product. But later on, after switching to the sleek T-shaped one, ease of opening wasn’t as important. It turned from a metric the company had skin in into a vanity metric.
Let’s forget about Unscrew Inc for a bit and get back to the general case. In the general case, a given set of metrics M is only important and vital as long as certain assumptions A are true. We can say that M (ease of use) carries skin in the game depending on A (the dual-handle design is clumsy). M correlates with success and with providing value as long as the assumptions A are true.
Assumptions tend to change over time: market conditions, customer habits, product evolution, and so on. The team optimizing M usually isn’t privy to the assumptions A. And even if it is, it doesn’t care, because A isn’t in the OKRs or acknowledged during promotion cycles. Each employee on that team, whether high- or low-ranked, has little incentive to understand and preserve A. So the team goes and optimizes M, to the point where the product has changed so much that the original assumptions A are no longer true. Once A no longer holds, optimizing M will likely make the product worse. The company and the product no longer have skin in M.
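The dynamic above can be sketched as a toy simulation. All numbers and functions here are made up purely for illustration: the metric M (opening time) improves every cycle, while an unmeasured side effect (breakage) quietly erodes the true customer value the metric was supposed to proxy for.

```python
# Toy model of the vanity-metric paradox. M (opening time) is a good
# proxy for customer value only while the assumption A ("the product
# is clumsy to use") holds. Every quantity here is hypothetical.

def true_value(opening_time_s: float, breakage_rate: float) -> float:
    """Hypothetical customer value: faster opening helps a little,
    but a corkscrew that breaks destroys value quickly."""
    return max(0.0, 20.0 - opening_time_s - 50.0 * breakage_rate)

def optimize_metric(opening_time_s: float, breakage_rate: float):
    """One optimization cycle: shave 10% off opening time (the metric M),
    while durability degrades because the lab never measures it."""
    return opening_time_s * 0.9, breakage_rate + 0.02

opening_time, breakage = 13.0, 0.01  # clumsy but sturdy starting design
for cycle in range(10):
    opening_time, breakage = optimize_metric(opening_time, breakage)
    print(f"cycle {cycle}: M = {opening_time:.1f}s, "
          f"value = {true_value(opening_time, breakage):.1f}")
```

Run it and M decreases monotonically, while the true value rises for the first few cycles (A still holds) and then declines once the unmeasured breakage dominates: the metric has become a vanity metric.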
But the teams keep on improving M, because it is in their roadmaps and OKRs. For each team working on M, their skin is still in M. The manager and the employees are rewarded on how much they improve M and on how many things they ship that will help improve M in the future. They don’t want M to change, because they would need to adapt, scrap projects, increase their uncertainty, lose momentum, and risk getting fired or reorganized into different teams.
They’d rather keep running faster and faster in the wrong direction than take the time to look at the map, convince their herd to re-orient, and run away from the river full of crocodiles.
Building products is not science. What makes a good product usually depends on many factors that are subject to change and evolution. Science, on the other hand, is durable. A certain level of E. coli in water is as toxic to a human today as it was a hundred years ago, and as it will be a hundred years from now.
That’s why product and company metrics need to evolve and be re-thought on a regular basis. Every six months might be a good cadence.
Another example is financial performance. There are generally accepted accounting principles (GAAP). These have evolved over time, as corporate executives have figured out ways to game the metrics in order to make their companies appear better to investors, boost the stock price, and get larger bonuses. If you look at the evolution of GAAP, you will see that the rate of change is increasing. This is in line with Edward Tufte’s claim that people and institutions cannot keep their own score. It is also in line with the vanity metric paradox: financial instruments and investment opportunities are always changing, so GAAP needs to evolve to keep pace.
Any type of performance measure needs to be adjusted over time. Any score is game-able, or at least prone to getting outdated. Any metric sufficiently optimized becomes a vanity metric.
Even with the best intentions, people and institutions need to recognize that their metrics can go bad. Metrics are still the best organizing force in a company, but they need to be regularly re-thought and updated.