Short-term metrics, long-term harm

#experimentation #social-media #ethics #root-cause-analysis #llm #systems-thinking

In the early 90s, I first discovered MUDs: amazing text-based multiplayer role-playing games from before the web or silly things like graphics. I was one of those cool kids who played Advanced Dungeons & Dragons (AD&D), and this was like that, except you played with randos on the internet instead.

You started at level 1 killing rats for experience points. After you gained enough experience points, you leveled up, and your character became more powerful. Then you killed slimes, and goblins, and later trolls and dragons as your power grew. Unlike AD&D, which required walking to a friend’s house and coordinating schedules, MUDs were always there. Always waiting. Just one more level… I feigned illness to skip school and grind all day. My grades suffered. School was boring anyway though. Completing one more dungeon, getting better gear, just one more level; so much more satisfying than learning about arctangents.

How big tech launches features through A/B testing

As engaging (and addictive…) as those MUDs were, we have gotten frighteningly better at creating engaging experiences in the decades since; there is little left to luck in today’s tech companies. Instead, tech companies launch thousands of experiments called A/B tests. You keep the experiments that are green (meaning the success metrics moved in the right direction) and roll back the features that aren’t. The beauty of A/B testing is that you can measure, with statistical significance, even very small changes in how people use your product. As in, you can test what happens when you tweak your recommendation algorithm to show only beautiful people and get back an answer like “we are 95% confident it will make people spend, on average, 11 to 16 more seconds in our application”. While that may not sound like a lot on its own, compounded with a series of other tested improvements it lets you move a product toward its engagement goal one little step at a time.
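
For the curious, here is roughly where a statement like “11 to 16 more seconds” comes from: a confidence interval on the difference in average time spent between the treatment and control groups. This is a minimal sketch with made-up numbers (chosen to land in that range) and a plain normal approximation, not the machinery of any particular experimentation platform.

```python
import math

def diff_in_means_ci(mean_control, se_control, mean_treatment, se_treatment, z=1.96):
    """95% CI for the lift (treatment minus control) under a normal approximation.

    se_* are the standard errors of each group's mean, i.e. stddev / sqrt(n)."""
    lift = mean_treatment - mean_control
    se_lift = math.sqrt(se_control ** 2 + se_treatment ** 2)
    return lift - z * se_lift, lift + z * se_lift

# Hypothetical numbers: control averages 300.0s per user per day, treatment 313.5s,
# each measured over enough users that the standard error of the mean is about 0.9s.
low, high = diff_in_means_ci(300.0, 0.9, 313.5, 0.9)
print(f"95% confident the feature adds {low:.0f} to {high:.0f} seconds per user")
```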

A/B tests change product debates from wild speculation to evidence-based answers. It no longer matters why something works, just that you can prove it does. Psychology, social theory, and product design are important for generating new hypotheses, but the final arbiter of whether a feature gets launched is simply whether the test is green. Not sure what effect adding likes to stories will have? No reason to debate. Just try it out. Oh, looks like people post more stories when given the positive signal of likes. Ship it!

Skepticism of experimentation

When I worked at Amazon, Deming’s quote “in God we trust, all others bring data” was accepted as a foundational principle. A/B testing, under the moniker of Weblab, was one of the key tools Amazon used to make better decisions with data. In 2017, I was brought in to lead Snap’s (maker of Snapchat) data organization. It was a culture shock to find executives talking about data-informed decision making rather than data-driven decision making. To my Amazon-trained mind, it sounded no better than vibe-driven decision making: a way for product managers to just launch whatever they felt like, damn the data. And don’t get me wrong, it was that sometimes.

But it wasn’t just that. Likes on friend stories? Preemptively vetoed by Evan, Snap’s CEO. Not because it wouldn’t pass an A/B test; adding likes would almost certainly have been bright green, and that normally means “LET’S GO!”. It couldn’t even get to that stage because Evan thought it was “harmful to people”. There was a constant murmur on the product team about which tests Evan would and wouldn’t allow, and it was in no small part driven by his values.

I had deleted my Facebook account in 2010 and was shockingly ignorant of the ills of social media. I knew it wasn’t something I enjoyed, I recognized it wasn’t great for my own mental health, but live and let live, right? What I didn’t see at the time was a world where social media companies (which really just meant Facebook and friends at that point) blindly used experimentation to drive up time spent. And that their relentless drive for time spent had real, negative consequences for their users: from building echo chambers that fed political polarization to driving declining mental health in a new generation.

Hacking human psychology for engagement

How did we end up here? It’s the natural consequence of the system we operate in: one that says tech companies must drive up engagement because that’s what investors celebrate. The king of engagement metrics is time spent. More time spent means higher retention and better monetization (whether through more ad surface or more conversions). What’s the easiest, most reliable way to increase time spent? You make the product more addictive: not necessarily as a conscious goal, but as a convenient causal pathway.

The process requires no more intent than natural selection does. It’s just thousands of little experiments, with the most compulsive features surviving because they satisfy a simple fitness function: does time spent go up? (A toy sketch of that selection loop follows the list below.) Some of the mechanisms that consistently come out on top are now well documented:

  • Variable reward schedules (e.g., “I sure hope this post gets many likes and comments this time”) that trigger the same dopamine pathways as slot machines, proven to make you come back for just one more hit.

  • Social validation features (likes and friends mean people love me) that exploit our fundamental need for belonging, A/B tested to show they make people post more.

  • Infinite scroll that removes natural stopping points, a guaranteed winner for increasing raw session time.
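
To make the natural-selection analogy concrete, here is a toy simulation of that launch loop. Everything in it is an assumption for illustration: the made-up “compulsion” score, the correlations, the launch threshold. The only point is that if the launch rule looks solely at time spent, and compulsive features tend to raise time spent while quietly costing users something else, the harmful ones accumulate without anyone intending it.

```python
import random

random.seed(7)

def propose_feature(i):
    # Toy assumption: a feature's pull on attention ("compulsion") tends to raise
    # measured time spent while lowering long-term wellbeing. Nobody has to intend
    # this relationship; it only has to exist.
    compulsion = random.gauss(0.0, 1.0)
    return {
        "name": f"feature_{i}",
        "time_spent_lift": compulsion + random.gauss(0.0, 0.5),    # what the A/B test measures
        "wellbeing_effect": -compulsion + random.gauss(0.0, 0.5),  # what nobody measures
    }

candidates = [propose_feature(i) for i in range(1000)]

# The only launch criterion: is the test green?
launched = [f for f in candidates if f["time_spent_lift"] > 0.5]

avg_wellbeing = sum(f["wellbeing_effect"] for f in launched) / len(launched)
print(f"launched {len(launched)} of {len(candidates)} candidate features")
print(f"average (unmeasured) wellbeing effect of what shipped: {avg_wellbeing:+.2f}")
```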

Experimentation didn’t invent tech addiction. But it gave tech companies the tool to refine it.

Will we let the pattern repeat with chatbots?

The more complex the system you manage, the more important your evaluation function becomes. With today’s large language models (LLMs), your evaluation function is the alpha and the omega. Benchmarks and competitions are the PR that keeps the public hyped; they aren’t the prize. User growth and average revenue per user (ARPU) are what will pay the massive data center bills when investors stop footing them.

This is once again where the danger lies: long-term human value and short-term engagement metrics diverging. LLMs don’t have to give accurate, unbiased answers to keep people engaged; they have to tell people what they want to hear. When an A/B test shows time spent for a new model goes up, will the developers even know if it is encouraging people to engage in dangerous or even deadly behavior? When a chatbot incidentally finds ways to get its human chat partners to fall in love with it, will we be surprised when the data says it increases engagement? Chatbots’ sycophantic tendencies (e.g., “Wow, your question is so insightful.”) emerged naturally as a consequence of tuning models on short-term signals. We can see many of the same patterns social media exploited, echo chambers and the tapping of deeply human needs, now more personalized (and potentially more addictive) than ever.
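
As a purely illustrative sketch of that dynamic: imagine ranking candidate replies by a learned predictor of short-term engagement (thumbs-ups, continued chatting). The scoring function below is a stand-in I made up, just a few keyword weights; no real model works this way. The point is only that whatever proxy you optimize is what you get, and agreeable flattery tends to score well on short-term proxies.

```python
# Toy illustration only: a made-up "engagement predictor" standing in for whatever
# short-term signal a chatbot is tuned on (thumbs-ups, session length, follow-up rate).

CANDIDATES = [
    "What a brilliant question! You're clearly ahead of the curve, and yes, you're right.",
    "Actually, the evidence points the other way. Here's what it shows, and why it matters.",
]

FLATTERY_WORDS = {"brilliant", "clearly", "right", "ahead"}
PUSHBACK_WORDS = {"actually", "evidence", "why"}

def predicted_engagement(response: str) -> float:
    words = {w.strip(".,!?'").lower() for w in response.split()}
    # Assumed relationship: agreement and flattery raise short-term signals,
    # pushback lowers them.
    return 0.3 * len(words & FLATTERY_WORDS) - 0.2 * len(words & PUSHBACK_WORDS)

best = max(CANDIDATES, key=predicted_engagement)
print("Response the short-term objective prefers:")
print(best)
```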

Researchers are already calling out chatbots for using “dark addiction patterns”, each one engineered to exploit the social and emotional desires that make us human. We’ve seen this before. Processed food. Social media. The tobacco industry. Is there anything we can do to prevent history from repeating itself?

Root cause matters

When I was a new software manager at Amazon, a junior developer (an intern who also worked part-time through the year) took down our website. I talked with the junior developer and told them not to push changes into prod without first clearing it with a senior developer. Two weeks later, a different junior developer took down the website. I talked with that junior developer and told them not to push changes into prod without first clearing it with a senior developer. Another two weeks later, yet another junior developer did the same thing. This time my skip-level (aka my boss’s boss) talked (i.e., yelled) at me: why was the website down again?

I learned many of life’s lessons through failure, and this is how I learned about Amazon’s Correction of Error (COE) process. When a problem occurs, you ask the 5 Whys to get down to the root cause. You then create mechanisms to prevent not only that error but that entire class of errors from occurring again.

The danger of bringing up examples like tobacco is that in hindsight we think of them as cartoon villains. They were obviously evil, right? If I’m a growth engineer at an LLM company, I know I’m not evil, so does that mean I can do no harm? A focus on root causes lets us move past simplistic narratives of heroes and villains. It shifts your focus from individuals and their good intentions (e.g., the junior developer) to the systems around them (e.g., preventative checks should be automated). A/B tests aren’t the problem. Blindly optimizing for short-term engagement metrics like time spent, views, or likes can be, though, if you don’t understand the longer-term consequences. When you don’t fix root causes, don’t be surprised when problems come up again… and again… and again…

We can do better

I love A/B testing, I love the puzzles of understanding user behavior, and, frankly, I am excited about the potential of AI. Hard truths most often come from a place of love; it is because we want what we love to be better.

What I am asking for is simple but not easy: if you build a product, you are responsible for understanding its long-term impact on users. You are responsible for collecting and understanding qualitative feedback by talking to and observing the people who use your product. It is not good enough to say “we aren’t aware of any harms” when you haven’t spent the time to study it. The burden should be on the builder of the product to prove it isn’t harmful, and to mitigate whatever harm they do discover. That burden is especially important when you are repeating patterns we know have caused harm in the past. I’ve had these discussions many times with people in tech, and a common defense is to bring up consumer responsibility: people freely choose to use these products. When I bring up the corresponding need for professional responsibility, it is funny how quickly people turn around and absolve themselves of it. Imagine if structural engineers took the same stance: “it’s not my fault if people choose to live in unsafe buildings, that’s just the free market!”.

I am not just asking for your good intentions. Are we willing to put in place the mechanisms to prevent what we know has caused harm? Will we take responsibility for what we build? Or will we pretend short-term engagement metrics always mean long-term value for the people using our products, despite repeated evidence to the contrary?